Biomedical KG suggestions

Hi everyone,
I’m Meera, a research scholar currently working on building a biomedical KG using Neo4j. I’m new to this area and have been working on the project for the past two months. I’d really appreciate any feedback or guidance on whether I’m heading in the right direction, or if there are better approaches I should consider.

Current status:
  • Entity types: 20
  • Relation types: 48
  • Nodes: 144,119
  • Relationships: 6,671,474
  • Platform: Neo4j (loading via Python scripts and Cypher)
I am trying to map each entity type to a standard ID system throughout, and I have pulled data from different sources such as DrugBank, BioSNAP, CTD, PharmGKB, TISSUES DB, STRING, HPO, STITCH, SIDER, Reactome, etc.

I have some queries:

  • Mapping diverse IDs to unified identifiers is very time-consuming. Are there any tools, services, or workflows that help with this at scale?
  • I worry about duplicate relationships or inconsistent entity mappings. How do others handle KG validation or sanity checks for large graphs?
  • I’m currently building a general-purpose biomedical KG that could be used as ground truth. Would it be better to focus on building smaller, use-case–driven subgraphs first?
  • Should I be using indexes or constraints differently to improve query performance?

Many times, I find myself doubting whether I’m on the right path. I’d appreciate honest feedback on whether I’m going in the right direction, suggestions on best practices, tools, or papers I should look at, and any tips from anyone who’s worked on similar biomedical graph projects.
Thank you so much in advance!
Meera

  • Mapping diverse IDs to unified identifiers is very time-consuming. Any tools, services, or workflows that help with this at scale?

I don't fully grasp what this means: what is your "mapping" process, what is the purpose of a "unified identifier", and how often will this take place?

  • I worry about duplicate relationships or inconsistent entity mappings. How do others handle KG validation or sanity checks for large graphs?

You could use MERGE instead of CREATE when building the graph: MERGE matches an existing node or relationship with the same pattern before creating a new one, so it "semantically" coalesces to the element you intend, minimising duplicates.
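For example, a minimal sketch of the difference (the labels and property names here are illustrative, not from your actual schema):

```cypher
// CREATE always adds a new node, even if one with the same ID
// already exists — running a load script twice doubles the graph.
// MERGE matches on the whole pattern first and only creates on a miss.
// Merge on the unique key alone, then set the remaining properties:
MERGE (g:Gene {uniprot_id: "P04637"})
  ON CREATE SET g.symbol = "TP53"

// The same idea for relationships, so re-running a load script
// does not duplicate edges:
MATCH (g:Gene {uniprot_id: "P04637"}),
      (d:Disease {mondo_id: "MONDO:0007254"})
MERGE (g)-[:ASSOCIATED_WITH]->(d)
```

Note that MERGE matches on the whole pattern you give it, so merging on the unique key and setting other properties with `ON CREATE SET` / `ON MATCH SET` is usually safer than putting every property inside the MERGE pattern.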

  • I’m currently building a general-purpose biomedical KG that could be used as ground truth. Would it be better to focus on building smaller, use-case–driven subgraphs first?

Probably, though there's no simple answer for this. Smaller, use-case-driven subgraphs are easier to validate against a concrete question, and you can always merge them into a larger graph later.

  • Should I be using indexes or constraints differently to improve query performance?

Yes, this will improve performance significantly. Uniqueness constraints on your entity IDs (which also create an index behind the scenes) speed up both MERGE-based loads and lookups, and plain indexes help for properties you match on frequently.
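As a sketch, assuming a `Gene` label keyed on `uniprot_id` and a `Disease` label with a `name` property (adapt to your own schema):

```cypher
// A uniqueness constraint doubles as an index and guards against
// duplicate entities slipping in during MERGE-based loads:
CREATE CONSTRAINT gene_uniprot IF NOT EXISTS
FOR (g:Gene) REQUIRE g.uniprot_id IS UNIQUE;

// A plain index for properties you match on that are not unique:
CREATE INDEX disease_name IF NOT EXISTS
FOR (d:Disease) ON (d.name);
```

This is Neo4j 5.x syntax; 4.x used the older `CREATE CONSTRAINT ON (g:Gene) ASSERT g.uniprot_id IS UNIQUE` form, so check which version you are running.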

Thank you, Josh, for your reply.
By "mapping," I mean the process of converting various IDs or names used for the same entity into a single, standardized format. For example, in the case of the entity type Gene, I have collected relationships such as gene–haplotype, gene–disease, gene–chemical etc... from multiple databases. These databases often use different identifiers for the same gene—such as NCBI Geneid, UniProtid, HGNCid, or Ensemblid.

To avoid confusion and ensure consistency across the graph, I’ve mapped all gene identifiers to their corresponding UniProt ID. This allows the knowledge graph to represent each gene uniquely, no matter which data source it came from. This unified identifier makes the graph easier to interpret, query, and integrate with other tools or datasets.
As for how often this mapping happens: it was done during the initial integration phase, but I plan to repeat it whenever I add new data or update existing sources.

For that, since it is infrequent and may require customisation, you're probably looking at a custom script that transforms the data before loading it into the graph.

Yes, I use a custom Python script to handle the mapping during data preprocessing, before loading into Neo4j. Since it’s not needed often, I run it only when adding or updating data. In the future, I may turn it into a small pipeline or utility to make it easier to reuse.
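For reference, here is a minimal sketch of that kind of preprocessing step. It assumes a simple dict-based ID map and a list of `(gene_id, relation, target)` edge tuples; the `load_id_map` and `map_gene_ids` helpers, the file layout, and the example IDs are all hypothetical, not from Meera's actual pipeline:

```python
import csv
from typing import Dict, List, Tuple

def load_id_map(path: str) -> Dict[str, str]:
    """Load a two-column TSV (source_id <TAB> uniprot_id) into a dict."""
    mapping = {}
    with open(path, newline="") as fh:
        for source_id, uniprot_id in csv.reader(fh, delimiter="\t"):
            mapping[source_id] = uniprot_id
    return mapping

def map_gene_ids(
    edges: List[Tuple[str, str, str]],
    id_map: Dict[str, str],
) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Rewrite the gene column of (gene_id, relation, target) edges to
    UniProt IDs, collecting source IDs that could not be mapped."""
    mapped, unmapped = [], []
    seen = set()
    for gene_id, relation, target in edges:
        uid = id_map.get(gene_id)
        if uid is None:
            unmapped.append(gene_id)   # report, don't silently drop
            continue
        edge = (uid, relation, target)
        if edge not in seen:           # drop duplicate edges up front
            seen.add(edge)
            mapped.append(edge)
    return mapped, unmapped

# Example: two source IDs resolve to the same UniProt entry, so the
# duplicate edge collapses; one unknown ID gets reported.
id_map = {"ENSG00000141510": "P04637", "7157": "P04637"}
edges = [
    ("ENSG00000141510", "gene-disease", "MONDO:0007254"),
    ("7157", "gene-disease", "MONDO:0007254"),
    ("BADID", "gene-disease", "MONDO:0000001"),
]
mapped, unmapped = map_gene_ids(edges, id_map)
print(mapped)    # [('P04637', 'gene-disease', 'MONDO:0007254')]
print(unmapped)  # ['BADID']
```

Keeping the unmapped IDs as an explicit output also gives you a cheap sanity check for each load: if the unmapped list suddenly grows after a source update, something changed in that source's ID space.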