Biomedical KG suggestions

Hi everyone,
I’m Meera, a research scholar currently working on building a biomedical KG using Neo4j. I’m new to this area and have been working on the project for the past two months. I’d really appreciate any feedback or guidance on whether I’m heading in the right direction, or if there are better approaches I should consider.

Current status:
  • Entity types: 20
  • Relation types: 48
  • Nodes: 144,119
  • Relationships: 6,671,474
  • Platform: Neo4j (loading via Python scripts and Cypher)
I am trying to map each entity type to a standard ID system throughout, and I have pulled data from different sources such as DrugBank, BioSNAP, CTD, PharmGKB, TISSUES DB, STRING, HPO, STITCH, SIDER, Reactome, etc.

I have some queries:

  • Mapping diverse IDs to unified identifiers is very time-consuming. Are there any tools, services, or workflows that help with this at scale?
  • I worry about duplicate relationships or inconsistent entity mappings. How do others handle KG validation or sanity checks for large graphs?
  • I’m currently building a general-purpose biomedical KG that could be used as ground truth. Would it be better to focus on building smaller, use-case–driven subgraphs first?
  • Should I be using indexes or constraints differently to improve query performance?

Many times, I find myself doubting whether I’m on the right path. I’d appreciate honest feedback on whether I’m going in the right direction, suggestions on best practices, tools, or papers I should look at, and any tips from anyone who’s worked on similar biomedical graph projects.
Thank you so much in advance!
Meera

  • Mapping diverse IDs to unified identifiers is very time-consuming. Any tools, services, or workflows that help with this at scale?

I don't fully grasp what this means: what is your "mapping" process, what is the purpose of a "unified identifier", and how often will this take place?

  • I worry about duplicate relationships or inconsistent entity mappings. How do others handle KG validation or sanity checks for large graphs?

You could use MERGE instead of CREATE when building the graph: MERGE matches an existing node or relationship with the same pattern before creating a new one, so it "semantically" coalesces to the element you intend, minimising duplicates.
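For example, a minimal sketch of the difference (the labels and property names here are illustrative, not from your actual schema):

```cypher
// CREATE always adds a new node, even if one with the same ID
// already exists — running a load script twice doubles the graph.
// MERGE matches on the whole pattern first and only creates on a miss.
// Merge on the unique key alone, then set the remaining properties:
MERGE (g:Gene {uniprot_id: "P04637"})
  ON CREATE SET g.symbol = "TP53"

// The same idea for relationships, so re-running a load script
// does not duplicate edges:
MATCH (g:Gene {uniprot_id: "P04637"}),
      (d:Disease {mondo_id: "MONDO:0007254"})
MERGE (g)-[:ASSOCIATED_WITH]->(d)
```

Note that MERGE matches on the whole pattern you give it, so merging on the unique key and setting other properties with `ON CREATE SET` / `ON MATCH SET` is usually safer than putting every property inside the MERGE pattern.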

  • I’m currently building a general-purpose biomedical KG that could be used as ground truth. Would it be better to focus on building smaller, use-case–driven subgraphs first?

Probably, though there's no simple answer for this. Smaller, use-case-driven subgraphs are easier to validate against a concrete question, and you can always merge them into a larger graph later.

  • Should I be using indexes or constraints differently to improve query performance?

Yes, this will improve performance significantly. Uniqueness constraints on your entity IDs (which also create an index behind the scenes) speed up both MERGE-based loads and lookups, and plain indexes help for properties you match on frequently.
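As a sketch, assuming a `Gene` label keyed on `uniprot_id` and a `Disease` label with a `name` property (adapt to your own schema):

```cypher
// A uniqueness constraint doubles as an index and guards against
// duplicate entities slipping in during MERGE-based loads:
CREATE CONSTRAINT gene_uniprot IF NOT EXISTS
FOR (g:Gene) REQUIRE g.uniprot_id IS UNIQUE;

// A plain index for properties you match on that are not unique:
CREATE INDEX disease_name IF NOT EXISTS
FOR (d:Disease) ON (d.name);
```

This is Neo4j 5.x syntax; 4.x used the older `CREATE CONSTRAINT ON (g:Gene) ASSERT g.uniprot_id IS UNIQUE` form, so check which version you are running.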

Thank you, Josh, for your reply.
By "mapping," I mean the process of converting various IDs or names used for the same entity into a single, standardized format. For example, in the case of the entity type Gene, I have collected relationships such as gene–haplotype, gene–disease, gene–chemical etc... from multiple databases. These databases often use different identifiers for the same gene—such as NCBI Geneid, UniProtid, HGNCid, or Ensemblid.

To avoid confusion and ensure consistency across the graph, I’ve mapped all gene identifiers to their corresponding UniProt ID. This allows the knowledge graph to represent each gene uniquely, no matter which data source it came from. This unified identifier makes the graph easier to interpret, query, and integrate with other tools or datasets.
As for how often this mapping happens: it was done during the initial integration phase, but I plan to repeat it whenever I add new data or update existing sources.

For that, since it is infrequent and may require customisation, you're probably looking at a custom script that transforms the data before loading it into the graph.

Yes, I use a custom Python script to handle the mapping during data preprocessing, before loading into Neo4j. Since it’s not needed often, I run it only when adding or updating data. In the future, I may turn it into a small pipeline or utility to make it easier to reuse.
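For reference, here is a minimal sketch of that kind of preprocessing step. It assumes a simple dict-based ID map and a list of `(gene_id, relation, target)` edge tuples; the `load_id_map` and `map_gene_ids` helpers, the file layout, and the example IDs are all hypothetical, not from Meera's actual pipeline:

```python
import csv
from typing import Dict, List, Tuple

def load_id_map(path: str) -> Dict[str, str]:
    """Load a two-column TSV (source_id <TAB> uniprot_id) into a dict."""
    mapping = {}
    with open(path, newline="") as fh:
        for source_id, uniprot_id in csv.reader(fh, delimiter="\t"):
            mapping[source_id] = uniprot_id
    return mapping

def map_gene_ids(
    edges: List[Tuple[str, str, str]],
    id_map: Dict[str, str],
) -> Tuple[List[Tuple[str, str, str]], List[str]]:
    """Rewrite the gene column of (gene_id, relation, target) edges to
    UniProt IDs, collecting source IDs that could not be mapped."""
    mapped, unmapped = [], []
    seen = set()
    for gene_id, relation, target in edges:
        uid = id_map.get(gene_id)
        if uid is None:
            unmapped.append(gene_id)   # report, don't silently drop
            continue
        edge = (uid, relation, target)
        if edge not in seen:           # drop duplicate edges up front
            seen.add(edge)
            mapped.append(edge)
    return mapped, unmapped

# Example: two source IDs resolve to the same UniProt entry, so the
# duplicate edge collapses; one unknown ID gets reported.
id_map = {"ENSG00000141510": "P04637", "7157": "P04637"}
edges = [
    ("ENSG00000141510", "gene-disease", "MONDO:0007254"),
    ("7157", "gene-disease", "MONDO:0007254"),
    ("BADID", "gene-disease", "MONDO:0000001"),
]
mapped, unmapped = map_gene_ids(edges, id_map)
print(mapped)    # [('P04637', 'gene-disease', 'MONDO:0007254')]
print(unmapped)  # ['BADID']
```

Keeping the unmapped IDs as an explicit output also gives you a cheap sanity check for each load: if the unmapped list suddenly grows after a source update, something changed in that source's ID space.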