Refactoring graph model to overcome super node problem

Neo4j version: 4.2.3 enterprise
Driver: Python Neo4j driver (neo4j-driver 4.2.1)
Server: Single node (EC2 r5a.8xlarge) - 32vCPUs - 256 GB RAM - Amazon Linux 4.14.219-161.340.amzn2.x86_64
Heap: 31 GB
Pagecache: 203700m

Hello,

My company develops a web application and we're trying Neo4j as the store. The graph model is roughly what you see below. The data modelled is essentially medical terms that appear in documents, and those documents are associated with patients; the documents have different sections, and whether a medical term appears in one section or another is relevant. The queries we usually run are anchored on a medical concept node and traverse the graph to find out which documents mention that concept in specific sections, and which patients are associated with those documents. In terms of volume, the graph has ~265M nodes and ~1.6B relationships.

graph_model_20210406

Usual query we run:

MATCH (parent_concept:Concept {id: "a_concept_id"})
MATCH (parent_concept)<-[:IS_A*0..]-(concept:Concept)
MATCH (concept)<-[:MENTIONS_IN_SECTION_0|...|MENTIONS_IN_SECTION_n]-(doc:Document)
MATCH (patient:Patient)<-[:APPLIES_TO]-(doc)
<do_aggregations_on_documents_and_patients>
RETURN <results_of_aggregations>
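As a concrete sketch of that shape (the section relationship names and the count aggregation here are illustrative, not our real ones):

```cypher
MATCH (parent_concept:Concept {id: "a_concept_id"})
MATCH (parent_concept)<-[:IS_A*0..]-(concept:Concept)
// Illustrative: in reality we list the relevant MENTIONS_IN_SECTION_* types
MATCH (concept)<-[:MENTIONS_IN_SECTION_1|MENTIONS_IN_SECTION_2]-(doc:Document)
MATCH (patient:Patient)<-[:APPLIES_TO]-(doc)
RETURN count(DISTINCT doc) AS documents, count(DISTINCT patient) AS patients
```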

Something I've realized is that, from a performance point of view, we have different kinds of medical concepts: some could be called normal nodes, others super nodes. Given the limitation that prevents new users from adding more than one attachment per post, I'll share the profile plans for one normal concept and two super-node concepts in a follow-up comment. The above query takes ~300ms for the normal concept and can take up to several seconds for the super-node concepts.
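To spot which concepts behave as super nodes, a degree check along these lines can help (a sketch: `size()` on a pattern is supported in Neo4j 4.x, and this counts all incoming relationships, not just the MENTIONS ones):

```cypher
// Top 10 concepts by incoming degree, using the degree store
MATCH (c:Concept)
RETURN c.id AS concept, size((c)<--()) AS in_degree
ORDER BY in_degree DESC
LIMIT 10
```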

We've thought of refactoring the graph model to "break" the super nodes by adding new artificial concept nodes, so that no original or new concept node has more than 20k incoming MENTIONS_IN... relationships; there's a sketch of this in a further comment. Unfortunately, running the query above on the refactored graph performs just as badly.
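The refactor we tried looks roughly like this (node ids and the single relationship type here are illustrative): for a super-node concept, insert an artificial child concept and re-point a batch of mention relationships at it. Because the query expands `IS_A*0..` from the parent, the re-pointed mentions are still reachable without changing the query.

```cypher
// Sketch: create an artificial child concept and move a batch of
// mention relationships onto it, so the original concept stays
// under ~20k incoming MENTIONS_* relationships.
MATCH (super:Concept {id: "a_super_node_id"})
CREATE (bucket:Concept {id: "a_super_node_id#1", artificial: true})
CREATE (bucket)-[:IS_A]->(super)
WITH super, bucket
MATCH (doc:Document)-[r:MENTIONS_IN_SECTION_1]->(super)
WITH bucket, doc, r LIMIT 20000
CREATE (doc)-[:MENTIONS_IN_SECTION_1]->(bucket)
DELETE r
```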

Unfortunately I can't share any sample data I import or logs.

Anyway, I'd appreciate the opinion of Neo4j experts on how to improve the performance of this query for the super-node concepts, but any other comments on optimizing the query, improving the model, etc. are welcome.

Thanks,

Profile plans following in next comments:

  • Normal concept

  • Super node concept 1

  • Super node concept 2

  • Graph model refactor

As a first approach, I would install the APOC plugin in the database and use this procedure:

CALL apoc.meta.stats()

You will get a quick overview of your data's cardinalities and will know how and where to optimise. Something looks wrong with the way you are querying the relationships attached to your concepts. Would it be possible to share the queries that generated these plans? Feel free to replace literal values with fake names if needed.

Note: APOC might be already installed if you are using a cloud service to host Neo4j.