Neo4j version: 4.2.3 enterprise
Driver: Python Neo4j driver (neo4j-driver 4.2.1)
Server: Single node (EC2 r5a.8xlarge) - 32vCPUs - 256 GB RAM - Amazon Linux 4.14.219-161.340.amzn2.x86_64
Heap: 31 GB
Pagecache: 203700m
Hello,
My company develops a web application, and we are evaluating Neo4j as its data store. The graph model is roughly what you see below. The data consists of medical terms that appear in documents, and those documents are associated with patients. Documents have different sections, and which section a medical term appears in is relevant. Our usual queries are anchored at a medical concept node and traverse the graph to find which documents mention that concept in specific sections and which patients are associated with those documents. For scale, the graph has ~265M nodes and ~1.6B relationships.
The usual query we run:
MATCH (parent_concept:Concept {id: "a_concept_id"})
MATCH (parent_concept)<-[:IS_A*0..]-(concept:Concept)
MATCH (concept)<-[:MENTIONS_IN_SECTION_0|...|:MENTIONS_IN_SECTION_n]-(doc:Document)
MATCH (patient:Patient)<-[:APPLIES_TO]-(doc)
<do_aggregations_on_documents_and_patients>
RETURN <results_of_aggregations>
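For reference, here is a minimal sketch of how a query of this shape could be sent from the Python driver with PROFILE, passing the concept id as a `$concept_id` parameter rather than a literal. The helper names `build_match_clauses` and `profile_concept` are hypothetical, and so are the connection details. The aggregation and RETURN part is left to the caller because it is elided above. Relationship types cannot be passed as Cypher parameters, so they are validated and spliced into the query text.

```python
from typing import List


def build_match_clauses(section_rel_types: List[str]) -> str:
    """Build the MATCH part of the query for the given relationship types.

    Hypothetical helper: the caller appends the aggregation/RETURN clause.
    """
    if not section_rel_types:
        raise ValueError("at least one relationship type is required")
    # Types are spliced into the query text, so accept plain identifiers only.
    for rel_type in section_rel_types:
        if not rel_type.isidentifier():
            raise ValueError(f"invalid relationship type: {rel_type!r}")
    rel_pattern = "|".join(section_rel_types)
    return (
        "MATCH (parent_concept:Concept {id: $concept_id})\n"
        "MATCH (parent_concept)<-[:IS_A*0..]-(concept:Concept)\n"
        f"MATCH (concept)<-[:{rel_pattern}]-(doc:Document)\n"
        "MATCH (patient:Patient)<-[:APPLIES_TO]-(doc)\n"
    )


def profile_concept(uri, user, password, concept_id,
                    section_rel_types, return_clause):
    """Run the query with PROFILE and return the profiled plan.

    Assumes the neo4j Python driver is installed and a server is reachable;
    uri, user and password are placeholders supplied by the caller.
    """
    from neo4j import GraphDatabase  # imported lazily; needs the neo4j package

    query = "PROFILE " + build_match_clauses(section_rel_types) + return_clause
    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            result = session.run(query, concept_id=concept_id)
            # consume() discards remaining records and returns the summary.
            return result.consume().profile
    finally:
        driver.close()
```

Keeping the id as a parameter means the query text is the same for every concept, which makes the profiles of normal and super node concepts easier to compare.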
One thing I've noticed is that, from a performance point of view, we have different kinds of medical concepts: some could be called normal nodes, others super nodes. Because new users cannot add more than one attachment to a post, I'll share the profile plans for one normal concept and two super node concepts in a follow-up comment. The query above takes ~300 ms for the normal concept and up to several seconds for the super node concepts.
We've considered refactoring the graph model to "break up" the super nodes by adding artificial concept nodes, so that no concept node, original or new, has more than 20k incoming MENTIONS_IN... relationships (see the diagram in a follow-up comment). Unfortunately, the query above performs just as badly on the refactored graph.
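To make the arithmetic of the split concrete, here is a small sketch of how many concept nodes a given super node would be divided into under the 20k cap, and how its incoming relationships would be spread across them. The function names and the even distribution are assumptions for illustration. How the artificial nodes are linked back to the original concept is not modelled here.

```python
import math
from typing import List

MAX_INCOMING = 20_000  # cap per concept node, as described above


def shards_needed(incoming_count: int, cap: int = MAX_INCOMING) -> int:
    """Concept nodes (original plus artificial) needed so none exceeds cap."""
    if incoming_count < 0 or cap <= 0:
        raise ValueError("incoming_count must be >= 0 and cap must be > 0")
    # Even a concept with no mentions keeps its original node.
    return max(1, math.ceil(incoming_count / cap))


def shard_sizes(incoming_count: int, cap: int = MAX_INCOMING) -> List[int]:
    """Spread the incoming relationships as evenly as possible over the shards."""
    n = shards_needed(incoming_count, cap)
    base, extra = divmod(incoming_count, n)
    # The first `extra` shards take one additional relationship each.
    return [base + 1 if i < extra else base for i in range(n)]


sizes = shard_sizes(50_000)
print(len(sizes), max(sizes), sum(sizes))
```

Note that the sizes always sum to the original count: the split changes how the relationships are distributed across nodes, not how many relationships a query anchored at the parent concept has to expand in total.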
Unfortunately, I can't share any sample of the data I import, or any logs.
In any case, I'd appreciate the opinion of Neo4j experts on how to improve the query's performance for those super node concepts. Any other comments on how to optimize the query, improve the model, etc. are also welcome.
Thanks,