Refactoring graph model to overcome super node problem

serpear · April 6, 2021, 4:37pm

Neo4j version: 4.2.3 enterprise
Driver: Python Neo4j driver (neo4j-driver 4.2.1)
Server: Single node (EC2 r5a.8xlarge) - 32vCPUs - 256 GB RAM - Amazon Linux 4.14.219-161.340.amzn2.x86_64
Heap: 31 GB
Pagecache: 203700m

Hello,

My company develops a web application, and we're evaluating a Neo4j graph as its data store. The graph model is roughly what you see below. The data consists of medical terms that appear in documents, and those documents are associated with patients. The documents have different sections, and which section a medical term appears in is relevant. Our usual queries are anchored on a medical concept node and traverse the graph to find which documents mention that concept in specific sections and which patients are associated with those documents. For scale, the graph holds ~265M nodes and ~1.6B relationships.

The query we usually run:

MATCH (parent_concept:Concept {id: "a_concept_id"})
MATCH (parent_concept)<-[:IS_A*0..]-(concept:Concept)
MATCH (concept)<-[:MENTIONS_IN_SECTION_0|...|:MENTIONS_IN_SECTION_n]-(doc:Document)
MATCH (patient:Patient)<-[:APPLIES_TO]-(doc)
<do_aggregations_on_documents_and_patients>
RETURN <results_of_aggregations>
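
To make the shape of the query above concrete, here is a minimal Python sketch that builds the Cypher text for a chosen set of sections and passes the concept id as a query parameter. The function name `build_concept_query` and the `count(DISTINCT ...)` aggregations are assumptions made for illustration; the post does not say which aggregations are actually used.

```python
def build_concept_query(section_indices):
    """Return Cypher text for the traversal above, limited to the given sections."""
    if not section_indices:
        raise ValueError("at least one section index is required")
    # Relationship types cannot be passed as Cypher parameters, so they are
    # composed into the query text; int() rejects anything that is not a number.
    rel_types = "|".join(f"MENTIONS_IN_SECTION_{int(i)}" for i in section_indices)
    return (
        "MATCH (parent_concept:Concept {id: $concept_id})\n"
        "MATCH (parent_concept)<-[:IS_A*0..]-(concept:Concept)\n"
        f"MATCH (concept)<-[:{rel_types}]-(doc:Document)\n"
        "MATCH (patient:Patient)<-[:APPLIES_TO]-(doc)\n"
        # Assumed aggregation, standing in for <do_aggregations_on_documents_and_patients>.
        "RETURN count(DISTINCT doc) AS documents, count(DISTINCT patient) AS patients"
    )


if __name__ == "__main__":
    print(build_concept_query([0, 2]))
```

With the Python driver, the resulting text would typically be run as something like `session.run(build_concept_query([0, 2]), concept_id="a_concept_id")`, so the same query text is reused for every concept.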

Something I've noticed is that, from a performance point of view, we have different kinds of medical concepts: some could be called normal nodes, others super nodes. Since new users can't add more than one attachment to a post, I'll share the profile plans of one normal concept and two super node concepts in a later comment. The above query takes ~300ms for the normal concept and can take up to several seconds for the super node concepts.

We've thought of refactoring the graph model to "break" the super nodes by adding artificial concept nodes, so that no original or new concept node has more than 20k incoming MENTIONS_IN... relationships (I'll show this in a later comment too). Unfortunately, running the query above on the refactored graph performs just as badly.
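
As a rough illustration of the splitting idea, here is a small pure-Python sketch that assigns the documents mentioning one concept to artificial shard nodes of at most 20k each. The function name `shard_mentions` and the `#shard` id scheme are hypothetical; the post does not describe how the artificial nodes were actually named or linked.

```python
def shard_mentions(concept_id, doc_ids, max_degree=20_000):
    """Map each hypothetical shard node id to the documents that would point at it."""
    if max_degree < 1:
        raise ValueError("max_degree must be positive")
    shards = {}
    for start in range(0, len(doc_ids), max_degree):
        # Hypothetical naming scheme for the artificial concept nodes.
        shard_id = f"{concept_id}#shard{start // max_degree}"
        shards[shard_id] = doc_ids[start:start + max_degree]
    return shards


if __name__ == "__main__":
    shards = shard_mentions("a_concept_id", list(range(45_000)))
    for shard_id, docs in shards.items():
        print(shard_id, len(docs))
```

Note that the shards together still hold every document, so a query that has to reach all of them expands the same number of relationships as before. That may be one reason the refactored graph was not faster.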

Unfortunately, I can't share any sample of the data I import, or any logs.

Anyway, I'd appreciate the opinion of Neo4j experts on how to improve the performance of the query for those super node concepts. Any other comments on how to optimize the query, improve the model, etc. are welcome too.

Thanks,

anthony_gatlin · August 23, 2022, 12:44am

@serpear, I would be interested to know how you finally solved this problem, or at least what techniques you used to minimize the performance impact of the super nodes.
