We have a very large graph db, with 1.6 billion Nodes and 8.6 billion Relationships, and have been trying to make simple Cypher queries run in non-geologic time.
The most important query we need to optimize is "given a specific node, how many incoming relationships does the node have?" In Cypher, something like:
MATCH (n:syrup)-[r:syrup_to_syrup]->(c:syrup {syrup_id: 'S999'}) return count(r)
I've tried using "limit", but it's not unusual for the c node to have several hundred thousand incoming edges, and the queries stop returning (or at least become immeasurably slow) with limits over 250k or so.
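(A sketch of the limited variant, roughly what I've been running:)
MATCH (n:syrup)-[r:syrup_to_syrup]->(c:syrup {syrup_id: 'S999'})
WITH r LIMIT 250000
RETURN count(r)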
Also, I looked at the relationship count store (using size() instead of count()), but that's been deprecated.
So, here are my questions:
Would you consider our database an outlier in terms of size? Can you say if it's too big for neo4j?
Do you have any suggestions for optimizing this query?
Do you know how large queries affect the neo4j clients? My desktop client goes dog slow after a couple queries and I need to restart it. My python client will typically give up the ghost completely.
a. Is it possible to have types of nodes other than 'syrup' nodes connected by a 'syrup_to_syrup' relationship? It doesn't seem so, based on your relationship type. If only 'syrup' nodes can appear there, try not specifying the label of the target node. This eliminates the need to load the other node to check its labels.
MATCH ()-[r:syrup_to_syrup]->(c:syrup {syrup_id: 'S999'})
return count(r)
b. There is metadata on each node which holds the number of incoming/outgoing relationships by type and direction.
For example, pre-v5 one can run
match (c:syrup {syrup_id: 'S999'})
return size ( (c)-[:syrup_to_syrup]->() );
and this reports the number of relationships on the node :syrup {syrup_id: 'S999'} where the relationship type is :syrup_to_syrup and the direction is outgoing.
For v5 the equivalent is
match (c:syrup {syrup_id: 'S999'})
return count { (c)-[:syrup_to_syrup]->() };
Note: if you add a label to the other side of the expression, i.e.
match (c:syrup {syrup_id: 'S999'})
return count { (c)-[:syrup_to_syrup]->(n:syrup) };
the metadata is not consulted, and thus we need to iterate one by one over every syrup_to_syrup relationship and then check whether the destination node has the label :syrup.
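You can check which plan you're getting by prefixing either query with profile; with the labeled version the plan should show an expand-and-filter rather than a degree lookup (a sketch, assuming the v5 syntax above):
profile match (c:syrup {syrup_id: 'S999'})
return count { (c)-[:syrup_to_syrup]->(n:syrup) };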
Thank you for your suggestion. My experience with Cypher queries on our db has been that, in general, queries take a lot longer when you don't specify labels. But I like your logic and will give it a shot. Thanks again!
Thank you for your reply. We are running neo4j 5.13.0 Enterprise.
I believe you when you say there is metadata on each node for what I'm looking for, but I haven't been able to crack it in a performant fashion. (I.e., I've gone up to LIMIT 250k in queries, but anything larger gets immeasurably slow.)
I did try the pre v5 size() operator, but was told it was deprecated.
With respect to specifying labels, I would actually prefer not to! But I've found that queries without labels go much more slowly. Ideally, yeah, I'd just want something like:
MATCH (c:syrup {syrup_id: 'S999'})
RETURN COUNT ( ()-->(c) )
or
MATCH (c:syrup {syrup_id: 'S999'})
RETURN COUNT ( ()-[]->(c) )
(Note that I'm looking to count incoming, not outgoing, relationships.)
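(If the v5 count {} form from above accepts an incoming pattern, I'd guess the label-free version of what I want would look like this, though I haven't verified it on our db:)
MATCH (c:syrup {syrup_id: 'S999'})
RETURN count { ()-->(c) }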
I'll give it a shot. Thanks again for your help!
-Mike
While I was reading you guys' replies, I was also running the following:
match ()-[r:syrup_to_syrup]->(c:syrup {syrup_id:'S9999'}) return count(r)
It finished successfully after about 65 minutes.
(The answer was 24,757,502 incoming connections.)
Then I followed dana's suggestion and ran:
match (c:syrup {syrup_id:'S9999'}) return count { ()-[:syrup_to_syrup]->(c) }
and it finished in 288ms!
Of course, the previous results were probably cached. So right now I'm fishing around in the graph for another node with an ultra high number of incoming connections. (The vaaaaast majority of nodes have only single-digit incoming edges.) I'll keep y'all posted.
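(For the curious, the fishing expedition looks roughly like this; it scans every :syrup node, so it's not fast on a graph our size, but each degree check should come from the node metadata:)
MATCH (c:syrup)
RETURN c.syrup_id, count { ()-[:syrup_to_syrup]->(c) } AS in_degree
ORDER BY in_degree DESC
LIMIT 10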
"Of course, the previous results were probably cached."
We do not cache query results. We cache query plans, and the graph data may be held in RAM, but again, it's just the graph, not results.
If you preface the query with profile, i.e.
profile match (c:syrup {syrup_id:'S9999'}) return count { ()-[:syrup_to_syrup]->(c) }
this produces the query execution plan and then runs the query.
The query execution plan will first find the node in question, i.e. (c:syrup {syrup_id:'S9999'}).
If you do not have an index on :syrup(syrup_id), then if you have 100k :syrup nodes it will need to examine each of the 100k :syrup nodes to see which node(s) have syrup_id='S9999'. If you do have an index on :syrup(syrup_id), then it's a significantly smaller number of nodes to check.
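If you don't yet have one, the index can be created like so (the index name here is just an example):
CREATE INDEX syrup_id_idx IF NOT EXISTS FOR (s:syrup) ON (s.syrup_id);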
After it finds the node(s) with syrup_id='S9999', the next block of the query plan should invoke a GetNodeDegree. This GetNodeDegree is the key sign that it's getting its data from the node metadata.
Your next post asks if 5 minutes is fast enough. Provided you have an index, I see no reason why this shouldn't be seconds.