Optimizing simple queries for very large graph DB

We have a very large graph db, with 1.6 billion Nodes and 8.6 billion Relationships, and have been trying to make simple Cypher queries run in non-geologic time.

The most important query we need to optimize is "given a specific node, how many incoming relationships does the node have?" In Cypher, something like:
MATCH (n:syrup)-[r:syrup_to_syrup]->(c:syrup {syrup_id: 'S999'}) return count(r)

I've tried using "limit", but it's not unusual for the c node to have several hundred thousand incoming edges, and the queries stop returning (or at least become immeasurably slow) with limits over 250k or so.
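
For reference, the capped approach looks roughly like this (illustrative only, not our exact production query; where the cap goes has varied):

MATCH (n:syrup)-[r:syrup_to_syrup]->(c:syrup {syrup_id: 'S999'})
WITH r LIMIT 250000
RETURN count(r)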

I also looked at the relationship count store (using "size" instead of "count"), but that's been deprecated.

So, here are my questions:

  1. Would you consider our database an outlier in terms of size? Can you say if it's too big for neo4j?
  2. Do you have any suggestions for optimizing this query?
  3. Do you know how large queries affect the neo4j clients? My desktop client goes dog slow after a couple queries and I need to restart it. My python client will typically give up the ghost completely.

Looking forward to your suggestions!
Best,
-Mike

Is it possible for nodes other than 'syrup' nodes to be connected by a 'syrup_to_syrup' relationship? It doesn't seem so, based on your relationship type. If only 'syrup' nodes are involved, try not specifying the label on the other node. That eliminates the need to load the other node just to check its labels.

MATCH ()-[r:syrup_to_syrup]->(c:syrup {syrup_id: 'S999'}) 
return count(r)

Maybe that helps.

@cct

a. what version of Neo4j?

b. there is metadata on each node which holds the # of incoming/outgoing relationships by type and direction.
For example pre v5 one can run

match (c:syrup {syrup_id: 'S999'})  
return size (   (c)-[:syrup_to_syrup]->()  );

and this will report the number of relationships on the node :syrup {syrup_id: 'S999'} where the relationship type is :syrup_to_syrup and the direction is outgoing.

For v5 the equivalent is

match (c:syrup {syrup_id: 'S999'})  
return count  {   (c)-[:syrup_to_syrup]->()   };

note if you add a label to the other side of the expression, i.e.

match (c:syrup {syrup_id: 'S999'})  
return count  {   (c)-[:syrup_to_syrup]->(n:syrup)   };

the metadata is not consulted, and instead every syrup_to_syrup relationship is iterated over one by one to check whether the destination node has the :syrup label.
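
A quick way to see the difference yourself (assuming you can run these in Browser or cypher-shell) is to profile both forms; the first plan should contain a getDegree projection, while the second falls back to expanding and filtering relationships one by one:

profile match (c:syrup {syrup_id: 'S999'}) return count { (c)-[:syrup_to_syrup]->() };
profile match (c:syrup {syrup_id: 'S999'}) return count { (c)-[:syrup_to_syrup]->(:syrup) };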

Thank you for your suggestion. My experience with Cypher queries on our db has been that, in general, queries take a lot longer when you don't specify labels. But I like your logic and will give it a shot. Thanks again!

Thank you for your reply. We are running neo4j 5.13.0 Enterprise.

I believe you when you say there is metadata on each node for what I'm looking for, but I haven't been able to crack it in a performant fashion. (i.e. I've gone up to limit 250k in queries but larger than that gets to be immeasurably slow.)

I did try the pre v5 size() operator, but was told it was deprecated.

With respect to specifying labels, I would actually prefer not to! But I've found that queries without labels go much more slowly. Ideally, yeah, I'd just want something like:

MATCH (c:syrup {syrup_id: 'S999'})
RETURN COUNT ( ()-->(c) )

or

MATCH (c:syrup {syrup_id: 'S999'})
RETURN COUNT ( ()-[]->(c) )

(Note that I'm looking to count incoming, not outgoing, relationships.)
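
If the v5 count syntax accepts an untyped, label-free pattern (I haven't verified that on our db yet), I'm guessing the incoming version would be written with the new braces syntax, something like:

MATCH (c:syrup {syrup_id: 'S999'})
RETURN count { ()-->(c) }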

I'll give it a shot. Thanks again for your help!
-Mike

@cct

Yes, for v5 do not use size( (s)-[r]->() ); instead use count { (s)-[r]->() }.

Note that's count { } and not count( ), i.e. for this case it's count with curly braces and not count with parentheses.
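
For the incoming count you described, that would be along the lines of (same idea, just the pattern direction reversed):

match (c:syrup {syrup_id: 'S999'})
return count { ()-[:syrup_to_syrup]->(c) };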

While I was reading you guys' replies, I was also running the following:

match ()-[r:syrup_to_syrup]->(c:syrup {syrup_id:'S9999'}) return count(r)

It finished successfully after about 65 minutes.
(The answer was 24,757,502 incoming connections.)

Then I followed dana's suggestion and ran:

match (c:syrup {syrup_id:'S9999'}) return count { ()-[:syrup_to_syrup]->(c) }

and it finished in 288ms!

Of course, the previous results were probably cached. So right now I'm fishing around in the graph for another node with an ultra high number of incoming connections. (The vaaaaast majority of nodes have only single-digit incoming edges.) I'll keep y'all posted.
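
In case anyone is curious, the fishing looks roughly like this (it scans every :syrup node, so it is not fast, and the threshold is arbitrary):

MATCH (c:syrup)
WITH c, count { ()-[:syrup_to_syrup]->(c) } AS indegree
WHERE indegree > 1000000
RETURN c.syrup_id, indegree
LIMIT 10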

I found a node with 2795842 incoming relationships and ran dana's query on it.
It took 239,107 ms, which is a significant improvement. Thanks, dana!

I'm going to work with my chief DS and our engineers and see if ~5 minutes is fast enough for our purposes. In the meantime, thanks again!

-Mike

@cct

your comment of

match (c:syrup {syrup_id:'S9999'}) return count { ()-[:syrup_to_syrup]->(c) }

and it finished in 288ms!

Of course, the previous results were probably cached.

We do not cache query results. We cache query plans, and graph data may be held in RAM (the page cache), but again, it's just the graph, not query results.
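
If you want to rule out the plan cache as a factor between runs, you can clear it (if I remember the procedure name correctly) with:

CALL db.clearQueryCaches();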

If you preface the query with profile, thus:

profile match (c:syrup {syrup_id:'S9999'}) return count { ()-[:syrup_to_syrup]->(c) }

this produces the query execution plan and then runs the query.
The query execution plan will first find the node in question, i.e. (c:syrup {syrup_id:'S9999'}).
If you do not have an index on :syrup(syrup_id) and you have, say, 100k :syrup nodes, then it will need to examine each of the 100k :syrup nodes to see which node(s) have syrup_id='S9999'. If you have an index on :syrup(syrup_id), the planner can seek directly to the matching node(s) instead.
After it finds the node(s) with syrup_id='S9999', the next block of the query plan should invoke GetDegree. That GetDegree operator is the sign that the count is coming from the node metadata.
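
If the index turns out to be missing, creating one is a one-liner; sketch below, where the name syrup_id_idx is just an example (a TEXT index would also work if syrup_id is a string):

CREATE INDEX syrup_id_idx IF NOT EXISTS FOR (s:syrup) ON (s.syrup_id);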

Your next post asks if 5 minutes is fast enough. Provided you have an index, I see no reason why this shouldn't be a matter of seconds.

Thanks for the drilldown!

I said ~5 minutes because that's how long it took (239,107 ms) for the other node with 2,795,842 incoming relationships.

And oh yes we have definitely indexed :syrup(syrup_id). That was one of our earliest lessons!
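
For anyone wanting to double-check their own setup, something along these lines will list the index:

SHOW INDEXES YIELD name, labelsOrTypes, properties
WHERE 'syrup' IN labelsOrTypes
RETURN name, labelsOrTypes, properties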

Thanks again!
-Mike

@cct

For the query that takes 5 minutes, if you preface it with profile, does the query plan report a block that includes a reference to GetDegree?

Can you share the query plan?

Planner COST

Runtime PIPELINED

Runtime version 5.13

Batch size 128

+-----------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+-----------+---------------------+
| Operator        | Id | Details                                                                                              | Estimated Rows | Rows | DB Hits | Memory (Bytes) | Page Cache Hits/Misses | Time (ms) | Pipeline            |
+-----------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+-----------+---------------------+
| +ProduceResults |  0 | cnt                                                                                                  |              1 |    1 |       0 |              0 |                        |           |                     |
| |               +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+                        |           |                     |
| +Projection     |  1 | getDegree((c)<-[:concept_to_concept]-()) + getDegree((c)<-[:concept_to_concept_unlabeled]-()) AS cnt |              1 |    1 |       2 |                |                        |           |                     |
| |               +----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+                        |           |                     |
| +NodeIndexSeek  |  2 | TEXT INDEX c:concept(id) WHERE id = $autostring_0                                                    |              1 |    1 |       2 |            248 |             828670/587 |  1744.357 | Fused in Pipeline 0 |
+-----------------+----+------------------------------------------------------------------------------------------------------+----------------+------+---------+----------------+------------------------+-----------+---------------------+

Total database accesses: 4, total allocated memory: 312

@cct
Thanks. The query plan looks as good as it is going to get.

Except we have moved on from :syrup to :concept, and all along I was dreaming of :syrup and :pancakes: ;)
