How to efficiently count all paths between two nodes in Cypher?

s.chatzopoulos · September 20, 2019, 10:55am

In a neo4j database, I have created 4 node types: A, B, C and D each one with 1500 nodes.

For each node a in A and d in D, I want to find the tuples (a, d, count) where count is the number of paths of type A->B->C->D that connect nodes a and d.

The number of relations between types are: :AB (22382), :BC (22388) and :CD (22387)

I have tried the following Cypher query:

CALL apoc.export.csv.query("MATCH (a:A)-[:AB]->(b:B)-[:BC]->(c:C)-[:CD]->(d:D) RETURN a.`_id`, d.`_id`, count(*)", "results.csv", {})

That seems to return correct results but needs ~40 seconds to execute. Adding one more relation to the path seems to dramatically increase the execution time.

The query plan is the following:

I have increased the ulimit in the server and set

dbms.memory.heap.initial_size=5100m
dbms.memory.heap.max_size=5100m
dbms.memory.pagecache.size=6900m

in neo4j configuration as neo4j-admin memrec suggests.

I use Neo4j 3.5.8 community edition. Without exporting to a file, the query runs in 32 seconds. The query returns 1,927,493 tuples in the form (source_node, target_node, count_of_paths). Each tuple counts only a few paths (for most of them less than 10). Note that when adding two more relations to the same query, the execution time is increased to approximately two hours.

The PROFILE can be found here:

My goal is to compare the performance of Neo4j to a linear algebra library like Eigen as the same result can be obtained with matrix multiplication of the adjacency matrices of AB, BC and CD. Note that this matrix multiplication with Eigen takes less that a second to execute.

Is there a way to optimize such a query?

andrew_bowman · September 20, 2019, 5:57pm

The planner seems to be doing a good amount of label scans node hash joins, and that doesn't seem to be the best approach here.

Since you already have the relationships between the nodes that force the types to be what you want (for example :CD will always connect a :C node to a :D node) you can leave the labels of the nodes out, and that should remove the possibility of the planner using separate label scans and hash joins.

Give this a try:

CALL apoc.export.csv.query("MATCH (a:A)-[:AB]->()-[:BC]->()-[:CD]->(d) WITH a, d, count(*) as totalPaths RETURN a.`_id`, d.`_id`, totalPaths", "results.csv", {})

s.chatzopoulos · September 21, 2019, 5:51pm

Indeed, that seems to help a little. In fact the query time dropped from ~37sec to ~30sec.
The query plan for the new query is the following:

I was expecting a greater speedup though.
Is there probably a better way to re-write the query ?
Also it is not clear to me if the memory allocated to the cache is sufficient.
The column Cache H/M has zero values everywhere. What does this mean ? Is this normal ?

andrew_bowman · September 23, 2019, 3:03am

You've provided the EXPLAIN plan but not the PROFILE plan. Please PROFILE the query so we can see the rows and db hits.

EXPLAIN does not run the query, which is why you're not seeing cache hit or miss info.

s.chatzopoulos · September 23, 2019, 7:08am

Yes, sorry I had in mind the PROFILE when writing the previous comment, but I haven't posted it.

I was talking about the Cache H/M column being zero for all rows. Is that normal ?
Is there a chance that the cache is disabled ?
I have set dbms.memory.pagecache.size=6900m in the neo4j.conf

Topic		Replies	Views
Slow cypher Cypher	7	513	June 23, 2021
Cypher query giving back multiple paths connecting the given multiple nodes (more than 2) Procedures & APOC cypher	3	530	April 7, 2022
Optimisation of a cypher query Cypher optimization	3	1083	January 29, 2019
How to make multi level path traversal faster Cypher apoc , performance , configuration , cypher	8	5177	December 4, 2019
Performance issue with shortestPath on cypher query Cypher performance , cypher	4	1186	May 21, 2020

July Summer Fun!

How to efficiently count all paths between two nodes in Cypher?

Related topics