In a Neo4j database, I have created four node types (A, B, C and D), each with 1500 nodes.
For each node a in A and d in D, I want to find the tuples (a, d, count) where count is the number of paths of type A->B->C->D that connect nodes a and d.
The numbers of relationships between the types are: :AB (22382), :BC (22388) and :CD (22387).
I have tried the following Cypher query:
CALL apoc.export.csv.query("MATCH (a:A)-[:AB]->(b:B)-[:BC]->(c:C)-[:CD]->(d:D) RETURN a.`_id`, d.`_id`, count(*)", "results.csv", {})
That seems to return correct results but needs ~40 seconds to execute. Adding one more relation to the path seems to dramatically increase the execution time.
The query plan is the following:
I have increased the ulimit on the server and set
dbms.memory.heap.initial_size=5100m
dbms.memory.heap.max_size=5100m
dbms.memory.pagecache.size=6900m
in the Neo4j configuration, as neo4j-admin memrec suggests.
I use Neo4j 3.5.8 Community Edition. Without exporting to a file, the query runs in 32 seconds. The query returns 1,927,493 tuples of the form (source_node, target_node, count_of_paths). Each tuple counts only a few paths (fewer than 10 for most of them). Note that when two more relationships are added to the same query, the execution time increases to approximately two hours.
The PROFILE can be found here:
My goal is to compare the performance of Neo4j to that of a linear algebra library like Eigen, since the same result can be obtained by multiplying the adjacency matrices of AB, BC and CD. Note that this matrix multiplication takes less than a second to execute with Eigen.
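To make the matrix formulation concrete, here is a minimal sketch in plain Python (not the Eigen code I actually ran). It assumes each relationship type is stored as a sparse adjacency matrix in the form `{row: {col: count}}`; the helper names `adjacency` and `sparse_matmul` and the toy edge lists are made up for illustration. Entry (a, d) of the product AB·BC·CD is the number of A->B->C->D paths between a and d.

```python
from collections import defaultdict


def adjacency(edges):
    """Build a sparse adjacency matrix {source: {target: edge_count}}."""
    matrix = defaultdict(lambda: defaultdict(int))
    for source, target in edges:
        matrix[source][target] += 1
    return {source: dict(cols) for source, cols in matrix.items()}


def sparse_matmul(left, right):
    """Multiply two sparse matrices stored as {row: {col: value}}."""
    product = defaultdict(lambda: defaultdict(int))
    for i, row in left.items():
        for k, left_value in row.items():
            # Only rows of `right` that actually exist contribute.
            for j, right_value in right.get(k, {}).items():
                product[i][j] += left_value * right_value
    return {i: dict(cols) for i, cols in product.items()}


# Hypothetical toy graph, standing in for the real :AB, :BC and :CD edges.
ab = adjacency([("a1", "b1"), ("a1", "b2"), ("a2", "b2")])
bc = adjacency([("b1", "c1"), ("b2", "c1"), ("b2", "c2")])
cd = adjacency([("c1", "d1"), ("c2", "d1"), ("c2", "d2")])

# (AB * BC) * CD gives the number of A->B->C->D paths per (a, d) pair.
path_counts = sparse_matmul(sparse_matmul(ab, bc), cd)

for a, row in sorted(path_counts.items()):
    for d, count in sorted(row.items()):
        print(a, d, count)
```

On this toy graph, a1 reaches d1 through three distinct paths and d2 through one, which is the same (a, d, count) tuple shape the Cypher query returns.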
Is there a way to optimize such a query?