Optimisation of a cypher query

smuessemeier · January 28, 2019, 2:44pm

Hello,
we are doing database tests with different databases on connected data. Surprisingly Neo4J is somewhat slow compared to JDO Datanucleus. In our tests we have around 1.1 mio node, we aim for much bigger datasets ( around 10^12 nodes).

Here is a model of our datamodel (generated by apoc.meta.graph) with additional information on the number of nodes of a specific type. So we have around 1.1 mio nodes in our database.

Here is the query we are using:

MATCH (n:Artefact) WHERE NOT ()-[:Link]->(n)
WITH collect(n) AS sources_list, sinks_list
MATCH trace = (sources)-[:Link*0..9]->(sinks)
WHERE sinks in sinks_list and sources in sources_list
WITH [n IN nodes(trace)| n.neo4jIdentity] AS trace_nodes
RETURN  DISTINCT trace_nodes

This query takes 2min 8sec to execute.

Here is the execution plan:

We guess it would be helpfull to prevent the expand all for the link matching, but we are not sure how to do that.

The goal of the query is to find all paths up to a specified length between Artefacts that have no incoming edge of type Link and Artefacts that have no outgoing edge of type Link.

Additional information:
Neo4j Desktop 1.1.3
Querying tried from the Neo4J Browser and programatically with java using the Object Graph Mapper (Version 3.1.2).

Thanks in advance for your help.

andrew_bowman · January 28, 2019, 4:24pm

Looks like the current query is using your sinks_list and sources_list for filtering, when one or the other should be used as a starting point instead. Only one of these lists needs to be collected, but you may need to experiment to see if which way is more efficient.

MATCH (sink:Artefact) 
WHERE NOT (sink)-[:Link]->() 
WITH collect(sink) as sinks_list
MATCH (source:Artefact) 
WHERE NOT ()-[:Link]->(source) 
MATCH trace = (source)-[:Link*0..9]->(sink) 
WHERE sink in sinks_list
WITH [n IN nodes(trace)| n.neo4jIdentity] AS trace_nodes 
RETURN DISTINCT trace_nodes

vs

MATCH (source:Artefact) 
WHERE NOT ()-[:Link]->(source) 
WITH collect(source) as source_list
MATCH (sink:Artefact) 
WHERE NOT (sink)-[:Link]->() 
MATCH trace = (source)-[:Link*0..9]->(sink) 
WHERE source in source_list
WITH [n IN nodes(trace)| n.neo4jIdentity] AS trace_nodes 
RETURN DISTINCT trace_nodes

If source and sink characteristics are not likely to change, you might also consider adding appropriate labels to each, that way you can do a :Source or :Sink label scan and avoid filtering for the nodes in question.

smuessemeier · January 29, 2019, 10:24am

The proposed queries are using far less nodes in explain of expand all, but the execution time is similar in all cases +-3 seconds, depending on the run. So this looks good on paper, but has no practical value :(

andrew_bowman · January 29, 2019, 6:10pm

If neo4jIdentity is unique per :Artefact node then you may not need the DISTINCT in your RETURN (I don't see how you would get identical paths with this query).

You may also want to see if labeling these nodes as :Source and :Sink (based upon the characteristics) and using them in your query helps at all (so you avoid having to filter for sources and sinks, you just do label scans instead). This would at least show if the majority of the time is being spent on the matching and filtering, or on the variable-length expansion and property access.

If property access is the bottleneck, then I suspect a custom procedure would work much faster here, especially if you use the caffeine caching library to cache the neo4jIdentity property of nodes so you can avoid redundant property access.

You should be able to run this small modified query which avoids property access and just uses the neo4j ids to test if the property access is the bottleneck here. What are the timings on this one comparatively?

MATCH (source:Artefact) 
WHERE NOT ()-[:Link]->(source) 
WITH collect(source) as source_list 
MATCH (sink:Artefact) 
WHERE NOT (sink)-[:Link]->() 
MATCH trace = (source)-[:Link*0..9]->(sink) 
WHERE source in source_list 
WITH [n IN nodes(trace)| id(n)] AS trace_nodes 
RETURN DISTINCT trace_nodes

Topic		Replies	Views
Optimizing a very simple (but expensive) query that takes hours to complete Cypher	6	1275	June 11, 2020
How to efficiently count all paths between two nodes in Cypher? Cypher apoc	4	4104	September 23, 2019
How to make multi level path traversal faster Cypher apoc , performance , configuration , cypher	8	5177	December 4, 2019
Query to return all possible paths return infinite no of paths Cypher performance , paths	12	906	March 7, 2022
Cypher query slow performance Cypher cypher	5	544	November 12, 2023

July Summer Fun!

Optimisation of a cypher query

Related topics