Optimisation of a cypher query

Hello,
we are doing database tests with different databases on connected data. Surprisingly Neo4J is somewhat slow compared to JDO Datanucleus. In our tests we have around 1.1 mio node, we aim for much bigger datasets ( around 10^12 nodes).

Here is a model of our datamodel (generated by apoc.meta.graph) with additional information on the number of nodes of a specific type. So we have around 1.1 mio nodes in our database.

Here is the query we are using:

MATCH (n:Artefact) WHERE NOT ()-[:Link]->(n)
WITH collect(n) AS sources_list, sinks_list
MATCH trace = (sources)-[:Link*0..9]->(sinks)
WHERE sinks in sinks_list and sources in sources_list
WITH [n IN nodes(trace)| n.neo4jIdentity] AS trace_nodes
RETURN  DISTINCT trace_nodes

This query takes 2min 8sec to execute.

Here is the execution plan:

We guess it would be helpfull to prevent the expand all for the link matching, but we are not sure how to do that.

The goal of the query is to find all paths up to a specified length between Artefacts that have no incoming edge of type Link and Artefacts that have no outgoing edge of type Link.

Additional information:
Neo4j Desktop 1.1.3
Querying tried from the Neo4J Browser and programatically with java using the Object Graph Mapper (Version 3.1.2).

Thanks in advance for your help.

Looks like the current query is using your sinks_list and sources_list for filtering, when one or the other should be used as a starting point instead. Only one of these lists needs to be collected, but you may need to experiment to see if which way is more efficient.

MATCH (sink:Artefact) 
WHERE NOT (sink)-[:Link]->() 
WITH collect(sink) as sinks_list
MATCH (source:Artefact) 
WHERE NOT ()-[:Link]->(source) 
MATCH trace = (source)-[:Link*0..9]->(sink) 
WHERE sink in sinks_list
WITH [n IN nodes(trace)| n.neo4jIdentity] AS trace_nodes 
RETURN DISTINCT trace_nodes

vs

MATCH (source:Artefact) 
WHERE NOT ()-[:Link]->(source) 
WITH collect(source) as source_list
MATCH (sink:Artefact) 
WHERE NOT (sink)-[:Link]->() 
MATCH trace = (source)-[:Link*0..9]->(sink) 
WHERE source in source_list
WITH [n IN nodes(trace)| n.neo4jIdentity] AS trace_nodes 
RETURN DISTINCT trace_nodes

If source and sink characteristics are not likely to change, you might also consider adding appropriate labels to each, that way you can do a :Source or :Sink label scan and avoid filtering for the nodes in question.

The proposed queries are using far less nodes in explain of expand all, but the execution time is similar in all cases +-3 seconds, depending on the run. So this looks good on paper, but has no practical value :(

If neo4jIdentity is unique per :Artefact node then you may not need the DISTINCT in your RETURN (I don't see how you would get identical paths with this query).

You may also want to see if labeling these nodes as :Source and :Sink (based upon the characteristics) and using them in your query helps at all (so you avoid having to filter for sources and sinks, you just do label scans instead). This would at least show if the majority of the time is being spent on the matching and filtering, or on the variable-length expansion and property access.

If property access is the bottleneck, then I suspect a custom procedure would work much faster here, especially if you use the caffeine caching library to cache the neo4jIdentity property of nodes so you can avoid redundant property access.

You should be able to run this small modified query which avoids property access and just uses the neo4j ids to test if the property access is the bottleneck here. What are the timings on this one comparatively?

MATCH (source:Artefact) 
WHERE NOT ()-[:Link]->(source) 
WITH collect(source) as source_list 
MATCH (sink:Artefact) 
WHERE NOT (sink)-[:Link]->() 
MATCH trace = (source)-[:Link*0..9]->(sink) 
WHERE source in source_list 
WITH [n IN nodes(trace)| id(n)] AS trace_nodes 
RETURN DISTINCT trace_nodes