Variable path with Label is very slow(2nd post)

The first post did not correctly explain the problem.
I tested it again and obtained results that shows a significant difference.
I expected that the query with the label filter would be faster than the one without the filter. However the actual result contradicts to my guess and furthermore, showing a big difference in performance: 38269 ms VS 170 ms

< with Label filter >
profile
MATCH path = (startNode)-[*]-(connectedNode:Sentence)
WHERE id(startNode) = 580
UNWIND relationships(path) as relationship
WITH startNode(relationship) AS startNode, endNode(relationship) AS endNode, relationship
RETURN startNode, endNode, relationship limit 5000

Cypher version: , planner: COST, runtime: PIPELINED. 83520 total db hits in 38269 ms.

< without label filter >
profile
MATCH path = (startNode)-[*]-(connectedNode)
WHERE id(startNode) = 580
UNWIND relationships(path) as relationship
WITH startNode(relationship) AS startNode, endNode(relationship) AS endNode, relationship
RETURN startNode, endNode, relationship limit 5000

Cypher version: , planner: COST, runtime: PIPELINED. 60863 total db hits in 170 ms

That cartesian product in the plan is killing the performance. Since it results in 10 rows, that is 10x the number of var-length expansions performed. The plan would be better without that operator.

Is this a test db, or is this the actual db? I would assume that as the number of :Sentence nodes increases, the planner would be dissuaded from using a label scan there and doing the cartesian product.

Can you let us know the version of Neo4j you are using? We are interested if you still see this in the latest patch release of the branch you're on (4.4.x, or 5.x).

One way to nudge this is to collect the end nodes and use a filter like this:

MATCH (connectedNode:Sentence)
WITH collect(connectedNode) as connectedNodes
MATCH path = (startNode)-[*]-(connectedNode)
WHERE id(startNode) = 580
AND connectedNode IN connectedNodes
...

Hi Andrew, thank you for the answer. Sorry for the delay in my response. I had forgotten to check the reply for this issue.

Now I tested it on EE 5.10.0. again and found two things:

  1. The graph of the results returned are different:
    The graph of my query with Label is most complex.
    The graph of my query without Label is the least complex.
    The graph of your query is a little more complex than the 2nd.
    How can the 1st graph be much more complex than the 2nd or 3rd one though they have the same limit of 5000?

  2. The time to return the result of 1st query is under 1 second for the limit up to 3000 but it took 24 secs for the limit of 4000 and 54 secs for the limit of 5000. It took so much time beyond the limit of 3000. So can the performance be improved by increasing a resource or changing a setting of Neo4j server configuration?

No problem, let's get into this a bit.

While you may not get the same answers (as you have no ORDER BY to make the results determinant), each of the answers is correct for the query asked...the planner just has multiple choices in how to fulfill this, and the different choices it makes can lead to a different result set since you're limiting the results.

Patterns in your query often present multiple different ways in which they can be solved. The planner attempts to pick what it thinks is the best plan, but metadata metrics don't always capture everything it needs to know (and there is always tuning that we can do to improve on our side too), so sometimes it chooses a plan that isn't ideal. Looking at the EXPLAIN plans (or PROFILE if we are able to run them) can help us understand how the planner intends to solve the query.

In this case, the planner wants to also lookup connectedNode:Sentence by label, but doing so creates a cartesian product operation, resulting in 10 rows (each one a pairing of a distinct :Sentence node and the exact same startNode). The behavior of the rest is to find, per row, all possible paths that exist between that pair of nodes on that row.

Without the label in the pattern, a label scan won't be used, and the planner decides to only use the startNode as the only anchor node, so the expansion out is only performed for that single row. It will find all possible paths that can exist, of any length (even if they loop back to already-visited nodes, as long as you're not repeating any relationships per path) as long as the end node is a :Sentence node.

(Note this means that when you UNWIND the paths of all relationships that occur in all the paths found, you will definitely have duplicates (the same relationship may be used in an entirely different path), so be aware of that)


I get a strong feeling that what you want of the query is very different than what you are asking the query to do.

What you are asking in that query (throughout the course of it) is to find all possible paths of any length as long as they start at the startNode and end in a :Sentence node (with a LIMIT controlling how many are returned). You will naturally be traversing the exact same relationships, not per path, but across all the possible distinct paths that it finds.

You would be surprised how many possible distinct paths can exist even in a tiny graph. In the movies graph that you can play with using :play movies, there are 171 nodes and 253 relationships. Distinct paths in this graph (where the only restriction is that a relationship cannot be traversed more than once per path) can stretch into the 10s of millions or higher.

If that behavior is not really what you want the query to do, then the real question is: what is it you ACTUALLY want the query to do? Once we understand your use case better, then we could change the query so it uses an efficient means to better match your requirements.