Hi. I am using Neo4j Community 4.0.4.
I have encountered this issue using the offical Bolt driver for Python, but it is also completely reproducible in the Neo4j browser (version 4.0.7).
I have a very simple graph for now, consisting of the following node and relationship types:
(:Document)-[:contains]->(:Block)
(:Block)<-[:prev]-(:Block)-[:next]->(:Block)
There are only 75 nodes in my entire test database for now - 1 Document node and 74 Block nodes.
Running the following Cypher statement brings the CPU to 100% and the memory utilization rises indefinitely, after which I have to kill the session:
match (d:Doc{name: 'doc name'})
optional match (d)-[*]-(n)
return d,n
I also got the Java heap size error one time.
It only starts to work if I set a strict upper bound on the relationship or specify the direction, e.g.:
optional match (d)-[*..5]->(n)
Considering that I am doing a strictly local graph traversal starting from the Document node and my data set is very small, how can this be happening?
From the symptoms it appears that the engine simply does not keep track of which nodes were already visited ... Am I missing something?
Hi @gene.mishchenko,
Looks like your second query has typo mistake. []--(n) ---------------> []->(n)
match (d:Doc{name: 'doc name'})
optional match (d)-[*]--(n)
return d,n
Thanks, that was just the typo in the post... Fixed... The actual question still remains.
Kindly check loop into your graph
To clarify, this isn't a loop problem. It's an issue of there being a high (limited, but high...millions or billions or higher) number of possible distinct paths when you don't add any restrictions on the types of relationships, the direction, or the upper bounds of the var-length pattern.
Cypher is interested in finding all possible distinct paths that match a pattern. There's nothing that prevents it from revisiting nodes you've visited previously, expanding through them from the same node (but using different relationships) or expanding through previously visited nodes when expanding other paths. The only restriction is that, per path, it cannot traverse the same relationship twice.
If you're interested in getting all distinct nodes, and cutting off traversal and path evaluation if you revisit a node you've visited before, then you may want to leverage APOC Procedures, notably the path expander procs.
In this case:
match (d:Doc{name: 'doc name'})
CALL apoc.path.subgraphNodes(d, {}) YIELD node as n
return d, n
subgraphNodes() and some of the other procs use a different kind of uniqueness behavior during traversal, such that a node can only be visited once total, so any other paths that encounter an already visited node are pruned and won't be further evaluated or expanded.
1 Like
Thanks, Andrew.
For some reason I assumed that Cypher will just dynamically switch from the path uniqueness semantics to the node uniqueness semantics because the operation following the match deals only with nodes and not with relationships.
In my case I have the disconnected sub-graphs that represent the output of NLP. They are tens of thousands of nodes each and are relatively dense.
This came up in the context of trying to delete a (:Doc) and everything that's connected to it before re-loading a new version of the sub-graph into Neo4j:
disconnect delete d, n
I see this cleanup activity as a very common operational task that many people may have in their use cases... Do you think it may be a good idea to include a basic DFS/BFS node traversal implementation in core Cypher when only the nodes are used downstream?
(the Graph Data Science library offers some heavy artillery that can help as well, but I think installing and managing extensions for something this simple is an overkill)
Hi Andrew,
For my better understanding please explain why this cannot be the problem of circuit /loop?
Sure.
Cypher uses a certain kind of uniqueness called relationship isomorphism, which means that a path cannot traverse the same relationship twice. More detail here:
A condition for an infinite loop when exploring a path is that the same relationships must be traversable without limit per path. Cypher's relationship isomorphism prevents this, so no infinite loops are possible with Cypher alone. (if you use a custom proc, such as one that uses the traversal API and changes the uniqueness behavior, then infinite loops can become a possibility based on what behavior is selected, but that's not the case here, and the APOC procs suggested use an even tighter form of uniqueness that would still prevent infinite loops).
We do have some Cypher changes on our backlog for allowing varied behaviors such as what you were expecting when we're after distinct nodes. No timeline for these, but we do have our eyes on this.
You can actually get some benefit already by adding DISTINCT in the WITH or RETURN that follows the variable relationship pattern, but it does require an upper bound to be present, which isn't the case in the original query.
Thanks a lot, What I overlooked is uni-directional edge between Doc and Block.
You are right it will very be a loop with below query
match (d:Doc{name: 'doc name'})
optional match (d)--(n)
return d,n
With below model
(:Document)-[:contains]->(:Block)
(:Block)<-[:prev]-(:Block)-[:next]->(:Block)
Maybe I misunderstood you here, but this won't result in a loop.
If Cypher was allowed to reuse relationships in a path, then it would loop...but it's not allowed to do so, so looping cannot happen.
As noted above, it's not about an infinite loop, but about the sheer number of all possible distinct paths of any length that can be formed from a traversal without bounds or predicates to restrain what it can traverse (excepting that a path can't reuse a relationship it's used before). We're likely talking about tens of millions to billions of paths requiring evaluation and return.