Difference between these three CYPHER queries

Hi all,

I wanted to understand why ANY and apoc’s subGraph queries are faster than typical Neo4J query. I’ve a graph with some 200M nodes with few large node clusters (connected nodes). These large clusters have up-to ~5k nodes.

When I’m querying for an example large cluster, say cluster A with 5000 nodes and a “depth” spread till a depth of 5 from the queried node using this the below query, Neo4J is constantly timing out (more than 5 minutes)

MATCH (n: User {user_id: 'text-id'})-[r*0..5]-(m) RETURN n, r, m

I checked the docs and find out that I can possibly use the shortest path technique for my use case using the ANY keyword as docs claimed that it increases performance(which it did). I modified my query but noticed significant performance gains. It returned my graph with 5000 nodes in around 10 seconds

MATCH (n: User {user_id: 'text-id'})-[r*0..5]-(m) RETURN n, r, m

Next, I tried an APOC based approach and it gave the same results with a couple of seconds

MATCH (n: User {user_id: ‘text-id’}
CALL apoc.path.subgraphAll(n, maxLevel: 5, relationshipFilter:'>', bfs:true}) YIELD nodes as nodes_, relationships AS rels_
RETURN nodes_ AS nodes, rels AS relationsips

I got this APOC query from ChatGPT and wanted to understand how the APOC based query is so much faster than my original query method. Is it just because of the ‘>’ direction filter? I tried querying in a directional way in my original query too but even then it timedout? ALso why is query with ANY so fast?

All the nodes in my graph are bidirectionally connected i.e. if a node A has outgoing edge E1 to node B, then a corresponding edge E2 from B to A is also present.

Thanks in advance !

First, I think we’re missing the any() query you mentioned, please edit so we can see that query.

There are a few factors in play here, but the main differences are in what kind of results you are asking for and what kind of uniqueness behavior is being applied during the matching.

In your original query, Cypher is by default using a RELATIONSHIP_PATH uniqueness, where within a single path result a relationship can only be traversed once. Also, it is returning results from all possible paths (which fits well with your desired results of, per path, the start node, the end node, and all relationships for that path.

I’m not seeing your any() query so I can’t comment on that right now.

The apoc.path.subgraphAll() procedure is doing a few things differently than the original cypher, and because of this you’re going to have to be very clear of what kind of results you actually want from this, as the procedure’s results may not be what you want.

First, subgraphAll() (like subgraphNodes() and spanningTree()) is using NODE_GLOBAL uniqueness to find all distinct reachable nodes that comply with the filters and options…in this case only traversing outgoing relationships up to distance 5 from the starting node.

This kind of expansion is very fast with this NODE_GLOBAL uniqueness, as it doesn’t really care about keeping or expanding all possible paths; if a path is found that expands to an already-visited node, that path is pruned and does not need to be considered further. This kind of expansion is good for queries that need to get all possible connected nodes. It is not a good fit for queries that need to further consider the relationships (or relationship properties) or paths in further detail.

For subgraphAll(), once the distinct reachable nodes are found, then the procedure finds all possible relationships between that set of nodes, and then it returns the set of the nodes and the set of those relationships that connect between those nodes. If those results aren’t what you want, then this procedure is not a good fit for what you’re trying to do.

________

I hope that helps your understanding some.

In summary, the path expander procedure (while offering some capability beyond regular Cypher, as well as some limitations) gets improved performance because it doesn’t have to expand all paths (it prunes away redundant ones that lead to already-visited nodes), so it has to do far less work than regular cypher…but that only benefits certain use cases. If your use case isn’t a good fit, if the actual results you want require you to do that work, then you can’t use these procedures to get correct results.

In addition to @andrew_bowman comments about the differences between the two methods, there is also another difference. You stated that each pair of nodes (A, B) in your graph has both a relationship from A to B and one from B to A. This means that your cypher query that does not have a directional constraint has many many more possible paths to traverse, as there is a path from A to B and one from B to A. In the case of the subgrallAll procedure, you have added an outgoing relationship constraint; thus, the is only one path to traverse between each pair of nodes.

You should get many many more paths from the cypher query as written if were not to time out. You could try the following cypher to see if it performs better and is more aligned to what you are asking from subgraphAll. Keep in mind, as @andrew_bowman mentioned, the cypher will give all the full paths, while the subgraphAll returns a set of nodes and relationships that make up the subgraph.

MATCH (n: User {user_id: 'text-id'})-[r*0..5]->(m) RETURN n, r, m

Sorry, I missed the any keyword in my ANY query. Here’s the correct query

MATCH ANY (n: User {user_id: 'text-id'})-[r*0..5]-(m) RETURN n, r, m

The directional query which you shared proved to be quite slow for my use-case when compared with the APOC query. Here’s the exact queries which I’m using:

// APOC

MATCH (n: User {user_id: ‘text-id’}
CALL apoc.path.subgraphAll(n, maxLevel: 5, relationshipFilter:'>', bfs:true}) YIELD nodes as nodes_, relationships AS rels_
RETURN nodes_ AS nodes, rels AS relationsips

// CYPHER

MATCH p=((n: User {user_id: 'text-id'})-[r*0..5]->(m))
WITH nodes(p) AS nodes_, relationships(p) AS rel_
UNWIND nodes_ AS nodes__
UNWIND nodes__ AS _nodes
UNWIND rel_ AS rel__
UNWIND rel__ AS _rel
RETURN COLLECT(DISTINCT _nodes) AS nodes, COLLECT(DISTINCT _rel) AS relationships

I’ve tried to summarise performance of these two queries

Nodes and edges which should be returned Time taken by APOC query Time taken by single direction query
501 nodes, 1100 edges 2 seconds (returns 501 nodes, 1100 edges -- no data loss) 4 seconds (returns 501 nodes but only 1050 edges)
1501, 3300 edges 4 seconds (returns 1501 nodes, 3300 edges -- no data loss) 45 seconds (returns 1501 nodes but only 3150 edges)
4501, 9900 edges 7 seconds (returns 4501 nodes, 9900 edges -- no data loss) Takes way too long to execute (>300 secs)

As you can see, APOC based querying is way faster.

This is exactly what I need. With this I get all the nodes connected to the queried node at a depth of 5 and all the edges connecting them.

For subgraphAll(), once the distinct reachable nodes are found, then the procedure finds all possible relationships between that set of nodes, and then it returns the set of the nodes and the set of those relationships that connect between those nodes. If those results aren’t what you want, then this procedure is not a good fit for what you’re trying to do.

If you don’t need the individual paths, then the APOC method will be much more efficient. It works by traversing the subgroup and collecting the nodes and relationships. It does not maintain a list of all the individual paths, which will require time and memory as the subgraph size increases. Also, you are performing multiple unwinds, which is creating a large cross-product of duplicate data, which will become slower as the subgraph size increases.

As you correctly found out, the APOC subgraph is the solution for your use case.

P.S. Not sure you need the double unwind of the nodes and relationship, as the first unwind of the path ‘p’ will return a list of nodes or relationships. As such, I think you just need an unwind of the nodes and on for the relationships. The following may be faster, as it avoids the cross product of the unwinds of the nodes and relationships together. Although, there is an extra collect step upfront.

MATCH p=((n: User {user_id: 'text-id'})-[r*0..5]->(m))
WITH collect(nodes(p)) AS nodes_, collect(relationships(p)) AS rels_
call (nodes_) {
    unwind nodes_ as node_
    unwind node_ as node
    return collect(distinct node) as nodes
}
call (rels_) {
    unwind rels_ as rel_
    unwind rel_ as rel
    return collect(distinct rel) as relationships
}
RETURN nodes, relationships

You are right, one UNWIND is good enough for directional queries. Thanks !