Why is this query too slow?

graphene · September 13, 2023, 1:13am

Neo4j v5, (paying for professional)

I have a graph of :User nodes that can have various different relationship types connecting them.
I would like to find the paths between :User nodes that have > x number of total :User nodes.

Here is my query (thanks ChatGPT!)

MATCH path = (u1:User)-[*..]-(u2:User)
WITH path, [node in nodes(path) WHERE node:User] as userNodes
WHERE size(apoc.coll.toSet(userNodes)) > 5
RETURN path

This looks good to me, but it's so slow that the query never finishes (in Neo4j desktop and web browser).

My db has about 500K nodes but I suspect there are only a dozen paths that have more than 5 :User nodes.

Is there a way to query for this info that will perform better?

andrew_bowman · September 13, 2023, 9:16pm

The query you have will find EVERY possible path between EVERY possible combination from all :User nodes in your graph, and the only limitation is that within a path each relationship can only be traversed once (but nodes can be revisited freely). You don't have any upper bound defined. Even with a small graph, like the movies graph with only 171 nodes, the number of such paths returned can span into the billions.

I think you really need to think about what kinds of criteria you want to have here (types of relationships, directions of relationships, upper bounds, etc), as this is just far too much.

andrew_bowman · September 13, 2023, 9:27pm

To give some concrete examples, using that same 171 node / 253 relationship movies graph that you can get with :play movies, if we ran this query:

match (:Person)-[*1..5]-(:Person)
return count(*)

we get 34582 paths.

If we bump up the upper bound to lengths of 6, now we're up to 573590 distinct paths. At upper bound of 8, it's 11679308 paths, and it will likely become infeasible from there. And this graph is TINY compared to your 500k node graph.

It seems to me that this kind of query isn't really what you want, so it will be good idea to rethink what it is you are actually after.

graphene · September 13, 2023, 9:46pm

Thank you, that's very helpful!
Is there an efficient way to find the n largest subgraphs in the db, as measured by number of nodes, or number of nodes with the :User label?

My db is full of many small clusters of users connected together, with a few anomalous clusters that are much bigger than the others, and I'd like to find those larger clusters/subgraphs.

andrew_bowman · September 14, 2023, 3:50am

Okay, that sounds like you're not interested in paths at all (and this is part of the performance problem, Cypher is trying to find permutations of distinct paths but that's not what you actually want), but looking for disconnected sets of subgraphs instead, and you want to do something with these depending on the number of :User nodes per subgraph?

You may want to make use of the Graph Data Science library. Once you project your graph in memory, you can use the Weakly Connected Components algorithm to find these clusters (called 'components' according to the algorithm).

You would want to first do a graph projection with all the nodes and relationships to be considered, then run the wcc algorithm with the nodes and relationships involved, and the algorithm will yield nodeId, componentId. You can collect the nodeIds() grouping by the componentId (each component will represent one of these connected subgraphs), and filter on the nodes to get the counts of nodes with just the labels you're interested in, then filter or sort or whatever you want to do from there.

For more in-depth advice on using this, you can ask here:

graphene · October 2, 2023, 8:38pm

Thank you! I didn't realize I would need that, I'll sign up for AuraDS now

Topic		Replies	Views
Query all the paths to one node with one kind of relationship neo4j Cypher cypher	9	4517	September 9, 2020
Cypher query for getting a subgraph by multiple relationship paths Cypher	8	2658	May 27, 2022
Subgraph query in graphDB Cypher cypher	3	315	September 25, 2021
Help re-writing query Cypher performance , cypher	7	1052	March 11, 2019
Optimizing Subgraph Queries for Low Latency API calls Cypher apoc , performance , cypher , cloud	9	1626	October 28, 2018

July Summer Fun!

Why is this query too slow?

Related topics