Cypher query is taking too long when trying to find shared nodes between two separate nodes

Hello I have a project that contains 109k nodes and 2.8 million connections. I have a place nodes, Visited edges and census block nodes. I am trying to find the top 10 places that have the most shared census block group nodes within a certain month.

Match (n:Place)<-[r:Visited{month: "April"}]-(c:CensusBG)-[b:Visited{month: "April"}]->(m:Place) 

return n.name, m.name, count(distinct c) as cnt

order by cnt desc limit 10

I tried running it over night but it still did not work and I had to restart my machine to get the database back up again.

This is a very large graph-wide query.

First, I think we'd recommend running this on Neo4j 4.3.x, and creating a relationship index on :Visted(month):

https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/#administration-indexes-create-a-single-property-index-for-relationships

Also since your pattern is symmetric, it would be better to add a filtering so you don't see symmetric results (where it's the same two places and the same count, with just the places swapped for n and m).

You can add WHERE id(n) < id(m), that should do the trick.

That said, if the issues you're facing are heap-related, it would be best to use subqueries to divide up the work, which should make aggregations less of a threat to your heap. Something like this maybe:

MATCH (n:Place)
CALL {
 WITH n
 MATCH (n)<-[r:Visited{month: "April"}]-(c:CensusBG)-[b:Visited{month: "April"}]->(m:Place) 
 WHERE id(n) < id(m)
 RETURN n.name as name1, m.name as name2, count(c) as cnt
}
RETURN name1, name2, cnt
ORDER BY cnt DESC
LIMIT 10

This ensures the aggregations are performed with respect to only the matches from a single starting :Place node at a time, instead of executing over all possible paths in the graph, which was likely blowing your heap.

1 Like