Hello I have a project that contains 109k nodes and 2.8 million connections. I have a place nodes, Visited edges and census block nodes. I am trying to find the top 10 places that have the most shared census block group nodes within a certain month.
Match (n:Place)<-[r:Visited{month: "April"}]-(c:CensusBG)-[b:Visited{month: "April"}]->(m:Place)
return n.name, m.name, count(distinct c) as cnt
order by cnt desc limit 10
I tried running it over night but it still did not work and I had to restart my machine to get the database back up again.
This is a very large graph-wide query.
First, I think we'd recommend running this on Neo4j 4.3.x, and creating a relationship index on :Visted(month)
:
https://neo4j.com/docs/cypher-manual/current/administration/indexes-for-search-performance/#administration-indexes-create-a-single-property-index-for-relationships
Also since your pattern is symmetric, it would be better to add a filtering so you don't see symmetric results (where it's the same two places and the same count, with just the places swapped for n and m).
You can add WHERE id(n) < id(m)
, that should do the trick.
That said, if the issues you're facing are heap-related, it would be best to use subqueries to divide up the work, which should make aggregations less of a threat to your heap. Something like this maybe:
MATCH (n:Place)
CALL {
WITH n
MATCH (n)<-[r:Visited{month: "April"}]-(c:CensusBG)-[b:Visited{month: "April"}]->(m:Place)
WHERE id(n) < id(m)
RETURN n.name as name1, m.name as name2, count(c) as cnt
}
RETURN name1, name2, cnt
ORDER BY cnt DESC
LIMIT 10
This ensures the aggregations are performed with respect to only the matches from a single starting :Place node at a time, instead of executing over all possible paths in the graph, which was likely blowing your heap.
1 Like