Hello, our team is currently doing an implementation using Neo4j and we are facing some issues.
Short description
Our structure is relatively simple:
- the nodes have only two properties of interest, id (int) and seed (bool), and both are indexed (a sketch of the equivalent index statements is below)
- there is only one relationship type between nodes (CREATES).
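For reference, a minimal sketch of what that schema looks like in Cypher; the index names and the sample data are illustrative assumptions, not copied from our actual setup:

CREATE INDEX entity_id IF NOT EXISTS FOR (e:Entity) ON (e.id);
CREATE INDEX entity_seed IF NOT EXISTS FOR (e:Entity) ON (e.seed);

// hypothetical sample data: a seed entity that (transitively) CREATES the entity we later look up by id
CREATE (:Entity {id: 3, seed: true})-[:CREATES]->(:Entity {id: 2, seed: false})-[:CREATES]->(:Entity {id: 1, seed: false});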
What we are trying to achieve is getting the shortest path like so:
CALL apoc.cypher.runTimeboxed("MATCH p = ShortestPath((t:Entity {id: toInteger('1')})<-[:CREATES*]-(s:Entity{seed: true}))
WITH p LIMIT 10
WITH collect(p) as paths CALL apoc.convert.toTree(paths) YIELD value RETURN value;", {}, 15000)
We have 3-10 services which, in total, call the above roughly 10-20 times per second.
The Neo4j pod currently has 8 cores and 32 GB of memory assigned.
The problem we are facing:
After the pods have been running for a while, CPU usage maxes out, everything slows down, and almost all transactions start failing with "execution expired". After forcing a restart of Neo4j, everything goes back to normal and CPU usage drops from all 8 cores maxed out to ~2 cores.
What we have tried:
When transactions start to fail, we pinpoint the group of pods with the issue, grab the queries that are failing, connect directly to the pod, and execute the failing queries manually. Run manually, the failing queries take around 30-100 ms to finish.
We also ran PROFILE against the queries: memory consumption is around 300 KB, most of the time there are no DB scans, and when there are, they are in the single digits.
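For reference, this is roughly what we profile; it assumes the inner statement is profiled directly (not through runTimeboxed) and reuses the example id from the query above:

PROFILE
MATCH p = shortestPath((t:Entity {id: toInteger('1')})<-[:CREATES*]-(s:Entity {seed: true}))
WITH p LIMIT 10
WITH collect(p) AS paths
CALL apoc.convert.toTree(paths) YIELD value
RETURN value;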
Has anyone faced the above or has any pointers to what might be happening?
Edit: So far the biggest Neo4j DB we have is 120k nodes and 4 million relationships.
However, we have also seen the problem with 75k nodes and 4 million relationships.
Our graph depth is usually around 5 and doesn't usually go past 10, but the data is dynamic so that can change. We are expecting an upper limit of 20.
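Just to make that expected limit concrete, a depth-bounded version of the same pattern would look like the sketch below. This is not our production query (that one uses an unbounded *, as shown above); the *..20 bound is only an illustration of the depth assumption:

MATCH p = shortestPath((t:Entity {id: toInteger('1')})<-[:CREATES*..20]-(s:Entity {seed: true}))
RETURN p LIMIT 10;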
Edit 2: We have now seen it start failing with just 3k entities and 22k relationships.