We're using Neo4j as a read model in an event sourced system. It's job is basically to process a stream of events and build a database for query purposes.
The events are translated into cypher statements and sent to new4j using the bolt driver, in batches of 1000-5000 statements per transaction. (sequentially, single threaded)
This has been working great up to neo4j 3.5.16, with neo4j processing +/- 1000 events per second when rebuilding from scratch.
Now, we just updated to neo4j 4.2.1, and I'm seeing following strange behaviour when doing a full rebuild (i.e. starting from an empty DB and processing all events, approx 3 million in total)
It basically keeps going at this rate from then on.
Now the strange part:
It's looks like neo4j isn't able to keep up somehow, but after waiting a while, all is back to normal.
What could be causing this?
Things I tried:
The logs do not contain any related information.
Anything in your I/O behavior?
Can you turn that into a reproducible test?
Did you test it on another system? E.g. linux / windows / older OSX ?
Anything with transaction handling that changed? Or closing of sessions/transactions?
Do you use UNWIND $array for your batches ? or sending 1k-5k statements in individually?
I deployed this on our debian servers last week (AWS), and the behaviour is exactly the same. So it doesn't seem related to anything environment-specific.
I don't use UNWIND to combine statements as each event stands on its own, there is no guarantee that consecutive events are similar. It's possible that UNWIND is used to translate a particular type of event, but that won't include a whole lot of data. So yes 1-5k events are sent individually.
One thing I tried however is playing with the cypher.min_replan_interval property.
The default of 10s seemed suspicious, as that's exactly the time I need to wait before things get back to normal.
Increasing that setting also seemed to increased the time needed to wait.
One theory I came up with so far:
It's just a wild guess, as I don't know enough about neo4j internals to confirm it.
Could this make sense?
Yes that sounds sensible.
You should also wait for the indexes/constraints to be online with
btw. for keys it's better to use constraints than indexes.
One think you could do is to invalidate the plan cache manually with a procedure call.
There are also some prefix-settings for forcing or skipping replanning as of 4.2
@michael.hunger I did try out your suggestions:
So it looks like replanning doesn't work as it should ... ?
I'm currently manually calling
db.clearQueryCaches() more regularly than I'd like during a 3 hour batched insert job but at least it mostly works. Each time I call it it fixes the issue for a shorter time so occasionally I have restart the database container and pray the job doesn't fall over.
I'm only inserting nodes with 10 or so properties, and no edges during this pass, so I'm not doing anything complex, unless the 1st party abstraction layers are doing something strange.
For anyone else who's interested, here's the GitHub issue that was raised by Christian but closed by the dev team: performance issue after upgrade 3.5 => 4.2 · Issue #12721 · neo4j/neo4j · GitHub
We're still using a workaround by pausing the job for a few seconds or calling clearQueryCaches(). As a rebuild is pretty rare, it's not that big of an issue for us.
I didn't get to making a reproducible scenario yet as requested in the github issue. We have so many parameters that could lead to this behaviour that it would probably be difficult to isolate and reproduce.
And hoping that the issue will eventually disappear in a future release...