Database becomes non-responsive in production

We're experiencing problems with neo4j database on our new servers,
We've installed 3.4.12 causal cluster on aws,
after restart the server works fine for half an hour or so but the starts to hang and the the system becomes unresponsive until restart,
when the situation happens, the log fills with:
2019-03-18 10:58:39.820+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=855, gcTime=0, gcCount=0}

If anyone has any experience with solving this, please help :)
Thanks,
N.

We had issues with long STW pauses on enterprise neo4j running on aws on an i3.8xlarge instance. This occurred even with little pressure on system memory or cpu usage.

We were using OpenJDK. We switched to Oracle JDK and have found far fewer long STW pauses garbage collection. I believe the garbage collection in Oracle's proprietary JDK is superior to the open source version (at least for neo4j). I think we still have some tuning to do of the JVM parameters, but this certainly helped us.

$ java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

You may also see this if you are executing poorly tuned queries.

What kind of queries are running at the time? Does this correspond with data load? And do you have appropriate indexes and/or unique constraints? Have you done any query profiling to ensure expected indexes are being used?

An example of a poorly tuned query that might result in something like this, if you have a load query using MATCH or MERGE where indexes are not in place, as data load continues, this would be doing label scans instead of index seeks when performing the MERGE, so successive MERGEs in the file and in general as more and more data is loaded would get slower and slower and cause more heap memory to be used, which would have to be released afterward and could cause larger GC events.

Thanks, well try to add this, doe I think It's a bigger issue then the JDK as it takes a short time to become unresponsive and we have currently enough resources on the machine to keep all data in the memory.

Hi @andrew.bowman,
We are profiling our queries constantly,
The same query logic ran on the previous configuration just fine, without those issues,
BTW, our database is currently 2g and the page size is 11g.

Can you propose some troubleshooting techniques that we could call on the servers to try to understand what is the cause?

Thanks
N.

If your database is only 2g, then you may only need enough page cache to cover it with a bit more for expected growth. If you're not planning on any major ingestion, 3-5g ought to be fine.

How much do you have allocated to the heap? You can use neo4j-admin memrec to get recommendation on memory configuration given your available memory.

Thanks @andrew.bowman,
I've already implemented memrec in the configurations,
I've also managed to simulate the situation in a staging environment with a single query,
What is very confusing to us is that the total db hits for the query is ~4000 with a lot of pagecache hits, which is exactly as we planned it to be.

The query plan summery is:
Cypher version: CYPHER 3.4, planner: COST, runtime: SLOTTED. 4212 total db hits in 69 ms.

We can notice high CPU on the server when running the query even when no one else is running anything on the cluster

The query:


PROFILE
MATCH (p:Profile)-[:TopFollow]->(tf)
WHERE p.public_user_id = "9E4452ED-C77E-4E42-9164-F5BF00CE7E6D"
WITH *
MATCH (p)-[:Follow]->(fp:Profile)-[:SourceProfile_2019_03_31|:SourceProfile_2019_03_30|:SourceProfile_2019_03_29|:SourceProfile_2019_03_28|:SourceProfile_2019_03_27|:SourceProfile_2019_03_26|:SourceProfile_2019_03_25|:SourceProfile_2019_03_24|:SourceProfile_2019_03_23|:SourceProfile_2019_03_22]->(f:FeedItem)
WHERE (exists(fp.is_blocked) = false OR fp.is_blocked = false) AND
      f.status = 0 AND
      NOT f:Premium AND
      (f.time < datetime() OR fp.is_verified OR fp.is_approved) AND
      fp.is_freeze = false AND f.time < datetime()
WITH *
ORDER BY f.time DESC
LIMIT 100
WITH *
OPTIONAL MATCH (f)-[:SharedFeedItem]->(sf:FeedItem)<-[source]-(p)
WHERE type(source) STARTS WITH 'SourceProfile'
WITH *
WHERE sf IS NULL
WITH *
WHERE NOT EXISTS((f)<-[:Commented_on]-(tf))
WITH f.public_id as public_id, f.time as time, fp.public_user_id as public_user_id, f.status as status, f.type_id as type_id
RETURN public_id, time, public_user_id, status, type_id, 'social_following_post' as feed_classification
ORDER BY time DESC
LIMIT 100

The Plan:

Any ideas will be much appreciated

Hi @noam.k!

Did you get to solve this?
Same problems here. =/

Thanks!

Hi @oliveiracdz,
Unfortunately no, We've just implemented a lot of cache mechanism all over the system to hide the problem until we'll fix it

Same, we started seeing a lot of these errors when we move to 3.4.14 and the app becomes completely unresponsive after a little load (less than half of the load it used to be able to handle!)

@andrew.bowman any ideas?

Hi everyone!

Like @sam, i'm in the version 3.4.1 and everything works fine!
When I upgraded to version 3.5.6 the database became unresponsive with a lot of "stop-the-world pause" logs.

Could you open a separate issue for this? Logs or log excerpts can help, as well as knowing your memory heap/pagecache configuration. Also, make sure you've PROFILEd your queries and ensured that they're using expected indexes for lookup. Good idea to check if the query plans look optimized at the same time.

The more details on the new issue, the better.