Database becomes non-responsive in production

performance
(Noam K) #1

We're experiencing problems with our Neo4j database on our new servers.
We've installed a 3.4.12 causal cluster on AWS.
After a restart the server works fine for half an hour or so, but then it starts to hang and the system becomes unresponsive until the next restart.
When this happens, the log fills with:
2019-03-18 10:58:39.820+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=855, gcTime=0, gcCount=0}

If anyone has any experience with solving this, please help :)
Thanks,
N.

0 Likes

(Simon) #2

We had issues with long STW pauses on Neo4j Enterprise running on AWS on an i3.8xlarge instance. This occurred even with little pressure on system memory or CPU.

We were using OpenJDK. We switched to the Oracle JDK and have seen far fewer long STW garbage-collection pauses. I believe the garbage collection in Oracle's proprietary JDK is superior to the open-source version (at least for Neo4j). I think we still have some JVM parameter tuning to do, but this certainly helped us.

$ java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)
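To confirm whether GC is actually behind the pauses, Neo4j 3.x can write a dedicated GC log. A minimal neo4j.conf fragment (the setting names are from the 3.x configuration reference; the rotation values are just illustrative):

```
# Write a dedicated GC log to logs/gc.log
dbms.logs.gc.enabled=true
# JVM logging flags for Java 8; -XX:+PrintGCApplicationStoppedTime
# also captures non-GC stop-the-world pauses (e.g. safepoints)
dbms.logs.gc.options=-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime
# Rotate the log so it doesn't grow unbounded (illustrative values)
dbms.logs.gc.rotation.size=20m
dbms.logs.gc.rotation.keep_number=5
```

Note that the log line in the original post shows pauseTime=855 with gcTime=0 and gcCount=0; if the GC log likewise shows long pauses with no matching collection, the pause may come from safepoints or the OS rather than garbage collection itself.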
0 Likes

(Andrew Bowman) #3

You may also see this if you are executing poorly tuned queries.

What kind of queries are running at the time? Does this correspond with data load? And do you have appropriate indexes and/or unique constraints? Have you done any query profiling to ensure expected indexes are being used?

An example of a poorly tuned query that might result in something like this: a load query using MATCH or MERGE without indexes in place. As the data load continues, the MERGE does label scans instead of index seeks, so successive MERGEs in the file (and in general, as more and more data is loaded) get slower and slower. They also use more heap memory, which has to be released afterward and can cause larger GC events.
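As a hypothetical sketch (the :Person label and id property are made up for illustration), creating the unique constraint before the load and then checking the plan with PROFILE:

```cypher
// Create a unique constraint (3.x syntax) so MERGE can use an index seek
CREATE CONSTRAINT ON (p:Person) ASSERT p.id IS UNIQUE;

// PROFILE the load query: the plan should show a NodeUniqueIndexSeek
// operator rather than NodeByLabelScan
PROFILE
MERGE (p:Person {id: 123})
RETURN p;
```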

0 Likes

(Noam K) #4

Thanks, we'll try to add this, though I think it's a bigger issue than the JDK, as it takes only a short time to become unresponsive and we currently have enough resources on the machine to keep all the data in memory.

0 Likes

(Noam K) #5

Hi @andrew.bowman,
We are profiling our queries constantly.
The same query logic ran just fine on the previous configuration, without these issues.
BTW, our database is currently 2g and the page cache size is 11g.

Can you propose some troubleshooting techniques that we could call on the servers to try to understand what is the cause?

Thanks
N.

0 Likes

(Andrew Bowman) #6

If your database is only 2g, then you may only need enough page cache to cover it with a bit more for expected growth. If you're not planning on any major ingestion, 3-5g ought to be fine.

How much have you allocated to the heap? You can use neo4j-admin memrec to get a recommendation on memory configuration given your available memory.
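For reference, memrec prints its suggestions in neo4j.conf format (e.g. `neo4j-admin memrec --memory=16g` on 3.4). The values below are purely illustrative of the output shape, not a recommendation for this machine:

```
# Illustrative only; run memrec against your own hardware
dbms.memory.heap.initial_size=5g
dbms.memory.heap.max_size=5g
dbms.memory.pagecache.size=7g
```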

0 Likes

(Noam K) #7

Thanks @andrew.bowman,
I've already applied the memrec recommendations in the configuration.
I've also managed to reproduce the situation in a staging environment with a single query.
What is very confusing to us is that the total db hits for the query are ~4000, with a lot of page cache hits, which is exactly as we planned it to be.

The query plan summary is:
Cypher version: CYPHER 3.4, planner: COST, runtime: SLOTTED. 4212 total db hits in 69 ms.

We can see high CPU on the server when running the query, even when no one else is running anything on the cluster.

The query:


PROFILE
MATCH (p:Profile)-[:TopFollow]->(tf)
WHERE p.public_user_id = "9E4452ED-C77E-4E42-9164-F5BF00CE7E6D"
WITH *
MATCH (p)-[:Follow]->(fp:Profile)-[:SourceProfile_2019_03_31|:SourceProfile_2019_03_30|:SourceProfile_2019_03_29|:SourceProfile_2019_03_28|:SourceProfile_2019_03_27|:SourceProfile_2019_03_26|:SourceProfile_2019_03_25|:SourceProfile_2019_03_24|:SourceProfile_2019_03_23|:SourceProfile_2019_03_22]->(f:FeedItem)
WHERE (exists(fp.is_blocked) = false OR fp.is_blocked = false) AND
      f.status = 0 AND
      NOT f:Premium AND
      (f.time < datetime() OR fp.is_verified OR fp.is_approved) AND
      fp.is_freeze = false AND f.time < datetime()
WITH *
ORDER BY f.time DESC
LIMIT 100
WITH *
OPTIONAL MATCH (f)-[:SharedFeedItem]->(sf:FeedItem)<-[source]-(p)
WHERE type(source) STARTS WITH 'SourceProfile'
WITH *
WHERE sf IS NULL
WITH *
WHERE NOT EXISTS((f)<-[:Commented_on]-(tf))
WITH f.public_id as public_id, f.time as time, fp.public_user_id as public_user_id, f.status as status, f.type_id as type_id
RETURN public_id, time, public_user_id, status, type_id, 'social_following_post' as feed_classification
ORDER BY time DESC
LIMIT 100

The Plan:

Any ideas will be much appreciated

0 Likes