Database becomes unresponsive and reports a large number of stop-the-world pauses in debug.log

We are running a multi-MERGE query with a number of UNIONs, which ends up causing stop-the-world pauses to be reported in debug.log, followed by unresponsiveness.

2019-11-13 21:51:06.560+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=128, gcTime=14160, gcCount=3}
2019-11-13 21:51:06.661+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14100, gcTime=0, gcCount=0}
2019-11-13 21:51:20.896+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13965, gcTime=0, gcCount=0}
2019-11-13 21:51:35.158+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=111, gcTime=14156, gcCount=3}
2019-11-13 21:51:35.258+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14050, gcTime=0, gcCount=0}
2019-11-13 21:51:49.580+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14052, gcTime=0, gcCount=0}
2019-11-13 21:52:03.844+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13977, gcTime=0, gcCount=0}
2019-11-13 21:52:17.992+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Scheduled checkpoint for time threshold" @ txId: 656719 checkpoint started...
2019-11-13 21:52:18.154+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14017, gcTime=0, gcCount=0}
2019-11-13 21:52:18.184+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 656719 to [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.a], from [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.b].
2019-11-13 21:52:33.109+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Scheduled checkpoint for time threshold" @ txId: 656719 checkpoint completed in 15s 117ms
2019-11-13 21:52:33.109+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 2
2019-11-13 21:52:33.172+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14754, gcTime=0, gcCount=0}
2019-11-13 21:52:47.424+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13978, gcTime=0, gcCount=0}
2019-11-13 21:53:01.615+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13900, gcTime=0, gcCount=0}

We've isolated the seemingly expensive part of the MERGE statement to this:

MATCH   (f:User)-[:SAID]-(r:Review {public:true})-[:ABOUT]->(t:Thing)<-[:ABOUT]-(c:Card {type:{type}}), 
        (f)-[fo:FRIEND_OF]->(u:User {bjid:{bjid}}) 
    WHERE (not exists(t.closed)) 
        and (size(c.bjid) = $size) 
        and (not (u)-[:FORGOT]->(t)) 
        and (not (u)-[:FORGOT]->(r)) 
        and (not (u)-[:DELETED]->(f)) 
        and (t.lng > $lnga) 
        and (t.lng < $lngb) 
        and (t.lat > $lata) 
        and (t.lat < $latb) 
    OPTIONAL MATCH  (t)-[:IS_TYPE]->(p:Type), 
                    (r)-[:TAGGED_AS]->(g:Tag), 
                    (t)-[:IS_IN]->(l:Location) 
    RETURN t.bjid as tid, t.name as name, t.googleid as gid, t.lng as lng, t.lat as lat, collect(p.name) as typs, collect(g.name) as tags, c.bjid as cardid, id(l) as locid, l.name as locname, fo.cats as ftrust, null as fftrust, 'none' as special, f.bjid as fid, r.text as insight, 2 as rank 

with the parameters:
:params {bjid : "u00000130", lnga : -97.84595489497035,lngb : -97.58846282954046,lata : 30.107117887064263,latb : 30.441570071491142,type : "thing",none : 'none', size:6}

and we have the indexes:

Indexes
   ON :Alias(name) ONLINE  (for uniqueness constraint)
   ON :Card(bjid) ONLINE  (for uniqueness constraint)
   ON :Collection(bjid) ONLINE  (for uniqueness constraint)
   ON :Collection(tag) ONLINE  (for uniqueness constraint)
   ON :GooglePlaceType(name) ONLINE  (for uniqueness constraint)
   ON :Location(googleid) ONLINE  (for uniqueness constraint)
   ON :Tag(name) ONLINE  (for uniqueness constraint)
   ON :Thing(bjid) ONLINE  (for uniqueness constraint)
   ON :Thing(googleid) ONLINE  (for uniqueness constraint)
   ON :Type(name) ONLINE  (for uniqueness constraint)
   ON :User(bjid) ONLINE  (for uniqueness constraint)
   ON :User(mobile) ONLINE  (for uniqueness constraint)
   ON :Username(name) ONLINE  (for uniqueness constraint)

Constraints
   ON ( alias:Alias ) ASSERT alias.name IS UNIQUE
   ON ( card:Card ) ASSERT card.bjid IS UNIQUE
   ON ( collection:Collection ) ASSERT collection.bjid IS UNIQUE
   ON ( collection:Collection ) ASSERT collection.tag IS UNIQUE
   ON ( googleplacetype:GooglePlaceType ) ASSERT googleplacetype.name IS UNIQUE
   ON ( location:Location ) ASSERT location.googleid IS UNIQUE
   ON ( tag:Tag ) ASSERT tag.name IS UNIQUE
   ON ( thing:Thing ) ASSERT thing.bjid IS UNIQUE
   ON ( thing:Thing ) ASSERT thing.googleid IS UNIQUE
   ON ( type:Type ) ASSERT type.name IS UNIQUE
   ON ( user:User ) ASSERT user.bjid IS UNIQUE
   ON ( user:User ) ASSERT user.mobile IS UNIQUE
   ON ( username:Username ) ASSERT username.name IS UNIQUE

and the execution plan below:

Can you increase the heap size in the neo4j.conf file? That can be a simple fix if you still have free memory on your machine.

The t.lat and t.lng properties don't appear to be indexed. When you filter on non-indexed properties, I am not sure you are searching efficiently. The same goes for WHERE (not exists(t.closed)).

Maybe try the following indexes:

CREATE INDEX ON :Thing(lat)
CREATE INDEX ON :Thing(closed)

Also, you may want to use a label :Closed rather than a t.closed flag if it is only ever a binary flag; labels can sometimes give better performance than binary properties like that, i.e. WHERE (NOT t:Closed).
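For what it's worth, a one-off migration from the property to a label could be sketched like this (this assumes t.closed is only ever present on closed things, which may not match your data model):

```cypher
// Tag every Thing that currently carries the closed flag,
// then drop the property so the label is the single source of truth.
MATCH (t:Thing)
WHERE exists(t.closed)
SET t:Closed
REMOVE t.closed;
```

After that, the filter in the query becomes WHERE NOT t:Closed, and the label check avoids reading a property at all.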

I could, but I’ve put the memory settings as recommended by memrec. If I’m going to change them, can you suggest how I work out what to set them to?

Thing.lat and Thing.lng are coordinates; I hadn’t considered indexing those at all, but I will try that, and closed. Thanks.

@benjamin.squire can you recommend what we should set the heap size to or how we should calculate it?

The machine has 16GB and memrec recommends:

dbms.memory.heap.initial_size=4900m
dbms.memory.heap.max_size=4900m
dbms.memory.pagecache.size=6100m

which is what we've got.

Regarding the query plan: we can't make use of it with the plan elements collapsed. Please run the PROFILE again, use the double-down arrow in the lower right to expand all plan elements, and then add that here or replace the previous plan.
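For reference, profiling just means prefixing the statement with PROFILE before running it in the browser (the query body below is a simplified placeholder, not your full statement):

```cypher
// Runs the query and records actual rows and db hits per operator,
// which the browser renders as the expandable plan.
PROFILE
MATCH (u:User {bjid: $bjid})-[:FRIEND_OF]->(f:User)
RETURN count(f);
```

EXPLAIN works the same way but only shows the estimated plan without executing the query, which is useful when the query itself is what hangs the database.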

Well, you are using about 11 GB (4.9 GB heap + 6.1 GB page cache) on a 16 GB machine. You might be able to squeeze another 3 GB into heap initial_size and max_size; that could help.
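As a sketch, that change in neo4j.conf could look like the following (the exact figures are an assumption; they keep the page cache as-is and leave roughly 2 GB for the OS and other processes on the 16 GB machine):

```
# Heap raised by ~3 GB from the memrec baseline of 4900m.
# Keep initial and max equal to avoid resize pauses.
dbms.memory.heap.initial_size=7900m
dbms.memory.heap.max_size=7900m

# Page cache unchanged from the memrec recommendation.
dbms.memory.pagecache.size=6100m
```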

As Andrew stated, without seeing the profile of exactly what is happening, it's hard to say how to optimize. Alternatively, you could try using a machine with more RAM.