Database becomes unresponsive and reports a large number of stop-the-world pauses in debug.log

We are running a multi-MERGE query with a number of UNIONs, which ends up causing stop-the-world pauses to be reported in debug.log, followed by unresponsiveness.

2019-11-13 21:51:06.560+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=128, gcTime=14160, gcCount=3}
2019-11-13 21:51:06.661+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14100, gcTime=0, gcCount=0}
2019-11-13 21:51:20.896+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13965, gcTime=0, gcCount=0}
2019-11-13 21:51:35.158+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=111, gcTime=14156, gcCount=3}
2019-11-13 21:51:35.258+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14050, gcTime=0, gcCount=0}
2019-11-13 21:51:49.580+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14052, gcTime=0, gcCount=0}
2019-11-13 21:52:03.844+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13977, gcTime=0, gcCount=0}
2019-11-13 21:52:17.992+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Scheduled checkpoint for time threshold" @ txId: 656719 checkpoint started...
2019-11-13 21:52:18.154+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14017, gcTime=0, gcCount=0}
2019-11-13 21:52:18.184+0000 INFO [o.n.k.i.s.c.CountsTracker] Rotated counts store at transaction 656719 to [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.a], from [/var/lib/neo4j/data/databases/graph.db/neostore.counts.db.b].
2019-11-13 21:52:33.109+0000 INFO [o.n.k.i.t.l.c.CheckPointerImpl] Checkpoint triggered by "Scheduled checkpoint for time threshold" @ txId: 656719 checkpoint completed in 15s 117ms
2019-11-13 21:52:33.109+0000 INFO [o.n.k.i.t.l.p.LogPruningImpl] No log version pruned, last checkpoint was made in version 2
2019-11-13 21:52:33.172+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=14754, gcTime=0, gcCount=0}
2019-11-13 21:52:47.424+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13978, gcTime=0, gcCount=0}
2019-11-13 21:53:01.615+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=13900, gcTime=0, gcCount=0}

We've isolated the seemingly expensive part of the MERGE statement to this:

MATCH   (f:User)-[:SAID]-(r:Review {public:true})-[:ABOUT]->(t:Thing)<-[:ABOUT]-(c:Card {type:{type}}), 
        (f)-[fo:FRIEND_OF]->(u:User {bjid:{bjid}}) 
    WHERE (not exists(t.closed)) 
        and (size(c.bjid) = $size) 
        and (not (u)-[:FORGOT]->(t)) 
        and (not (u)-[:FORGOT]->(r)) 
        and (not (u)-[:DELETED]->(f)) 
        and (t.lng > $lnga) 
        and (t.lng < $lngb) 
        and (t.lat > $lata) 
        and (t.lat < $latb) 
    OPTIONAL MATCH  (t)-[:IS_TYPE]->(p:Type), 
                    (r)-[:TAGGED_AS]->(g:Tag), 
                    (t)-[:IS_IN]->(l:Location) 
    RETURN t.bjid as tid, t.name as name, t.googleid as gid, t.lng as lng, t.lat as lat, collect(p.name) as typs, collect(g.name) as tags, c.bjid as cardid, id(l) as locid, l.name as locname, fo.cats as ftrust, null as fftrust, 'none' as special, f.bjid as fid, r.text as insight, 2 as rank 

with the parameters:
:params {bjid : "u00000130", lnga : -97.84595489497035,lngb : -97.58846282954046,lata : 30.107117887064263,latb : 30.441570071491142,type : "thing",none : 'none', size:6}

and we have the indexes:

Indexes
   ON :Alias(name) ONLINE  (for uniqueness constraint)
   ON :Card(bjid) ONLINE  (for uniqueness constraint)
   ON :Collection(bjid) ONLINE  (for uniqueness constraint)
   ON :Collection(tag) ONLINE  (for uniqueness constraint)
   ON :GooglePlaceType(name) ONLINE  (for uniqueness constraint)
   ON :Location(googleid) ONLINE  (for uniqueness constraint)
   ON :Tag(name) ONLINE  (for uniqueness constraint)
   ON :Thing(bjid) ONLINE  (for uniqueness constraint)
   ON :Thing(googleid) ONLINE  (for uniqueness constraint)
   ON :Type(name) ONLINE  (for uniqueness constraint)
   ON :User(bjid) ONLINE  (for uniqueness constraint)
   ON :User(mobile) ONLINE  (for uniqueness constraint)
   ON :Username(name) ONLINE  (for uniqueness constraint)

Constraints
   ON ( alias:Alias ) ASSERT alias.name IS UNIQUE
   ON ( card:Card ) ASSERT card.bjid IS UNIQUE
   ON ( collection:Collection ) ASSERT collection.bjid IS UNIQUE
   ON ( collection:Collection ) ASSERT collection.tag IS UNIQUE
   ON ( googleplacetype:GooglePlaceType ) ASSERT googleplacetype.name IS UNIQUE
   ON ( location:Location ) ASSERT location.googleid IS UNIQUE
   ON ( tag:Tag ) ASSERT tag.name IS UNIQUE
   ON ( thing:Thing ) ASSERT thing.bjid IS UNIQUE
   ON ( thing:Thing ) ASSERT thing.googleid IS UNIQUE
   ON ( type:Type ) ASSERT type.name IS UNIQUE
   ON ( user:User ) ASSERT user.bjid IS UNIQUE
   ON ( user:User ) ASSERT user.mobile IS UNIQUE
   ON ( username:Username ) ASSERT username.name IS UNIQUE

and the execution plan below:

Can you increase the heap size in the neo4j.conf file? That can be a simple fix if you still have free memory on your machine.

The t.lat and t.lng properties don't appear to be indexed. When you filter on non-indexed properties, I am not sure you are searching efficiently. The same goes for WHERE (not exists(t.closed)).

Maybe try the following indexes:

CREATE INDEX ON :Thing(lat)
CREATE INDEX ON :Thing(closed)

Also, you may want to use a label :Closed rather than a t.closed flag if it is only ever a binary flag; labels can sometimes give better performance than binary properties like that, i.e. WHERE (NOT t:Closed).
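For what it's worth, a one-off migration from the property to a label could be sketched like this (this assumes t.closed is only ever present on closed things, which may not match your data model):

```cypher
// Tag every Thing that currently carries the closed flag,
// then drop the property so the label is the single source of truth.
MATCH (t:Thing)
WHERE exists(t.closed)
SET t:Closed
REMOVE t.closed;
```

After that, the filter in the query becomes WHERE NOT t:Closed, and the label check avoids reading a property at all.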

I could, but I’ve put the memory settings as recommended by memrec. If I’m going to change them, can you suggest how I work out what to set them to?

Thing.lat and Thing.lng are coordinates; I hadn’t considered indexing those at all, but I will try that, and closed. Thanks.

@benjamin.squire can you recommend what we should set the heap size to or how we should calculate it?

The machine has 16GB and memrec recommends:

dbms.memory.heap.initial_size=4900m
dbms.memory.heap.max_size=4900m
dbms.memory.pagecache.size=6100m

which is what we've got.

Regarding the query plan: we can't make use of it with the plan elements collapsed. Please run the PROFILE again, use the double-down arrow in the lower right to expand all plan elements, and then add that here or replace the previous plan.
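For reference, profiling just means prefixing the statement with PROFILE before running it in the browser (the query body below is a simplified placeholder, not your full statement):

```cypher
// Runs the query and records actual rows and db hits per operator,
// which the browser renders as the expandable plan.
PROFILE
MATCH (u:User {bjid: $bjid})-[:FRIEND_OF]->(f:User)
RETURN count(f);
```

EXPLAIN works the same way but only shows the estimated plan without executing the query, which is useful when the query itself is what hangs the database.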

Well, you are using about 11 GB (4.9 GB heap + 6.1 GB page cache) on a 16 GB machine. You might be able to squeeze another 3 GB into heap initial_size and max_size; that could help.
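As a sketch, that change in neo4j.conf could look like the following (the exact figures are an assumption; they keep the page cache as-is and leave roughly 2 GB for the OS and other processes on the 16 GB machine):

```
# Heap raised by ~3 GB from the memrec baseline of 4900m.
# Keep initial and max equal to avoid resize pauses.
dbms.memory.heap.initial_size=7900m
dbms.memory.heap.max_size=7900m

# Page cache unchanged from the memrec recommendation.
dbms.memory.pagecache.size=6100m
```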

As Andrew stated, without seeing the profile of exactly what is happening, it's hard to say how to optimize. Alternatively, you could try using a machine with more RAM.