Extremely slow query time with concurrent queries

I have a 18GB database, on 3.5.4 enterprise.
When running simple index queries such as match (n:Product {pk:"id"}) return n, which runs at <5ms with single connections, will slow down to >1000ms when handling >10 simultaneous connections, with most of the queries state as planning.

Testing on a variety of systems, from 8 - 64 avail cores, the CPU has always been low. With no obvious memory issues.
To alleviate the issue, I upped the page and heap memory, as well as increasing thread pools and query cache, to no avail:

dbms.memory.heap.initial_size=25G
dbms.memory.heap.max_size=25G
dbms.memory.pagecache.size=30G
dbms.connector.bolt.thread_pool_min_size=50
dbms.connector.bolt.thread_pool_max_size=1000
dbms.connector.bolt.thread_pool_keep_alive=10s
dbms.query_cache_size=100000

We typically restore the graphs from file backups into docker containers for deployment with:
neo4j-admin restore --from=/data/payload --force=true

I do see in debug.log that there are lots of entries such as:

2019-07-09 17:01:24.434+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2102, gcTime=2149, gcCount=1}
2019-07-09 17:05:57.297+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2160, gcTime=2242, gcCount=1}
2019-07-09 17:10:24.668+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2189, gcTime=2272, gcCount=1}
2019-07-09 17:14:28.038+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2210, gcTime=2263, gcCount=1}
2019-07-09 17:18:20.321+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2254, gcTime=2251, gcCount=1}
2019-07-09 17:22:27.098+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2226, gcTime=2286, gcCount=1}

This can't be normal since the DB isn't that big. Any ideas?

That's strange, can you PROFILE the query and add it here? Also, can you try parameterizing your query instead of using string literals? That may help as no replanning will be needed, all similar queries with different parameters will use the same cached query plan.

Was anything else executing at the time? Any triggers setup with your db?

The profile is exactly just 1 item hit, I've optimised my queries. The actual calls are more complex with a mix of queries, but the primary slow downs comes from concurrent mix calls like this:

with N4J_DRIVERS['product'].session() as sess:
    row = sess.run('match (n:Product {pk:$d}) return n', d=arg)

with N4J_DRIVERS['product'].session() as sess:
    row = sess.run('match (n:Product {pk:$d})-[:HAS_CONTEXT]-(c:CloneID) return c.name', d=arg)

I'll see a fast performance degrade from 200 items/s to 30 it/s on 128 threads. For example.
I've tuned -XX:NewRatio=1. Still same thing. I'm on GCP with 8 vCPUs, 52 GB memory btw.

Do you have query logging enabled? If so, could you provide your query.log, or some excerpts from it? We're looking for timing breakdown info (you may need to enable that in your config)

Is there a way you can test with 3.5.7? We had a fix in 3.5.6 regarding our query collector which runs in the background, as it was hogging memory. It may be related, if it was contributing to heap usage and triggering GCs. At the least we can rule out any bugs fixed since 3.5.4.

I upgraded to 3.5.7, seems to help

@andrew.bowman
The upgrade helped, but I'm still running to delays on long running jobs, querys on 3.5.7:

/35.202.116.199:7687> conversion - match (p:Product)-[:HAS_CONTEXT]-(c:CloneID) where p.pk="xxx" return c.name as clone_id - {} - {}
2019-07-10 14:29:25.074+0000 INFO 129 ms: (planning: 92, waiting: 0) - 7 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.86.97:50436 server
/35.202.116.199:7687> conversion - match (s:Product) where s.pk="xxx" return s - {} - {}
2019-07-10 14:29:25.078+0000 INFO 0 ms: (planning: 0, waiting: 0) - 91 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.86.97:50436 server/35.202.
116.199:7687> conversion - match (s:Product)<-[r:HAS_CONTEXT]->(t) where s.pk=$pk return s,r,t - {pk: 'xxx'} - {}
2019-07-10 14:29:25.313+0000 INFO 147 ms: (planning: 90, waiting: 0) - 13 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.150.197:57560 server
/35.202.116.199:7687> conversion - match (p:Product)-[:HAS_CONTEXT]-(c:CloneID) where p.pk="xxx" return c.name as clone_id - {} - {}
2019-07-10 14:29:25.464+0000 INFO 129 ms: (planning: 89, waiting: 0) - 7 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.150.197:57560 server
/35.202.116.199:7687> conversion - match (s:Product) where s.pk="xxx" return s - {} - {}
2019-07-10 14:29:25.467+0000 INFO 0 ms: (planning: 0, waiting: 0) - 49 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.150.197:57560 server/35.202.

So the majority of the time was spent on planning

You can save on planning time by using parameters instead of literals in your queries, it will be able to use the same plan as long as only the parameters differ between calls.