Extremely slow query time with concurrent queries

david2 · July 9, 2019, 5:19pm

I have a 18GB database, on 3.5.4 enterprise.
When running simple index queries such as match (n:Product {pk:"id"}) return n, which runs at <5ms with single connections, will slow down to >1000ms when handling >10 simultaneous connections, with most of the queries state as planning.

Testing on a variety of systems, from 8 - 64 avail cores, the CPU has always been low. With no obvious memory issues.
To alleviate the issue, I upped the page and heap memory, as well as increasing thread pools and query cache, to no avail:

dbms.memory.heap.initial_size=25G
dbms.memory.heap.max_size=25G
dbms.memory.pagecache.size=30G
dbms.connector.bolt.thread_pool_min_size=50
dbms.connector.bolt.thread_pool_max_size=1000
dbms.connector.bolt.thread_pool_keep_alive=10s
dbms.query_cache_size=100000

We typically restore the graphs from file backups into docker containers for deployment with:
neo4j-admin restore --from=/data/payload --force=true

I do see in debug.log that there are lots of entries such as:

2019-07-09 17:01:24.434+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2102, gcTime=2149, gcCount=1}
2019-07-09 17:05:57.297+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2160, gcTime=2242, gcCount=1}
2019-07-09 17:10:24.668+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2189, gcTime=2272, gcCount=1}
2019-07-09 17:14:28.038+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2210, gcTime=2263, gcCount=1}
2019-07-09 17:18:20.321+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2254, gcTime=2251, gcCount=1}
2019-07-09 17:22:27.098+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=2226, gcTime=2286, gcCount=1}

This can't be normal since the DB isn't that big. Any ideas?

andrew_bowman · July 9, 2019, 6:08pm

That's strange, can you PROFILE the query and add it here? Also, can you try parameterizing your query instead of using string literals? That may help as no replanning will be needed, all similar queries with different parameters will use the same cached query plan.

Was anything else executing at the time? Any triggers setup with your db?

david2 · July 9, 2019, 6:32pm

The profile is exactly just 1 item hit, I've optimised my queries. The actual calls are more complex with a mix of queries, but the primary slow downs comes from concurrent mix calls like this:

with N4J_DRIVERS['product'].session() as sess:
    row = sess.run('match (n:Product {pk:$d}) return n', d=arg)

with N4J_DRIVERS['product'].session() as sess:
    row = sess.run('match (n:Product {pk:$d})-[:HAS_CONTEXT]-(c:CloneID) return c.name', d=arg)

I'll see a fast performance degrade from 200 items/s to 30 it/s on 128 threads. For example.
I've tuned -XX:NewRatio=1. Still same thing. I'm on GCP with 8 vCPUs, 52 GB memory btw.

andrew_bowman · July 9, 2019, 7:00pm

Do you have query logging enabled? If so, could you provide your query.log, or some excerpts from it? We're looking for timing breakdown info (you may need to enable that in your config)

andrew_bowman · July 9, 2019, 7:26pm

Is there a way you can test with 3.5.7? We had a fix in 3.5.6 regarding our query collector which runs in the background, as it was hogging memory. It may be related, if it was contributing to heap usage and triggering GCs. At the least we can rule out any bugs fixed since 3.5.4.

david2 · July 9, 2019, 9:49pm

I upgraded to 3.5.7, seems to help

david2 · July 10, 2019, 2:33pm

@andrew_bowman
The upgrade helped, but I'm still running to delays on long running jobs, querys on 3.5.7:

/35.202.116.199:7687> conversion - match (p:Product)-[:HAS_CONTEXT]-(c:CloneID) where p.pk="xxx" return c.name as clone_id - {} - {}
2019-07-10 14:29:25.074+0000 INFO 129 ms: (planning: 92, waiting: 0) - 7 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.86.97:50436 server
/35.202.116.199:7687> conversion - match (s:Product) where s.pk="xxx" return s - {} - {}
2019-07-10 14:29:25.078+0000 INFO 0 ms: (planning: 0, waiting: 0) - 91 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.86.97:50436 server/35.202.
116.199:7687> conversion - match (s:Product)<-[r:HAS_CONTEXT]->(t) where s.pk=$pk return s,r,t - {pk: 'xxx'} - {}
2019-07-10 14:29:25.313+0000 INFO 147 ms: (planning: 90, waiting: 0) - 13 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.150.197:57560 server
/35.202.116.199:7687> conversion - match (p:Product)-[:HAS_CONTEXT]-(c:CloneID) where p.pk="xxx" return c.name as clone_id - {} - {}
2019-07-10 14:29:25.464+0000 INFO 129 ms: (planning: 89, waiting: 0) - 7 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.150.197:57560 server
/35.202.116.199:7687> conversion - match (s:Product) where s.pk="xxx" return s - {} - {}
2019-07-10 14:29:25.467+0000 INFO 0 ms: (planning: 0, waiting: 0) - 49 page hits, 0 page faults - bolt-session bolt conversion neobolt/1.7.13 Python/2.7.15-final-0 (linux2) client/35.209.150.197:57560 server/35.202.

So the majority of the time was spent on planning

andrew_bowman · July 17, 2019, 4:23pm

You can save on planning time by using parameters instead of literals in your queries, it will be able to use the same plan as long as only the parameters differ between calls.

trinhl · October 22, 2020, 1:42pm

Hi guys, I sill got the same problem although I upgraded to 4.1.0
Many concurrent connections (i.e 500 concurrent connections) requested to Neo4j, the slower performance I got.
I had tried some solutions. Increase heap size, using parameter with query.
And my queries are cypher projection query in order to create subgraph catalog.
So what should I try?

Topic		Replies	Views
Read / write performance dramatically degrades with concurrent queries Neo4j Graph Platform performance , cypher	15	9633	April 12, 2023
Read / write performance dramatically degrades with concurrent queries Neo4j Graph Platform migrated	1	176	July 17, 2022
Diagnosing long running queries Cypher performance , cypher	7	2847	January 8, 2020
Database becomes non-responsive in production Neo4j Graph Platform performance	12	7657	January 6, 2020
Memory settings for big databases Server performance	40	3235	August 1, 2019

Extremely slow query time with concurrent queries

Related topics