Read / write performance dramatically degrades with concurrent queries

  • Neo4j version: 3.5.13 enterprise
  • Driver: Java Bolt driver 1.7.5
  • Server: Single node - 8vCPUs - 64 GB RAM - Ubuntu 18.04
  • Heap: 24 GB
  • Pagecache: 28 GB

I am noticing that the performance of my reads / writes degrades very quickly when running concurrent queries over an extended period of time. I have a single-node Neo4j server and 2 workers that run on different servers. Each worker runs 10 threads and processes a steady stream of transactions; ideally I would like to have more workers. For each transaction, a thread runs a few Cypher queries, a mix of reads and writes. The database size is currently ~3.5 GB (4M nodes, 10M relationships).

I have the slow query log enabled (threshold 100ms) and notice some read queries start showing in there pretty quickly. The same queries go from 150-200ms to more than 1s in the slow query log over a short period of time (a few minutes), and then stay in the 1.5s - 2s range. If I increase the number of worker threads this goes up even higher.

One of the example queries is the following:

match (ps:Persona)<-[:BELONGS_TO]-(n:Card {card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=93159 return id(n) as node_id

Here is the above query's profile:

Could using where id(ps)=x be the culprit here? I am not sure why that would be an issue performance-wise, though. My goal is a throughput of around 500-600 req/s, but I am only reaching 8-10 req/s with the current setup.

There's for sure something wrong here. Getting 8-10 req/s out of this configuration is close to nothing. A target throughput of 500-600 req/s should be easy to reach.

id(x) should not cause any issues. It's an O(1) operation that basically performs a seek on your I/O subsystem.

Couple of questions to understand the issue better:

  • What's the cardinality? How many Cards does a Persona have? Are there "superpersonas" with a huge number of cards?
  • How many properties do you store on your Card nodes?
  • Can you check with htop (or similar) how busy your CPUs get while you increase the load?
  • Did you make any config changes to the server-side Bolt thread pool sizes?
  • Can you rule out network latency as an issue? Are your clients in the same physical location as your server?
  • Do you have any heap-intensive operations (large writes, heavy read operations like ORDER BY, DISTINCT, ...)? If not, I think a heap of 24 GB might be too large.

On a different note: if you're on AWS and using Ubuntu, consider enabling ENA; see Enable enhanced networking with the Elastic Network Adapter (ENA) on Linux instances - Amazon Elastic Compute Cloud. For the given issue I suspect this won't make a big difference, but it's worth a try.

  • On average a Persona has 1-2 Cards. There are some anomalies where a Persona has >50 Cards, but that is highly uncommon. I see similar slowness on other comparable queries with other labels in place of Card, and the cardinality on those is even lower than for Cards.
  • A Card only has 2 properties
  • The CPU usage stays between 25-40% consistently
  • I increased the minimum pool size hoping it would help, but I had problems with the default configuration too. Right now I have it as:
    dbms.connector.bolt.thread_pool_min_size=20, dbms.connector.bolt.thread_pool_max_size=400, dbms.connector.bolt.thread_pool_keep_alive=10m
  • The clients are in the same zone as the server. I am on DigitalOcean and the clients connect to the server using private IP. Also, since I see the queries slowing down in the slow query log (the same queries have their execution times go up), network latency might not be an issue. On that thought, can disk I/O be an issue instead?
  • My reads and writes are quite simple. The most work a read does would be a collect(n) on around 10-15 nodes with not many properties. I can try lowering the heap size and see if that helps (see the config sketch after this list).
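
For reference, a smaller fixed heap would be set in neo4j.conf roughly like this (a sketch; the 8g value is purely illustrative, not a recommendation; tune it against the GC logs):

    # Illustrative sizes only; keep initial and max equal to avoid resize pauses
    dbms.memory.heap.initial_size=8g
    dbms.memory.heap.max_size=8g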

Are there any specific things I should be looking for in the logs? I have GC logs enabled too. Another thing I am noticing is that the throughput starts high, keeps going down for a while, then goes back up for a short period, and then starts going down again (this pattern basically repeats a lot, see chart below):

Here is a small sample from my query.log. It looks like the queries are spending a lot of time in the planning phase, but the time taken apart from planning also seems quite high. I would expect these queries to take no more than a few ms:

2019-12-04 20:56:17.368+0000 INFO  1556 ms: (planning: 1095, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51654	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343875 return id(n) as node_id - {}
2019-12-04 20:56:17.564+0000 INFO  1667 ms: (planning: 1282, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42346	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343741 return id(n) as node_id - {}
2019-12-04 20:56:17.952+0000 INFO  1552 ms: (planning: 1246, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51648	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343896 return id(n) as node_id - {}
2019-12-04 20:56:17.967+0000 INFO  1686 ms: (planning: 1319, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42326	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343838 return id(n) as node_id - {}
2019-12-04 20:56:17.967+0000 INFO  1777 ms: (planning: 1385, waiting: 0) - 32 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42334	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:IP{ip_address: {ip_address}}) where id(ps)=5343939 return id(n) as node_id - {}
2019-12-04 20:56:18.067+0000 INFO  1857 ms: (planning: 1467, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51650	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343921 return id(n) as node_id - {}
2019-12-04 20:56:18.091+0000 INFO  1393 ms: (planning: 1085, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42348	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5344009 return id(n) as node_id - {}
2019-12-04 20:56:18.094+0000 INFO  1231 ms: (planning: 953, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51662	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5343862 return id(n) as node_id - {}
2019-12-04 20:56:18.286+0000 INFO  1652 ms: (planning: 1018, waiting: 0) - 14 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42344	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5333585 return id(n) as node_id - {}
2019-12-04 20:56:18.288+0000 INFO  1687 ms: (planning: 1245, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51652	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343980 return id(n) as node_id - {}
2019-12-04 20:56:18.353+0000 INFO  1762 ms: (planning: 1376, waiting: 0) - 26 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51664	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:IP{ip_address: {ip_address}}) where id(ps)=5343781 return id(n) as node_id - {}
2019-12-04 20:56:18.421+0000 INFO  1727 ms: (planning: 1418, waiting: 0) - 48 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42340	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5344063 return id(n) as node_id - {}
2019-12-04 20:56:18.432+0000 INFO  1840 ms: (planning: 1211, waiting: 0) - 16 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42342	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:ConsumerInternalId{consumer_internal_id: {consumer_internal_id},client_id: {client_id}}) where id(ps)=5276381 return id(n) as node_id - {}
2019-12-04 20:56:18.456+0000 INFO  1746 ms: (planning: 1293, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42338	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5344098 return id(n) as node_id - {}
2019-12-04 20:56:18.474+0000 INFO  1870 ms: (planning: 1437, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42330	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343702 return id(n) as node_id - {}
2019-12-04 20:56:18.478+0000 INFO  1768 ms: (planning: 1240, waiting: 0) - 8 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42328	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343999 return id(n) as node_id - {}
2019-12-04 20:56:18.770+0000 INFO  1676 ms: (planning: 1293, waiting: 0) - 9 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51656	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343929 return id(n) as node_id - {}
2019-12-04 20:56:18.831+0000 INFO  1577 ms: (planning: 1133, waiting: 0) - 12 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51660	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5276381 return id(n) as node_id - {}
2019-12-04 20:56:19.099+0000 INFO  1795 ms: (planning: 1336, waiting: 0) - 8 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51646	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=2386925 return id(n) as node_id - {}
2019-12-04 20:56:19.144+0000 INFO  1715 ms: (planning: 1286, waiting: 0) - 8 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51658	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5343975 return id(n) as node_id - {}
2019-12-04 20:56:19.227+0000 INFO  1618 ms: (planning: 1287, waiting: 0) - 6 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.85.52:42346	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5344156 return id(n) as node_id - {}
2019-12-04 20:56:19.266+0000 INFO  1849 ms: (planning: 1508, waiting: 0) - 25 page hits, 0 page faults - bolt-session	bolt	neo4j	neo4j-java/1.7.5-d56154a7e96ff71615b894ee9a15d08c370691e6		client/10.132.44.126:51654	server/10.132.139.216:7687>	neo4j - match (ps:Persona)<-[:BELONGS_TO]-(n:Card{card_last_4: {card_last_4},card_bin: {card_bin}}) where id(ps)=5211232 return id(n) as node_id - {}

You really should be using parameters in your queries.

Currently, since the values you're using in the queries are hardcoded, each one is treated as a distinct query in the query cache, so you aren't able to take advantage of the cache and the query needs to be replanned every time.

If you parameterize (with something like where id(ps)={psId}), then it looks like the same query no matter how the parameter values change, allowing you to skip planning and reuse the cached plan.
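
For example, with the 1.7 Java driver a fully parameterized version of the Card query might look like this (a minimal sketch; the credentials and the literal values passed in are illustrative):

    import org.neo4j.driver.v1.*;

    // Create the driver once and share it across threads (illustrative URI/credentials).
    Driver driver = GraphDatabase.driver("bolt://10.132.139.216:7687",
            AuthTokens.basic("neo4j", "password"));

    try (Session session = driver.session()) {
        // Every value, including the node id, is a parameter, so the plan is
        // compiled once and then reused from the query cache on every call.
        StatementResult result = session.run(
                "MATCH (ps:Persona)<-[:BELONGS_TO]-(n:Card {card_last_4: $card_last_4, card_bin: $card_bin}) "
                        + "WHERE id(ps) = $psId RETURN id(n) AS node_id",
                Values.parameters(
                        "psId", 93159L,          // illustrative values
                        "card_last_4", "1234",
                        "card_bin", "411111"));
        while (result.hasNext()) {
            long nodeId = result.next().get("node_id").asLong();
        }
    }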

Yes, you are right!! I can't believe I was missing that. Every other aspect of my queries is parameterized; for some reason I totally forgot to parameterize the id part. My bad. Thanks for pointing that out.

Also, setting aside the planning time, it looks like the queries still spend ~300-400ms after the planning phase (not sure if there are other things to consider there too). Is that normal? I would expect them to run a lot faster, since the profile shows it's an index lookup and it's almost always a pagecache hit.

Can you show an expanded query plan for one of the queries?

Here is the query plan for one of the queries that shows up most often in the slow query log. The other queries are very similar to this one; they just use different labels and (indexed) properties.

Can you please switch off detailed query logging? Maybe set a threshold of 1 sec and make sure you've switched off all the detailed options from https://neo4j.com/docs/operations-manual/current/monitoring/logging/query-logging/ - those do impact performance.
Are your CPUs saturated once you run the test with these changes? It would also be interesting to know how much time you spend in "IO wait".
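
For reference, in 3.5 that would look roughly like this in neo4j.conf (a sketch; the threshold value is illustrative):

    dbms.logs.query.enabled=true
    dbms.logs.query.threshold=1s
    # Switch off the detailed per-query options, which add overhead
    dbms.logs.query.parameter_logging_enabled=false
    dbms.logs.query.time_logging_enabled=false
    dbms.logs.query.allocation_logging_enabled=false
    dbms.logs.query.page_logging_enabled=false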

Another thing you can try as an experiment:

  • limit the Bolt thread pool to roughly the number of CPU cores: dbms.connector.bolt.thread_pool_max_size=10
  • and establish a queue for buffering (note this is an experimental feature):
    dbms.connector.bolt.unsupported_thread_pool_queue_size=10000

That should remove frequent CPU context switches (which are overhead) and increase throughput. If you see client-side Bolt errors, consider increasing the queue size parameter.

Have you tried another cloud provider for comparison?

I had terrible performance on DigitalOcean; evidently they oversubscribed certain instances. After the latest outage, I moved to GKE on Google Cloud, which solved both performance and reliability.

Your setup looks pretty normal to me and should run way faster than that, so I would double-check whether a quick test on GCP or AWS runs better or worse.

I very recently had a very similar encounter with extremely slow concurrent write operations.
Apparently, Neo4j has a hard time with mixed read/write concurrent operations. In my case, 7 concurrent clients degraded performance by over 100%. Beyond that, Neo4j was effectively dead past 10 concurrently writing clients, with latency & throughput reaching outrageous levels. However, a few things helped me make things better:

  1. Centralize write operations through an API gateway, with one DB client handling all DB operations.
  2. Batch write operations(!) - see the sketch after this list.
  3. Make everything sequential that does not have to run in parallel.
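
A common way to batch writes with the Java driver (a hedged sketch, not the poster's actual code; the labels, properties, and helper method are hypothetical) is to send many rows through a single UNWIND query in one transaction:

    import org.neo4j.driver.v1.*;
    import java.util.List;
    import java.util.Map;

    // rows is built by the caller, e.g. [{card_bin: "411111", card_last_4: "1234"}, ...]
    void writeBatch(Driver driver, List<Map<String, Object>> rows) {
        try (Session session = driver.session()) {
            session.writeTransaction(tx -> {
                // One transaction and one cached plan for the whole batch,
                // instead of one tiny transaction per row.
                tx.run("UNWIND $rows AS row "
                                + "MERGE (c:Card {card_bin: row.card_bin, card_last_4: row.card_last_4})",
                        Values.parameters("rows", rows));
                return null;
            });
        }
    }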

Fixing this issue was a very sobering exercise, but it occurs to me that the Java concurrency underlying Neo4j seems to work much better with batches & queues. There is an article discussing the details of tuning Neo4j for concurrent write operations, but the very existence of that article left me wondering why I actually need to batch write operations on the application side.

It is particularly hard to swallow when recognizing that DGraph easily handles 500 concurrent connections, each pushing thousands of write operations, while Neo4j essentially dies much earlier.

Is there anyone from the Neo4j engineering team around who is willing to share some insight into how to handle hundreds of concurrent client connections, each carrying thousands of write operations?

Also, why do we need to batch write operations in the first place to get reasonable performance?

Thanks so much for all the ideas, guys. I think the main bottleneck here was the query cache. I parameterized all of my queries, disabled detailed query logging, and put a higher threshold on the slow query log (500 ms). With that I am able to reach a throughput of well above 1000 req/s. I hadn't realized how important the query cache is, and how important it is to parameterize queries.

Marvin,

It would be great to understand more details of your issue. Would you mind posting it as a new topic with more details: Neo4j version, setup & configuration, the type of your read/write queries, PROFILE output for each, the query log, and especially the output of call dbms.listQueries(), which can also show potential contention.

Thank you!

@michael.hunger sorry, no way.

The problems became so bad that we are already migrating everything away from Neo4j. For us, Neo4j was a complete waste of time and not worth fixing.

Hello,

I have the same problem with Neo4j: many errors, many side effects and low performance...
May I ask about alternatives to Neo4j? I have been looking for a comparable graph database, but I have not found a good alternative...

Best regards

@marvin-hansen Migrated to what? And what specific issues did you experience with Neo4j?