Hi, I'm using Neo4j Community and have noticed that performance degrades significantly when there are multiple parallel requests over Bolt.
Setup:
Python 3.10
Neo4j driver 5.13; using async
Neo4j community 5.13
Imported 20 GB of data: 50M nodes, 70M relationships
Running on a 50 GB server, memory configured per the recommendations
Bolt thread pool: 500 min, 2000 max
I have a use case where I want to run searches against this graph. In essence, I have one query that I want to execute with different parameters and expose through some API. It uses parameters, UNWIND when there are multiple terms to search for (limited to 15), and so on; all the best practices I could find in the docs. Disk I/O is minimal after warmup.
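For reference, graph_search looks roughly like this (heavily simplified; the label, property, and Cypher below are placeholders, not my real query):

from neo4j import AsyncGraphDatabase, RoutingControl

driver = AsyncGraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

async def graph_search(terms):
    # terms: list of up to 15 search terms
    records, _, _ = await driver.execute_query(
        """
        UNWIND $terms AS term
        MATCH (n:Entity)           // placeholder label
        WHERE n.name CONTAINS term // placeholder predicate
        RETURN n
        LIMIT 100
        """,
        terms=terms,
        routing_=RoutingControl.READ,
        database_="neo4j",
    )
    return records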
When I do the following from two different Python REPL instances, performance degrades and response times more than double:
await asyncio.gather(*[graph_search(terms) for _ in range(10)]) # Just to simulate some load
When I increase 10 to 20 it gets even worse. What am I doing wrong, and how can I improve this? In a production environment I would want this to handle ~1M queries/hour.
So I've conducted some experiments and will present them here to illustrate the limitations of async Python. It's important to understand that async Python is single-threaded, so there is no real parallelism. The only way async can speed up your program is if there are sufficiently large waiting times (network latency, or other processes like the DBMS taking time to process a request).
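To illustrate the point with a toy example (nothing driver-specific; asyncio.sleep stands in for waiting on the network, time.sleep for work that keeps the event loop busy):

import asyncio
import time

async def waiting_task():
    await asyncio.sleep(1)   # waiting: the event loop can overlap these

async def busy_task():
    time.sleep(1)            # blocking: these run strictly one after another

async def main():
    t = time.perf_counter()
    await asyncio.gather(*(waiting_task() for _ in range(10)))
    print("10 waiting tasks:", round(time.perf_counter() - t, 1), "s")  # ~1 s

    t = time.perf_counter()
    await asyncio.gather(*(busy_task() for _ in range(10)))
    print("10 busy tasks:   ", round(time.perf_counter() - t, 1), "s")  # ~10 s

asyncio.run(main())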
My test setup is as follows:
Neo4j 5.12.0 (enterprise) running in Docker on localhost
toxiproxy 2.7.0 running in Docker on localhost to introduce artificial network latency
Python 3.11.0 running the 5.0 branch (nightly) of neo4j-python-driver
First let's see a single-threaded single-process example gradually ramping up the number of concurrent tasks with different levels of network latency.
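The benchmark is roughly like this (condensed; the URI, auth, query, and concurrency levels are placeholders for what I actually used):

import asyncio
import time
from neo4j import AsyncGraphDatabase

URI = "bolt://localhost:8474"  # toxiproxy listener forwarding to Neo4j
AUTH = ("neo4j", "pass")

async def work(driver):
    async with driver.session(database="neo4j") as session:
        result = await session.run("RETURN 1 AS n")
        await result.consume()

async def bench(concurrency):
    async with AsyncGraphDatabase.driver(URI, auth=AUTH) as driver:
        start = time.perf_counter()
        # Note: the driver's connection pool (default max 100) caps how many
        # of these tasks can actually hold a connection at the same time.
        await asyncio.gather(*(work(driver) for _ in range(concurrency)))
        return time.perf_counter() - start

async def main():
    for concurrency in (1, 2, 5, 10, 20, 50, 100, 200):
        print(concurrency, await bench(concurrency))

asyncio.run(main())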
As you can see, with low latency, async doesn't help much. And all experiments show that too much concurrency actually harms performance, because the tasks start to fight for resources (network connections, synchronization primitives, etc.). Where exactly that turning point occurs depends on the latency (network latency and query complexity, i.e., the time it takes the DBMS to compute the results). Also note that this example uses a very simple query with very little data to be moved by the driver. As the query parameters or the result set get bigger, performance will degrade further, because each task blocks the event loop for longer while processing the data.
Next up, I modified the experiment slightly by testing with multiple processes. Note that my machine has 16 cores. So assuming the driver and the server both use a comparable amount of resources (ignoring toxiproxy and other programs I run like my IDE in this equation), we'd see some plateauing at around 8 parallel processes driving the test (8 driver processes saturating half my CPU and 8 DBMS threads saturating the other half).
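Again roughly sketched (the per-process workload and the values are placeholders):

import asyncio
import multiprocessing as mp
import time
from neo4j import AsyncGraphDatabase

URI = "bolt://localhost:8474"
AUTH = ("neo4j", "pass")
TASKS_PER_PROCESS = 100

async def work(driver):
    async with driver.session(database="neo4j") as session:
        result = await session.run("RETURN 1 AS n")
        await result.consume()

def run_process():
    async def main():
        async with AsyncGraphDatabase.driver(URI, auth=AUTH) as driver:
            await asyncio.gather(*(work(driver) for _ in range(TASKS_PER_PROCESS)))
    asyncio.run(main())

if __name__ == "__main__":
    for n_processes in range(1, 17):
        procs = [mp.Process(target=run_process) for _ in range(n_processes)]
        start = time.perf_counter()
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        print(n_processes, time.perf_counter() - start)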
This is looking pretty much as expected. The time is almost constant regardless of the number of parallel processes (note that the amount of work done is not constant but linear in the number of processes). This constant behavior stops at 6 processes. There's a jump from 6 to 7 and another from 8 to 9. From there it's linear-ish, as I've fully saturated all of my CPU's cores.
I could imagine that you're experiencing the slow-down because of the kind of query you're sending. If the queries can't be parallelized well, you also won't see much benefit from running more coroutines or client processes.
But I'm not an expert on query optimization, nor do I know enough about what exactly your query and data look like.