Hey all,
I was testing the speed of neo4j-driver (1.6.2 and 1.7.1) and py2neo (4.1.3), and I found that the simple HTTP requests I was already doing are 2-5 times faster for medium-sized queries and up. I'll take you through what I did, so we can hopefully figure out what's going on here and when it makes sense to use the libraries.
Now that I hopefully have your attention, let's take a step back for some background. When I started working with neo4j, I learned to query it by sending JSON HTTP requests to the API at the hostname:7474/db/data/transaction/commit endpoint. I like knowing the guts of what I'm dealing with, so processing the raw JSON responses works well for me, and over time I've added my own thin wrapper around the python requests library for some quality-of-life improvements.
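For reference, such a wrapper can be quite small. The sketch below is not my actual wrapper (and it uses the stdlib urllib instead of requests so it's self-contained), but it shows the shape of the payload and response that the transactional HTTP endpoint works with:

```python
import json
import urllib.request

# Adjust the hostname/port for your own server.
COMMIT_URL = "http://localhost:7474/db/data/transaction/commit"

def build_payload(statement, parameters=None):
    """Build the JSON body for the transactional HTTP endpoint."""
    stmt = {"statement": statement}
    if parameters:
        stmt["parameters"] = parameters
    return {"statements": [stmt]}

def run_query(statement, parameters=None, url=COMMIT_URL):
    """POST a single statement and return the decoded JSON response."""
    body = json.dumps(build_payload(statement, parameters)).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def rows(response):
    """Pull the bare result rows out of the nested response structure."""
    return [datum["row"] for datum in response["results"][0]["data"]]
```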
I saw that some colleagues are using py2neo, so I wondered whether using a driver library would make sense for me. For one, the drivers provide interactions through bolt, which sounds like it should be more efficient than sending raw JSON. The results are also a bit nicer to work with (I got a bit bored of writing for datum in response['results'][0]['data']). The downside is that it seems a bit awkward to execute many queries in a single transaction (e.g. one type of query but with varying parameters).
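Over the HTTP API, by contrast, batching many queries into one transaction is just a longer statements list in a single commit payload. A hypothetical helper (names are illustrative, not part of any library):

```python
def build_batch_payload(statement, parameter_sets):
    """One transaction, many executions of the same (cacheable) statement.

    The transactional HTTP endpoint accepts a list of statements in one
    POST body; here we repeat a single parameterized statement with
    different parameter sets.
    """
    return {
        "statements": [
            {"statement": statement, "parameters": params}
            for params in parameter_sets
        ]
    }
```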
Setup
So I set up a small test bench. In python (3.7) I created a script that iterates through the different drivers and test scenarios. It executes the queries 10 times and averages the result (while also showing the time for each individual execution of a test scenario). The graph for these tests is a copy of our 'production' database with on the order of millions of nodes, running neo4j 3.0.6 (outdated, yes, but realistic for my use case). As driver libraries I tested py2neo 4.1.3, neo4j-driver 1.6.2 (which comes with py2neo) and neo4j-driver 1.7.1.
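The harness itself is nothing fancy. Roughly like this (function and parameter names are illustrative, not my exact script):

```python
import statistics
import time

def benchmark(fn, repetitions=10):
    """Time fn over several repetitions.

    Returns (per-run times in ms, average in ms), so both the individual
    executions and the mean can be reported.
    """
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        fn()
        times.append((time.perf_counter() - start) * 1000.0)
    return times, statistics.mean(times)
```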
I created 3 scenarios:
- retrieve the total number of nodes in the graph. This is a simple 'ping' to see how fast a single query with a single (pre-calculated) answer returns a result.
- retrieve nodes with a specific label, run with a limit of either 10, 100, 1k or 10k nodes in the response.
- retrieve nodes by an indexed property. The property is passed as a parameter (so the query can be cached), executed as transactions containing either 10, 100 or 1k queries.
Note: I figured that the difference between reading and writing would be mostly down to neo4j, and not down to the library that I used to execute the commands, so I didn't go through the trouble of generating data to push.
Note 2: I chose the upper limits for scenarios 2 and 3 based on how long I had to wait for the 10 repetitions. I could've gone orders of magnitude larger on each scenario, but I didn't feel like waiting minutes.
Results
Let me start by saying that I know that the order in which drivers are tested probably matters because of caching. I indeed see the response times dropping after the first query. This only seems to matter a few ms though, which isn't so important when we look at the queries that return more than 10 results.
Scenario 1
Average request time in milliseconds after 10 repetitions.
HTTP requests wrapper: 2.2ms
py2neo (bolt): 0.5ms
neo4j-driver (1.6.2): 1.0ms
neo4j-driver (1.7.1): 0.7ms
As you can see, the results are close together, but there's no question that py2neo and neo4j-driver are consistently faster than going through the HTTP requests wrapper here. For many separate single queries that would add up, but for a handful of queries the difference isn't noticeable to a human.
Scenario 2
Average request time in milliseconds, after 10 repetitions. Response times given in order for response limits of 10, 100, 1k and 10k nodes.
HTTP requests wrapper: 3.0, 4.8, 14.3, 95.1
py2neo (bolt): 1.6, 5.6, 42.9, 435.5
neo4j-driver (1.6.2): 2.3, 5.2, 29.0, 283.2
neo4j-driver (1.7.1): 2.5, 4.5, 30.8, 296.3
This came as a huge surprise to me: the simple HTTP requests are not only a lot faster than both driver libraries, but also faster than py2neo running over HTTP (results not shown, as it was a bit slower than py2neo over bolt). The difference between the simple requests and neo4j-driver is a factor of 2.5-3x, and py2neo is 4.5x slower. The driver libraries seem to scale linearly above 1k nodes, whereas the requests scale better than linearly.
Scenario 3
Average request time in milliseconds, after 10 repetitions. Response times given in order for 10, 100 and 1k queries in 1 transaction.
HTTP requests wrapper: 2.8, 7.5, 44.0
py2neo (bolt): 2.8, 26.6, 243.0
neo4j-driver (1.6.2): 2.9, 14.0, 147.3
neo4j-driver (1.7.1): 3.3, 17.6, 170.8
Again, quite a big difference between the simple HTTP requests and the libraries. The neo4j-drivers are around 3.5x slower than the simple requests, and py2neo is a whopping 5.5x slower. Again, the driver libraries seem to scale linearly with the size of the test.
Discussion
My first instinct was that the driver libraries present the results in a more usable way (a list of dicts) than the raw JSON response from the HTTP API, and that this is where the time goes. So I built some additional logic on top of my wrapper to see how costly this transformation is. The result: producing the nice list of dictionaries adds only 5-10% to the request time.
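The transformation I measured is essentially zipping the column names onto every row, something like:

```python
def as_dicts(response):
    """Turn the raw transactional-endpoint JSON into a list of dicts.

    Each result row is paired with the shared column names, mimicking
    the record-like access the driver libraries offer.
    """
    result = response["results"][0]
    columns = result["columns"]
    return [dict(zip(columns, datum["row"])) for datum in result["data"]]
```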
So, what is going on here? I imagine that the version of neo4j might play a role. Also the driver libraries might do some additional fancy processing, which can be nice but also costly if you just want to retrieve a lot of nodes and their properties.