Neo4j Performance on an ARM64v8 System

Hello, I’ve been doing some hands-on research into candidate databases for embedded development at our company. The target system has an ARM64v8 architecture.

Neo4j was the first database I tested. Storing a “configuration file” takes ~2600 entries/calls, and I’m observing an execution time of around 30s. This is ridiculous. Something must be going wrong, right?
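For reference, the write path looks roughly like this (labels, property names, and credentials below are placeholders for the real setup): one CREATE per entry through the official Python driver, each as its own auto-committed write.

```python
from neo4j import GraphDatabase

# Placeholder data standing in for the ~2600 parsed configuration entries.
entries = [{"id": i, "value": f"v{i}"} for i in range(2600)]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for entry in entries:
        # One auto-committed write per entry: ~2600 round trips in total.
        session.run(
            "CREATE (e:Entry {id: $id, value: $value})",
            id=entry["id"], value=entry["value"],
        )
driver.close()
```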

I compared other graph databases.

ArangoDB, written in C++, takes around 1.5s to store everything. So it’s not the number of entries, nor the fact that it’s a graph database. Maybe the language or the VM?

OrientDB, written in Java, takes around 2s. Reasonable, so it’s probably not the JVM on ARM either.

I tried running Neo4j both from the binaries installed under /opt and from the Docker image; I’m getting the same performance either way.

I wondered if it was something in the driver or API. I’ve been testing most of these in Python, so first I tried the REST API instead of the Python driver. Same result. And it’s equally slow from other languages (Java, C++).
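In case it matters, the REST-side test went through the transactional HTTP endpoint, roughly like this (stock local endpoint and credentials assumed; labels and keys are placeholders):

```python
import requests

# Placeholder data standing in for the parsed configuration entries.
entries = [{"id": i, "value": f"v{i}"} for i in range(2600)]

url = "http://localhost:7474/db/neo4j/tx/commit"  # default local HTTP endpoint
for entry in entries:
    payload = {"statements": [{
        "statement": "CREATE (e:Entry {id: $id, value: $value})",
        "parameters": {"id": entry["id"], "value": entry["value"]},
    }]}
    # One HTTP request per entry, mirroring the driver-based test.
    r = requests.post(url, json=payload, auth=("neo4j", "password"), timeout=10)
    r.raise_for_status()
```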

I also decided to try some other databases in the meantime: SQLite and PostgreSQL are both sub-second, but the complexity there lies in creating the relational tables; MongoDB failed because the ARM CPU is missing a feature it needs; Dgraph and JanusGraph both just didn’t want to work.

So I circled back to Neo4j a few times, puzzled at what could cause this sort of slowdown. The system has plenty of free memory, disk space, and CPU resources.

From what I’ve read about optimizing queries, I’m doing what’s expected for node and edge creation, and it’s the same approach I use with the other graph databases. And if I run the same workload locally on my PC, it takes 1.6s.

I’ve searched around on the forums here and on GitHub to see if there were any known issues with ARM, but I’m not finding anything recent (I see that ARM64 support was experimental some years ago, but it looks like it’s out of that phase now).

Any other ideas I can test and prod with? Or am I missing some notice somewhere saying that ARM really is expected to be this slow?

Hmm, good question. We’ve been supporting ARM for a few years now, and this is the first time someone has reported such a problem.

Our sandbox environments have been running on Graviton for 2+ years now, and most of our devs have ARM Macs as well, so it would have shown up earlier.

I wonder if it’s IOPS-related? Do you have more detailed information about the specs of the system you’re using?
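One quick way to rule storage in or out: Neo4j forces its transaction log to disk on commit, so slow sync writes on embedded flash would look exactly like slow unbatched inserts. A rough sketch (the path is a placeholder; point it at the volume holding the Neo4j data directory):

```python
import os
import time

path = "/var/lib/neo4j/data/fsync_test.bin"  # placeholder; use the Neo4j data volume
fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o600)

n = 100
start = time.perf_counter()
for _ in range(n):
    os.write(fd, b"x" * 4096)  # small write, then force it to disk
    os.fsync(fd)
elapsed = time.perf_counter() - start

os.close(fd)
os.remove(path)
print(f"avg fsync latency: {elapsed / n * 1000:.2f} ms")
```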

Is there any chance you could collect some stats from the system using perf or async-profiler while doing the inserts?
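For async-profiler, something like this would do (hedged sketch: assumes the profiler release is unpacked on the target and that you’ve found the Neo4j JVM’s pid):

```python
import subprocess

neo4j_pid = 1234  # placeholder; e.g. from `pgrep -f neo4j`
subprocess.run([
    "./profiler.sh",                 # launcher shipped with async-profiler
    "-e", "cpu",                     # sample CPU
    "-d", "60",                      # profile for 60 seconds while the inserts run
    "-f", "/tmp/neo4j-flame.html",   # write a flame graph here
    str(neo4j_pid),
])
```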

Do you run Neo4j with the default config settings? Which JDK are you using? 25?

2600 calls (which you can probably batch into a single query with a list-of-dicts parameter, as in the sketch below) should take sub-second to write on most systems.
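A minimal sketch of what I mean (labels and keys are placeholders for your schema):

```python
from neo4j import GraphDatabase

# All ~2600 entries go over the wire once, as a single list-of-dicts parameter.
entries = [{"id": i, "value": f"v{i}"} for i in range(2600)]

query = """
UNWIND $entries AS entry
CREATE (e:Entry {id: entry.id, value: entry.value})
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    session.run(query, entries=entries).consume()
driver.close()
```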

I’m testing this on a TI Jacinto J721e (Cortex-A72) running one of TI’s latest recommended Arago Linux images (Scarthgap TI-SDK 11.01.05).

I’ve been trying different things to characterize this issue better. The time scales with what’s already in the database (zero configs stored: ~30s; one config stored: ~60s; two: ~90s), and not in a very nice way; the loop I used to measure this is sketched below. Batching the commands does make things faster (~10s when done through the REST API), but that still seems too slow.
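For reference, the numbers above come from a loop like this (run_one_config is a hypothetical name for my harness that writes one full configuration of ~2600 entries):

```python
import time

for stored in range(3):  # 0, 1, 2 configurations already in the database
    start = time.perf_counter()
    run_one_config()  # hypothetical harness: writes one configuration (~2600 entries)
    print(f"{stored} stored -> insert took {time.perf_counter() - start:.1f}s")
```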

Yes, I am doing all of this with the default configuration settings. Outside the container I’m using Java 21, and inside the container it seems to be Java 21 as well. Maybe that’s one thing I can try: getting Java 25 working in the embedded environment.

I will work on getting more of this profiled and will update here as I gather new information.

Edit: Update with Java 25: it is slightly faster (9.2s batched instead of 10s), but there’s still something else slowing it down.

Okay, this is going to get weird. I have async-profiler installed now, and with it running, I’m getting sub-second results (between 0.5s and 0.9s per run) for batched processing. Unbatched, it’s 4.8s. I’m a little confused about what async-profiler did to change this. :sweat_smile: Fair to say, it didn’t help identify what was eating that extra 25 seconds.

It could be something with memory allocation; I’ve seen this sort of behavior with homemade allocators and garbage collectors in C++, but I’m fairly new to the Java landscape.

But up until the point where I ran async-profiler, I could reproducibly get about 30s runtime with the unbatched dataset. Afterwards, it dropped dramatically.

I also just restarted the system, and even then it’s an improvement over before: 12 seconds for the unbatched data instead of the original 30s. Then I run async-profiler, and the same thing runs in 4 seconds.

Edit: Looking at the profile graphs, I’m wondering if the CompileBroker/C2 compiler threads are what was originally slowing things down. The JIT does improve over time, as observed in later executions, as it optimizes more and more. I just don’t get what async-profiler did to make it optimize better; one way to check is sketched below.
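If it is JIT warm-up, one way I can think of to confirm it is watching compiler activity on the Neo4j JVM while the workload runs (hedged sketch; the pid is a placeholder):

```python
import subprocess

neo4j_pid = 1234  # placeholder; the Neo4j JVM's pid
# Print JIT compilation statistics every second, ten times, while inserting.
subprocess.run(["jstat", "-compiler", str(neo4j_pid), "1000", "10"])
```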

So, update: using APOC over the REST API to import the JSON data takes only 1.2 seconds: Load JSON - APOC Core Documentation
The JSON file contains the same data I’ve been trying to commit with Cypher, so that’s another data point in all that is happening. It’s a workaround for the moment, but not a solution.
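For reference, the call looks roughly like this (the JSON shape, label, and key are hypothetical placeholders for my data; loading from a file:// URL also requires apoc.import.file.enabled=true in the server config):

```python
import requests

query = """
CALL apoc.load.json($url) YIELD value
UNWIND value.entries AS entry   // hypothetical JSON shape
MERGE (e:Entry {id: entry.id})  // hypothetical label and key
SET e += entry.properties
"""
payload = {"statements": [{
    "statement": query,
    "parameters": {"url": "file:///config.json"},  # placeholder file name
}]}
r = requests.post("http://localhost:7474/db/neo4j/tx/commit",
                  json=payload, auth=("neo4j", "password"), timeout=60)
r.raise_for_status()
# Note: Cypher/APOC errors come back in the JSON body, not as HTTP status codes.
print(r.json().get("errors"))
```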

Edit: Scratch that. The call was 1.2s, but it was returning an error. After actually installing the APOC plug-in, I’m getting a JSON import of… 13s. Still not very helpful. (And I’m still not sure whether the import even succeeded…)