Neo4j 5.18: 100% on all CPUs followed by OOM

Hi!

I'm facing another issue after upgrading from 4.4.32 to 5.18.1.

One of the scripts I'm using to import and MERGE data ends up crashing and leaves the Neo4j database unresponsive.

Basically, at some point Neo4j starts using 100% of the CPU (close to 100% on every available core).

At that point I tried to execute some queries in the Neo4j Browser to understand what was going on, but the database wasn't responding even to a simple SHOW INDEXES.

Note: the logs didn't indicate any OOM error at this stage.

Furthermore, my script executes everything inside a single transaction, and as far as I understand, transactions still run on a single core, so I don't get how it ends up using every available core at 100%.

api-1    | 🚀 Server ready at http://localhost:4000/
neo4j-1  | Exception in thread "neo4j.Scheduler-1" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | 
neo4j-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp168703427-289"
neo4j-1  | 
neo4j-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "neo4j.Scheduler-1"
neo4j-1  | 
neo4j-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "qtp168703427-287"
neo4j-1  | 
neo4j-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "pool-33-thread-1"
neo4j-1  | 
neo4j-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "JNA Cleaner"
neo4j-1  | Exception in thread "pool-35-thread-1" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | Exception in thread "qtp168703427-292" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | Exception in thread "Log4j2-TF-3-Scheduled-1" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | 
neo4j-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "neo4j.IndexSampling-9"
neo4j-1  | 2024-04-14 18:36:01.113+0000 ERROR Client triggered an unexpected error [Neo.DatabaseError.Statement.ExecutionFailed]: Java heap space, reference 6251f318-7da7-43e1-b86a-2c553a445892.

Note: my script worked very well with Neo4j 4.4.32, so I don't understand what's happening here.
Note 2: I double-checked, and this time I do have my indexes and unique constraints applied ;)

  • Neo4j 5.18.1 (Docker)
  • Python stack with neomodel and direct Cypher queries

@Wenzel

Is total RAM the same for 4.4.32 and 5.18.1?
Is the memory assigned to min/max heap and the page cache the same between 4.4.32 and 5.18.1?
Is your Python script creating a single transaction which then, for example, attempts to commit 10 million changes in that one transaction?

Hi @dana_canzano!

I'm executing the same script, under the same conditions, on my 16 GB laptop.
In both cases, Neo4j has the default configuration from the Docker image.

I need to commit around 20k nodes and maybe 10k relationships in a single transaction, to protect the integrity of the database in case of failure.
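
For what it's worth, here is a minimal sketch of how I understand that pattern with the official `neo4j` Python driver: one explicit transaction for all-or-nothing integrity, with the payload sent as parameterized UNWIND batches rather than one statement per node. The label `Item` and the `rows` shape are placeholders, not my real model:

```python
# Sketch assuming the official neo4j Python driver (pip install neo4j).
# Label/property names below are illustrative placeholders.

MERGE_NODES = """
UNWIND $rows AS row
MERGE (n:Item {id: row.id})
SET n += row.props
"""

def chunked(rows, size=1000):
    """Split the payload into fixed-size batches so each Cypher call stays small."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def import_all(driver, rows):
    """Commit everything in ONE explicit transaction (all-or-nothing),
    while the server processes the data in small UNWIND batches."""
    with driver.session() as session:
        with session.begin_transaction() as tx:
            for batch in chunked(rows):
                tx.run(MERGE_NODES, rows=batch)
            tx.commit()  # nothing is visible until this succeeds
```

If the transaction fails at any point, nothing is committed, which is the integrity guarantee I need.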

@Wenzel

Neo4j has the default configuration from the Docker image.

so no explicit configuration of min/max heap and pagecache?

so no explicit configuration of min/max heap and pagecache?

that's correct.
The only env var I modify is NEO4J_PLUGINS, to enable APOC.

I've done more testing tonight, and I'm pretty sure something is broken in Neo4j 5.18.1.

4.4.32 :white_check_mark:


5.18.1 :x:


5.1 :white_check_mark:

Works fine!

So something broke between 5.1 and 5.18.1.

Poke @andrew_bowman: how do we proceed from here?
I can do some regression testing to find out which Neo4j 5.x version introduced the bug.

Do you need more log output?
Should I open an official issue to track this?

I'm available for a Zoom call if that helps.

@Wenzel

Admittedly it's apples to apples with respect to the lack of memory configuration, and per my update in Neo4j 5.18: 100% on all CPUs followed by OOM - #4 by dana_canzano, but this is not typical. Defaults are just defaults and might not always be optimal.
Also, 5.18.1? Any reason not to use 5.19, since it includes:

Scale and Availability
  • A new and improved eagerness analysis algorithm reduces the number of eager operators, improves explainability and performance, and reduces memory utilization.

First, you'll want some guardrails to prevent the system from running out of heap memory. We have documentation on that here:
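
For example, with the official Docker image the limits can be set via environment variables in a compose file. This is only a sketch: verify the setting names against the docs for your exact 5.x version, and the sizes here are arbitrary examples to adjust to your machine:

```yaml
# docker-compose fragment (illustrative sizes — tune for your hardware)
services:
  neo4j:
    image: neo4j:5.19
    environment:
      # Fixed heap (same initial/max avoids resizing pauses)
      NEO4J_server_memory_heap_initial__size: "4G"
      NEO4J_server_memory_heap_max__size: "4G"
      # Page cache for the store files
      NEO4J_server_memory_pagecache_size: "2G"
      # Guardrail: cap total memory all transactions may use,
      # so a runaway query is aborted instead of OOM-ing the JVM
      NEO4J_dbms_memory_transaction_total_max: "2G"
```

With the transaction memory cap in place, a query that exceeds the limit fails with an error instead of taking down the whole server.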

Next, you'll want to understand why your query or queries are executing differently. You'll want to get some EXPLAIN plans of the query, both from the old system and the new.

Things you are looking for:

  1. NodeByLabelScan or AllNodesScan operators, as these are highly expensive when you are ingesting data
  2. Eager operators, especially along the leftmost branch of the plan.

There may be deeper tuning to do, but those are the two big things to watch for.

If you find NodeByLabelScan or AllNodesScan operators in the plan on 5.18 but not on your other system, then you are missing critical indexes, and that is probably contributing to the issue.
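
One way to compare the plans programmatically: the Python driver exposes the plan as a nested structure (to the best of my knowledge, a dict with `operatorType` and `children` keys, via `result.consume().plan` on an EXPLAIN query). A small helper (the names here are my own) flattens it so you can diff the operator lists between versions:

```python
# Flatten a Cypher EXPLAIN plan into its operator names, so the output of the
# old and new server can be compared directly. The nested-dict shape
# ({"operatorType": ..., "children": [...]}) mirrors what the Python driver
# returns, as far as I know — verify against your driver version.

def operators(plan):
    """Depth-first list of operator names in the plan tree."""
    ops = [plan["operatorType"]]
    for child in plan.get("children", []):
        ops.extend(operators(child))
    return ops

# Hand-written example plan, for illustration only:
sample = {
    "operatorType": "ProduceResults",
    "children": [{
        "operatorType": "Eager",
        "children": [{"operatorType": "AllNodesScan", "children": []}],
    }],
}

suspicious = {"Eager", "NodeByLabelScan", "AllNodesScan"}
found = [op for op in operators(sample) if op in suspicious]
# 'found' now lists the expensive operators to investigate
```

Running the same helper on the 5.15 and 5.16 plans should make any newly introduced Eager or scan operators stand out immediately.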

@dana_canzano

Also, 5.18.1? Any reason not to use 5.19? and as it includes

Neo4j 5.19 wasn't available on Docker Hub until 2 hours ago:

I just pulled it and launched my insertion scripts again, but it ends up the same way:

neo4j-1  | Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread "neo4j.Scheduler-1"
neo4j-1  | Exception in thread "neo4j.StorageMaintenance-3" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | Exception in thread "qtp1975880178-60" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | Exception in thread "neo4j.CheckPoint-2" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | Exception in thread "Log4j2-TF-9-Scheduled-2" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | Exception in thread "qtp1975880178-63" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | Exception in thread "neo4j.StorageMaintenance-5" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | Exception in thread "qtp1975880178-67" java.lang.OutOfMemoryError: Java heap space
neo4j-1  | 2024-04-15 23:31:06.097+0000 ERROR [bolt-18] Terminating connection due to unexpected error
neo4j-1  | java.lang.OutOfMemoryError: Java heap space
neo4j-1  | 2024-04-15 23:31:09.568+0000 ERROR Client triggered an unexpected error [Neo.DatabaseError.General.UnknownError]: Could not initialize class org.neo4j.cypher.internal.CypherCurrentCompiler$, reference 15bbbfe0-512c-42aa-9a6a-2290258f0c9a.

Same heap space issue.

Also, so far I have tested up to Neo4j 5.13, and that release works.

So the issue must have been introduced between 5.14 and 5.18.

Defaults are just defaults and might not always be optimal

Given the behavior of Neo4j, I don't think this is related to a simple memory threshold being crossed: the other versions don't consume nearly as much memory, and the transaction commits in a matter of seconds.

With Neo4j 5.18.1, it hangs for minutes while maxing out my CPU for no apparent reason. Something is definitely weird here.

So, through my testing of Neo4j versions, I found the following:

  • 5.1 :white_check_mark:
  • 5.7 :white_check_mark:
  • 5.13 :white_check_mark:
  • 5.15 :white_check_mark:
  • 5.16 :x:
  • 5.18.1 :x:
  • 5.19 :x:

I believe the issue I'm facing was introduced in the Neo4j 5.16 release.

For now, I will simply downgrade to 5.15.

I will have a look at the EXPLAIN plan for the query on both 5.15 and 5.16 and see if there is a significant difference.