Hello, Neo4J team and community.
We can reliably reproduce an intermittent but frequent crash in the `batchimport` function that is invoked through `neo4j-admin database import full` when running in Docker on an AWS EC2 instance. No error messages, stack traces, or other output are written to any of the logs when the process crashes. The crash overwhelmingly occurs during the third stage, `relationship linking`, and affects about 90% of our import runs; the import only occasionally completes successfully. The crash does not occur with the same dataset on Docker Desktop on a Mac, or when running natively on a Mac.
- Neo4j version: 5.6.0 (Docker sha256-e0d5c90a53158a563c66ecc5cccf708cdfb5f4fb8c3ad26d7d2ceff1aac1f1a7)
- Operating system: Fedora 37
- API/Driver: Docker
Steps to reproduce
- We observed this issue exclusively on AWS EC2 instances. Each test used an ARM-based EC2 instance (we tested c6g.2xlarge, c6gd.4xlarge, m6g.xlarge, m6g.2xlarge, and m7g.4xlarge, always with a gp3 disk provisioned with 6000-7000 IOPS and 625-750 MB/s of throughput) running a clean installation of Fedora with Docker.
- We download and extract our dataset. Weighing in at 140-180 GB, it is a series of 4 CSVs that contain nodes and 107 CSVs that contain relationships. The dataset varied between 700M-1.1B nodes and 900M-1.9B relationships, but all variations triggered the bug.
- We kick off the import inside of docker. See 'Additional Details' section for "Full import command used".
- We typically see the first two stages, `node import` and `relationship import`, finish successfully.
Expected behavior
- Stages 3 and 4 finish successfully without error.
- The database is started and the data is visible, and we can add indexes and begin querying.
Actual behavior (n = ~50)
Probability | Result |
---|---|
~90% of the time | [image] 0. No error messages are printed. No messages are printed to the logs. |
~5% of the time | [image] |
~4% of the time | [image] |
~1% or less | [image] |
We are not Neo4j experts, but the probabilistic nature of the crashes leads us to guess that the issue comes from a bug in concurrent processing or concurrent disk writes. Because the import succeeds about 5% of the time on EC2 (and 100% of the time on other platforms), we believe the dataset is not the issue. Because we cannot reproduce the issue with the exact same container and dataset (hashes verified equal) on Docker Desktop on macOS (also ARM-based), we suspect that specific performance characteristics of the kernel (the version differs between the Fedora EC2 instance and the Docker Desktop VM) or of the underlying hardware are triggering this bug. We have also tested disabling SELinux on the Fedora instance, to no effect.
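Since the crash leaves nothing in the logs, the container's recorded exit state may be one more piece of evidence worth collecting. The following is only a sketch of how that could be done, not something taken from our actual setup: `classify_exit` and `inspect_state` are hypothetical helper names, the container name is a placeholder, and it assumes the container was started without `--rm` so that `docker inspect` can still see it after the process dies. It reads the `ExitCode` and `OOMKilled` fields of the container's `State` object and relies on the usual convention that an exit code above 128 means termination by signal `code - 128`.

```python
import json
import subprocess


def classify_exit(state: dict) -> str:
    """Summarise the `State` object reported by `docker inspect`."""
    code = state.get("ExitCode")
    # OOMKilled is set by Docker when the container's cgroup hit an OOM kill.
    if state.get("OOMKilled"):
        return "killed by the kernel OOM killer"
    # By convention, exit codes above 128 mean "terminated by signal (code - 128)".
    if code is not None and code > 128:
        return f"terminated by signal {code - 128}"
    if code == 0:
        return "exited cleanly with code 0"
    return f"exited with code {code}"


def inspect_state(container: str) -> dict:
    """Fetch the `State` object of a stopped container (name is a placeholder)."""
    out = subprocess.run(
        ["docker", "inspect", "--format", "{{json .State}}", container],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return json.loads(out)
```

Calling `classify_exit(inspect_state("my-import-container"))` after a failed run (with the placeholder replaced by the real container name) would distinguish a clean exit from a signal or an out-of-memory kill, which could help confirm or rule out memory pressure on the host as a factor.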
Additional Details
- Full import command used
- Nodes CSV Format
- Relationships CSV Format
We would appreciate any assistance in either fixing or working around this issue. We would be glad to assist in reproducing or debugging in any way we can.
Many Thanks,
Carlo Latasa
Sigma Ratings