Reproducible Intermittent Crash during BatchImport

carlo2 · April 11, 2023, 1:10pm

Hello, Neo4j team and community.

We can reliably reproduce an intermittent but frequent crash in the batch import that is invoked through neo4j-admin database import full, when running in Docker on an AWS EC2 instance. When the process crashes, there are no error messages or stack traces, and nothing is written to any of the logs. The crash overwhelmingly occurs during the third stage, relationship linking, and affects about 90% of our import runs. The import occasionally completes successfully. The crash does not occur with the same dataset on Docker Desktop on a Mac, or when running natively on a Mac.

Neo4j version: 5.6.0 (Docker sha256-e0d5c90a53158a563c66ecc5cccf708cdfb5f4fb8c3ad26d7d2ceff1aac1f1a7)
Operating system: Fedora 37
API/Driver: Docker

Steps to reproduce

We observed this issue exclusively on AWS EC2 instances. We start with an ARM-based EC2 instance (we tested c6g.2xlarge, c6gd.4xlarge, m6g.xlarge, m6g.2xlarge, and m7g.4xlarge, always with a GP3 disk with 6000-7000 IOPS and 625-750 MB/s of throughput) running a clean installation of Fedora with Docker.
We download and extract our dataset. Weighing in at 140-180 GB, it is a series of 4 CSVs that contain nodes and 107 CSVs that contain relationships. The dataset varied between 700M-1.1B nodes and 900M-1.9B relationships, but all variations triggered the bug.
We kick off the import inside Docker. See "Full import command used" in the 'Additional Details' section.
We typically see the first two stages, node import and relationship import, finish successfully.

Expected behavior

Stages 3 and 4 finish successfully without error.
The database is started and the data is visible, and we can add indexes and begin querying.

Actual behavior (n = ~50)

Probability	Result
~90% of the time	Stage 3 completes to about 75%. Then, the program exits with exit code `0`. No error messages are printed. No messages are printed to the logs.
~5% of the time	The import completes successfully, and the database is usable.
~4% of the time	The entire import process completes and prints a success message, but on starting the server, the database is corrupt and unusable; none of the data is present.
~1% or less	Stage 4 crashes and exits without displaying an error. The logs do not contain any useful information.
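
Since the failing runs exit silently with code 0, a record of when each run started and ended and what status it returned could help when comparing attempts. The following is a minimal sketch, not part of our actual setup: run_and_record and the log file name are hypothetical, and the import command itself is not reproduced here.

```shell
# Hypothetical helper: run any command, append start/end timestamps and the
# exit status to a log file, and return the command's own exit status.
run_and_record() {
  log_file=$1
  shift
  status=0
  printf '%s START %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$log_file"
  "$@" || status=$?
  printf '%s END exit=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$status" >> "$log_file"
  return "$status"
}

# Usage: prefix the existing import invocation, for example
#   run_and_record import-runs.log <the docker command used for the import>
```

The END timestamp can then be lined up against the kernel and Docker logs for the same moment.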

We are not Neo4j experts, but the probabilistic nature of the crashes leads us to guess that the issue comes from a bug in concurrent processing or concurrent disk writes. The fact that we can successfully import the dataset 5% of the time (and 100% of the time on other platforms) leads us to believe that the dataset is not the issue. The fact that we cannot reproduce the issue with the exact same container and dataset (hashes verified equal) on Docker Desktop on macOS (also ARM-based) leads us to guess that specific performance characteristics of the kernel (the version differs between the Fedora EC2 instance and the Docker Desktop VM) or the underlying hardware are triggering this bug. We have also tested disabling SELinux in the Fedora instance, to no effect.
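
For anyone who wants to repeat the "hashes verified equal" comparison, one possible approach is to build a sorted checksum manifest of the extracted CSVs on each machine and diff the two. This is only a sketch: csv_manifest is a hypothetical helper, it assumes GNU find, sort, xargs, and sha256sum, and it assumes the extracted files end in .csv.

```shell
# Hypothetical helper: print a SHA-256 line for every .csv file under a
# directory, in a stable order and with paths relative to that directory,
# so manifests from two machines can be compared directly.
csv_manifest() {
  ( cd "$1" && find . -type f -name '*.csv' -print0 | sort -z | xargs -0 -r sha256sum )
}

# Usage: run on each machine, then compare the two files
#   csv_manifest /path/to/dataset > manifest-ec2.txt
#   diff manifest-ec2.txt manifest-mac.txt
```

An empty diff means every CSV has the same content on both machines.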

Additional Details

Full import command used
Nodes CSV Format
Relationships CSV Format

We would appreciate any assistance in either fixing or working around this issue. We would be glad to assist in reproducing or debugging in any way we can.

Many Thanks,
Carlo Latasa
Sigma Ratings

Omer · April 14, 2023, 9:04pm

Good evening,
I have never used AWS with ARM, or datasets of that size.
But for hardware-level logging on Linux, you could look in the kernel ring buffer with the command:
$ dmesg
You could also look at some memory tuning:
$ sudo neo4j-admin server memory-recommendation --docker

On x86, I also get this advice:

# It is also recommended turning out-of-memory errors into full crashes,
# instead of allowing a partially crashed database to continue running:
NEO4J_server_jvm_additional='-XX:+ExitOnOutOfMemoryError'
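
If memory pressure is the suspect, the dmesg output can be narrowed to lines that usually accompany a kernel OOM kill. This is a rough sketch only: oom_lines is a hypothetical helper, and the pattern covers common kernel wording rather than every variant.

```shell
# Hypothetical helper: read kernel log text on stdin and keep only lines
# that commonly appear when the OOM killer has terminated a process.
oom_lines() {
  grep -i -E 'out of memory|oom-kill|killed process'
}

# Usage on the EC2 host right after a failed import:
#   sudo dmesg -T | oom_lines
# Docker also records whether a container was OOM-killed, as long as the
# container has not been removed yet:
#   docker inspect --format '{{.State.OOMKilled}} {{.State.ExitCode}}' <container>
```

No matching lines would not rule out a memory problem, but a match would point to it fairly directly.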

Yours kindly
Omer
