Java Memory allocation issue with Bulk upload in Docker Neo4j:4.1.0

I am running the Neo4j:4.1.0 Docker image on Ubuntu. When doing a series of bulk CSV uploads, I repeatedly hit an issue where the Docker container exits. Inspecting the container logs shows Neo4j running into a memory allocation issue:

Starting Neo4j.
2020-07-02 21:55:13.830+0000 WARN  Use of deprecated setting dbms.connectors.default_listen_address. It is replaced by dbms.default_listen_address
2020-07-02 21:55:13.836+0000 INFO  Starting...
2020-07-02 21:55:17.068+0000 INFO  ======== Neo4j 4.1.0 ========
2020-07-02 21:55:19.073+0000 INFO  Performing postInitialization step for component 'security-users' with version 2 and status CURRENT
2020-07-02 21:55:19.074+0000 INFO  Updating the initial password in component 'security-users'  
2020-07-02 21:55:19.079+0000 INFO  Updating initial user password from `auth.ini` file: neo4j
malloc(): invalid size (unsorted)

The machine does not have a ton of RAM (4 GB), but still more than the minimum 2 GB. I destroyed the container and tried again with the Neo4j:4.0 image and things are running fine. I am running Ubuntu 20.04 on Intel x86 with Docker 19.03.8, if that is helpful.

At the moment I am content using the 4.0 Docker image, but am curious to see if others have run into a similar issue.

What is the data size you are trying to load?

And how are you loading the data?

Can you please provide more details regarding this issue?

Hello,

I'm having the same issue. I've been trying for a few days now to upload a data set of 1.5 GB of CSV files into Neo4j and have been unsuccessful :( I ran into various out-of-memory issues... I finally gave up and put it on a 64 GB machine on Google Compute Engine... still no luck.

I load a dataset consisting of: enterprise.csv, establishment.csv ... establishment is about 17 million rows.
I have split the CSV files into chunks of 1000 and start and end a transaction after merging each chunk.

I first insert all enterprises, then all establishments... this part goes fine. But creating the relationships between them always fails after inserting a few chunks... I don't know exactly how many, let's say 200 times 1000 lines.

These are my memory settings:

dbms.memory.pagecache.size=10g
dbms.memory.transaction.global_max_size=10g
dbms.memory.transaction.max_size=10g
dbms.tx_state.memory_allocation=ON_HEAP
dbms.jvm.additional=-XX:MaxDirectMemorySize=20g
dbms.jvm.additional=-Dio.netty.maxDirectMemory=20g

During the upload of the relationships, my Docker instance keeps quitting with these logs:
...
2020-08-01 16:55:39.415+0000 ERROR Client triggered an unexpected error [Neo.DatabaseError.Schema.IndexDropFailed]: Unable to drop index: Index belongs to constraint: (:Code {code}), reference 370065a0-bbe7-41f6-9482-d9ed5cb5bea0.
malloc(): invalid next size (unsorted)

This is the log of Neo4j itself:
2020-08-01 16:55:41.095+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=269, gcTime=295, gcCount=1}
2020-08-01 16:56:50.787+0000 INFO [o.n.k.d.Database] [neo4j] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.1] version=0, last transaction in previous log=826, rotation took 25 millis.
2020-08-01 16:58:07.057+0000 INFO [o.n.k.d.Database] [neo4j] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.2] version=1, last transaction in previous log=1756, rotation took 22 millis, started after 76248 millis.
2020-08-01 16:59:20.352+0000 INFO [o.n.k.d.Database] [neo4j] Rotated to transaction log [/data/transactions/neo4j/neostore.transaction.db.3] version=2, last transaction in previous log=2779, rotation took 23 millis, started after 73272 millis.

I just saw this: 2020-08-01 16:55:41.095+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=269, gcTime=295, gcCount=1}

I have seen these before; as I said, I am on a 64 GB machine (12 CPUs, I think).

Any help would be appreciated! This is driving me nuts. I would have no problem with someone logging in on the machine and having a look. Nothing secret on there.

Kr,
Nick

// import enterprise files
for (let fileName of fs.readdirSync("/data/containers/neo4j/import")
	.map(p => path.basename(p))
	.filter(p => p.startsWith("enterprise"))) {
	await run(`
		LOAD CSV WITH HEADERS FROM 'file:///${fileName}' AS row
		MERGE (n:Enterprise { id: coalesce(row.EnterpriseNumber, ""), status: coalesce(row.Status, ""),
		 juridicalSituation: coalesce(row.JuridicalSituation, ""), typeOfEnterprise: coalesce(row.TypeOfEnterprise, ""),
		 juridicalForm: coalesce(row.JuridicalForm, ""), startDate: coalesce(row.StartDate, "") })
	`);
}

// import establishment files
for (let fileName of fs.readdirSync("/data/containers/neo4j/import")
	.map(p => path.basename(p))
	.filter(p => p.startsWith("establishment"))) {
	await run(`
		LOAD CSV WITH HEADERS FROM 'file:///${fileName}' AS row
		MERGE (n1:Establishment {
		 id: 			coalesce(row.EstablishmentNumber, ""),
		 startDate: 	coalesce(row.StartDate, ""),
		 enterprise: 	coalesce(row.EnterpriseNumber, "")
		})
	`);
}

// create relationships between enterprises and establishments
for (let fileName of fs.readdirSync("/data/containers/neo4j/import")
    .map(p => path.basename(p))
    .filter(p => p.startsWith("establishment"))) {
    await run(`
		LOAD CSV WITH HEADERS FROM 'file:///${fileName}' AS row
		MATCH (n1:Establishment {
		 id: 			coalesce(row.EstablishmentNumber, ""),
		 startDate: 	coalesce(row.StartDate, ""),
		 enterprise: 	coalesce(row.EnterpriseNumber, "")
		}) WITH n1, row
		MATCH (n2:Enterprise {
		  id:			coalesce(row.EnterpriseNumber, "")
		}) WITH n1, n2
		MERGE (n2)-[:HAS_ESTABLISHMENT]->(n1);
	`);
}

You are trying to commit the whole data set in a single transaction. That's the big issue here.

LOAD CSV WITH HEADERS FROM 'file:///${fileName}' AS row
		MERGE (n1:Establishment {
		 id: 			coalesce(row.EstablishmentNumber, ""),
		 startDate: 	coalesce(row.StartDate, ""),
		 enterprise: 	coalesce(row.EnterpriseNumber, "")
		})

This is a single transaction. Depending on how much data you are adding this can require huge amounts of heap.

I would suggest using periodic commit. You can change your query to this:

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///${fileName}' AS row
		MERGE (n1:Establishment {
		 id: 			coalesce(row.EstablishmentNumber, ""),
		 startDate: 	coalesce(row.StartDate, ""),
		 enterprise: 	coalesce(row.EnterpriseNumber, "")
		})

This will commit the data after every 10,000 rows read.

Do the same for the second query also.
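
For example, here is a sketch of the relationship query with the same prefix applied (matching on id alone, on the assumption that it uniquely identifies each node, so an index on id can be used):

USING PERIODIC COMMIT 10000
LOAD CSV WITH HEADERS FROM 'file:///${fileName}' AS row
MATCH (n1:Establishment { id: coalesce(row.EstablishmentNumber, "") })
MATCH (n2:Enterprise { id: coalesce(row.EnterpriseNumber, "") })
MERGE (n2)-[:HAS_ESTABLISHMENT]->(n1)

One thing to watch out for: USING PERIODIC COMMIT only works as an auto-commit query, so if your run() helper wraps the statement in an explicit transaction it will be rejected.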

Also, make sure you have created indexes. Otherwise the lookups will take longer and longer as the database grows, and you will see more of those stop-the-world garbage collection pause warnings.
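
For example, something along these lines (Neo4j 4.x syntax; the index names are just illustrative, and this assumes id is the property the lookups go through):

CREATE INDEX enterprise_id FOR (n:Enterprise) ON (n.id);
CREATE INDEX establishment_id FOR (n:Establishment) ON (n.id);

With roughly 17 million establishments, the MATCH lookups in the relationship query will otherwise fall back to label scans.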

I can confirm what anthapu said: it can make a big difference if you create the indexes before the data load instead of after.

As a side note, in case you are attempting to create a brand-new Neo4j database: when doing bulk loads with neo4j-admin to create a new db from nothing, the indexes (by definition) can't exist in advance, but because transaction tracking is off in that case, the data load is lightning fast. If this is a new database from scratch, you will want to look into bulk loads.