Neo.TransientError.General.MemoryPoolOutOfMemoryError with LOAD CSV

I have two CSV files containing roughly 60k nodes and 30k edges, respectively. I load the nodes with

LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/Parents.csv' AS line
CREATE (:Person {GID: line.GID, Name: line.Fullname, Birth: line.Birth, Death: line.Death});

I create an index and count the imported nodes

CREATE INDEX idxGID FOR (n:Person) ON (n.GID);
MATCH (n:Person)
RETURN count(*);

Everything is fine and I get the correct count. Then I try to import the edges with

LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/Families.csv' AS line
MATCH (p1:Person {GID:line.Father}), (p2:Person {GID:line.Mother})
CREATE (p1)-[:Marriage]->(p2);

This runs for about 60 seconds and then an error message is displayed:

Neo.TransientError.General.MemoryPoolOutOfMemoryError
The allocation of an extra 384,0 MiB would use more than the limit 2,8 GiB. Currently using 2,6 GiB. dbms.memory.transaction.total.max threshold reached

My memory settings are:
server.memory.heap.initial_size=4g
server.memory.heap.max_size=4g

To be honest, I'm a bit lost as to what's going wrong here. Can somebody please help me?

Thanks -- Ulrich

Hi Ulrich,

The problem seems to be that there is too much data in the CSV to save in a single commit; the transactional state for the commit would exceed the 2.8 GiB limit.
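That limit comes from a separate setting than the heap size. If you ever need more headroom, it can be raised in neo4j.conf; a sketch (the setting name is taken from your error message, and the 3g value is just an illustration):

```
# neo4j.conf: cap on memory used by all running transactions combined.
# (The 2.8 GiB in your error appears to be derived from the 4g heap.)
dbms.memory.transaction.total.max=3g
```

That said, raising the limit only masks the underlying problem here.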

You will need to batch the changes across transactions. We have documentation on how to do this using CALL {} IN TRANSACTIONS in concert with LOAD CSV here:

If I understand the documentation correctly, my second import statement would be:

LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/Families.csv' AS line
CALL {
  WITH line
  MATCH (p1:Person {GID:line.Father}), (p2:Person {GID:line.Mother})
  CREATE (p1)-[:Marriage]->(p2)
} IN TRANSACTIONS OF 1000 ROWS

This results in the error message:

Neo.DatabaseError.Statement.ExecutionFailed

A query with 'CALL { ... } IN TRANSACTIONS' can only be executed in an implicit transaction, but tried to execute in an explicit transaction.

I guess the problem is that I cannot create the edges between p1 and p2 with this approach. Is there any other way to create the edges from the CSV file?

And BTW the documentation says

LOAD CSV may run into memory issues with files containing a significant number of rows (approaching hundreds of thousands or millions).

I'm talking about 30k edges between 60k nodes so I'm again a bit lost.

Ulrich

LOAD CSV on its own may run into those memory issues. That's why we're using CALL {} IN TRANSACTIONS to batch the changes.

I think we're almost there, we just need to make sure this is executed in an implicit transaction.

If you're using Neo4j Browser, you can prefix the query with :auto (this is interpreted as a command by the Browser, not as part of the Cypher query itself), which tells the Browser to execute the query as an implicit transaction.

If you're executing this via some other approach, we would need more details in order to point you to the relevant documentation for running the query as an implicit transaction.

Ok, let's see. I'm working with the Neo4j browser and my statement is now:

:auto
LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/Families.csv' AS line
CALL {
  WITH line
  MATCH (p1:Person {GID:line.Father}), (p2:Person {GID:line.Mother})
  CREATE (p1)-[:Marriage]->(p2)
} IN TRANSACTIONS OF 100 ROWS

Now the last error message with the implicit transaction is gone but I run into my old memory issue again:

Neo.TransientError.General.MemoryPoolOutOfMemoryError

The allocation of an extra 384,0 MiB would use more than the limit 2,8 GiB. Currently using 2,6 GiB. dbms.memory.transaction.total.max threshold reached

But again, Andrew, I'm talking about a CSV file with 30k lines, now in 100-row batches. This is peanuts (as we say in our German/English mixed-up language :wink:) and nothing Neo4j should be concerned about. And please remember: the CSV with the 60k nodes was imported without any problems.

So could it be that the real source of the problem is not the LOAD CSV but something else I'm trying to do to create the edges?

Yes, I think you're right about that.

I missed that the error message is about the total transactional memory, not the transactional memory for a single transaction. Other transactions may be eating up the heap you need.

SHOW TRANSACTIONS

may reveal long-running transactions that may be using up most of your memory. You may need to kill transactions that are outstaying their welcome, or may be collectively eating up the majority of your heap.
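In recent Neo4j versions you can also terminate a transaction directly from Cypher; a sketch (the transaction id below is a placeholder you'd copy from the SHOW TRANSACTIONS output):

```cypher
// List running transactions with their queries and runtimes
SHOW TRANSACTIONS YIELD transactionId, currentQuery, elapsedTime;

// Kill a specific transaction; the id is a placeholder
TERMINATE TRANSACTIONS 'neo4j-transaction-123';
```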

If the transactions can't be killed, then there may be some other issue in play (and an opportunity to add safeguards such as transaction timeouts, lock acquisition timeouts, and per-transaction memory limits); as a fallback, a restart of the leader node would clear out any transactions outstaying their welcome.

I didn't see anything besides the Neo4j browser transaction I was using. But I rewrote my code a bit and now the memory error is gone. Here is my current code:

LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/Parents.csv' AS line
CREATE (:Person {GID: line.GID, Name: line.Fullname, Birth: line.Birth, Death: line.Death});

CREATE INDEX idxGID FOR (n:Person) ON (n.GID);

LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/Families.csv' AS line
MATCH (p1), (p2)
WHERE p1.GID = line.Father AND p2.GID = line.Mother
MERGE (p1)-[r:Marriage]->(p2);

The first LOAD CSV for the 60k nodes completes after 0.3 s and reports that all the nodes have been created and their properties assigned. The index creation completes after 2 ms, and SHOW INDEXES lists the index.

So everything appears to be ok for me. Now I run the second LOAD CSV and it runs and runs and runs and ... Finally it says:

Created 30624 relationships, completed after 1241478 ms

That is more than 20 minutes. Obviously I have a major problem with the matching of nodes p1 and p2 and the index appears to be ignored. But what is it?

Ulrich

It’s not going to use the index when the MATCH has no label, as the index is specific to the :Person label.

Try this:

LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/Families.csv' AS line
MATCH (p1:Person), (p2:Person)
WHERE p1.GID = line.Father AND p2.GID = line.Mother
MERGE (p1)-[r:Marriage]->(p2);
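And if the file ever grows to the point where memory becomes a problem again, the same labeled MATCH can be combined with the batching from earlier in the thread (prefixed with :auto in Neo4j Browser); a sketch:

```cypher
:auto
LOAD CSV WITH HEADERS FROM 'file:///C:/Temp/Families.csv' AS line
CALL {
  WITH line
  MATCH (p1:Person {GID: line.Father}), (p2:Person {GID: line.Mother})
  MERGE (p1)-[:Marriage]->(p2)
} IN TRANSACTIONS OF 1000 ROWS
```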

Looks much better now:

Created 30624 relationships, completed after 765 ms.

How can I find the longest node chain, i.e. the longest sequence of interconnected nodes? And as a variation of this question: since a node can have more than one edge connected to it, there will be "2-dimensional chains" or "spider webs" in my data. How can I identify the "largest spider web", i.e. the one containing the most nodes?

Can you create a new thread for your last question if you still want some help with it?