apoc.periodic.iterate leading to java.lang.StackOverflowError

Hello!

I'm hoping someone has experience with this; there are several similar threads, but none of the solutions seem to work for me.

I'm trying to batch-create a series of relationships in a database with pre-existing nodes.
I have a set of csv.gz files that I am processing individually, the largest of which holds details for ~60,000,000 relationships.

The following query leads to a java.lang.StackOverflowError.
The reason for the MERGE on the relationship is that each Species has separate files for breeding and non-breeding traits.

CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('/path/to/file/102998555_breeding.csv.gz') YIELD map AS row RETURN row",
  'MATCH (c:Cell {cellid: toInteger(row.sid)})
   MATCH (s:Species {taxid: toInteger(102998555)})
   MERGE (c)-[r:CONTAINS]->(s)
   SET r.breeding_range = toFloat(row.range),
       r.breeding_aoh   = toFloat(row.aoh)',
  {batchSize: 100000, iterateList: true, parallel: true}
);

The query fails even if I set batchSize:100 and parallel:false.

System details:

neo4j-desktop-1.4.15-x86_64.AppImage
database version 4.4.7
apoc version 4.4.0.5
dbms.memory.heap.initial_size=4G
dbms.memory.heap.max_size=4G
dbms.memory.pagecache.size=1G
dbms.tx_state.memory_allocation=ON_HEAP

Thank you!

From what I've read, apoc.periodic.iterate runs the first statement to produce the stream of rows, then runs the second statement for each batch of results from the first. I suspect the first stage is demanding a huge amount of memory to collect the result of reading your huge file, so that it can be batched and processed by the second stage. One way to indirectly confirm that the issue is in the first stage is to change the second statement to something like 'RETURN 1' and see if you still get a stack overflow error, as in the sketch below.
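
A minimal (untested) diagnostic: your first statement unchanged, a no-op second statement, and conservative batch settings:

// If this still overflows, the failure is in producing/collecting
// the rows, not in the MATCH/MERGE work of the second statement.
CALL apoc.periodic.iterate(
  "CALL apoc.load.csv('/path/to/file/102998555_breeding.csv.gz') YIELD map AS row RETURN row",
  'RETURN 1',
  {batchSize: 100, iterateList: true, parallel: false}
);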

If the above is true, the two-phase approach of apoc.periodic.iterate seems problematic for your use case. Maybe try Cypher's CALL { ... } IN TRANSACTIONS subquery instead.

I tried editing the apoc.periodic.iterate call with 'RETURN 1' as the second statement, but I get the same error.

I also tried the following query, pulling MATCH (s:Species) out of the CALL subquery so the matched node can be reused for every row. But I get another error:

Neo.DatabaseError.Statement.ExecutionFailed null

:auto LOAD CSV WITH HEADERS FROM "file:///path/to/file/102998555_breeding.csv.gz" AS row
MATCH (s:Species{taxid:toInteger(102998555)})
CALL {
  WITH s, row
  MATCH (c:Cell{cellid:toInteger(row.sid)})
  MERGE (c)-[r:CONTAINS]->(s)
  SET
    r.breeding_range = toFloat(row.range),
    r.breeding_aoh   = toFloat(row.aoh)
  } IN TRANSACTIONS OF 1000 ROWS;


I don't think Cypher's LOAD CSV supports compressed files, and your file path ends with a '.gz' extension. apoc.load.csv apparently does support compressed files.

Also, maybe move the first MATCH to before the LOAD CSV, so it is not repeated for each row, e.g.:
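
Something like this (untested; it keeps your path and property names, and drops the redundant toInteger() around the taxid literal):

:auto MATCH (s:Species {taxid: 102998555})
LOAD CSV WITH HEADERS FROM "file:///path/to/file/102998555_breeding.csv.gz" AS row
CALL {
  WITH s, row
  MATCH (c:Cell {cellid: toInteger(row.sid)})
  MERGE (c)-[r:CONTAINS]->(s)
  SET r.breeding_range = toFloat(row.range),
      r.breeding_aoh   = toFloat(row.aoh)
} IN TRANSACTIONS OF 1000 ROWS;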

LOAD CSV does work with .gz files; many smaller csv.gz files were imported successfully.
That said, I suspect the batched-transaction machinery doesn't cope with large csv.gz files: the file in question was imported very quickly once I un-gzipped it before running the import.
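
For example, the original apoc.periodic.iterate call works once it points at the uncompressed file (only the extension changes):

CALL apoc.periodic.iterate(
  // same statements as the failing query, but loading the un-gzipped .csv
  "CALL apoc.load.csv('/path/to/file/102998555_breeding.csv') YIELD map AS row RETURN row",
  'MATCH (c:Cell {cellid: toInteger(row.sid)})
   MATCH (s:Species {taxid: toInteger(102998555)})
   MERGE (c)-[r:CONTAINS]->(s)
   SET r.breeding_range = toFloat(row.range),
       r.breeding_aoh   = toFloat(row.aoh)',
  {batchSize: 100000, iterateList: true, parallel: true}
);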

I wouldn't call this strictly solved, as the issue is still there.

Are you stating you were able to get the large file imported as needed once the file was uncompressed first?

Yes, that is correct.

Both apoc.load.csv and LOAD CSV work with large .csv files and with small .csv.gz files. Both fail with large .csv.gz files.