Tuning for larger-than-memory multiple-TB graph node insertion

This is my first project using Neo4j and I learned Cypher just last month, so apologies in advance if this turns out not to be a performance problem and my assumption is wrong.

First off, all the code involved is open source, and the changes causing these performance issues come specifically from Further performance and RAM usage improvements: batch inserts by robobenklein · Pull Request #9 · utk-se/WorldSyntaxTree · GitHub

I've been trying to get past the ~250GB-of-data-stored mark for a while, and I've tuned the memory settings as described in a post on my own site: Neo4j Performance adventures for petabyte-scale datasets – Unhexium

At my current stage I am hitting a lot of transient errors that don't seem to resolve quickly enough. The most recent is neo4j.exceptions.TransientError: {code: Neo.TransientError.Transaction.BookmarkTimeout} {message: Database 'top1k' not up to the requested version: 609977. Latest database version is 609948}
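When a batch fails on one of these transient errors, the fallback I'd reach for is a generic outer retry loop along these lines. This is a simplified sketch, not the actual WorldSyntaxTree code; the helper name, retry count, and backoff values are all made up for illustration:

```python
import time

def run_with_retry(work, retryable=(Exception,), retries=5, backoff=2.0, sleep=time.sleep):
    """Call work() and retry on the given exception classes with exponential backoff.

    In the real workers, retryable would be neo4j.exceptions.TransientError
    and work a closure that runs one managed batch transaction.
    """
    for attempt in range(retries):
        try:
            return work()
        except retryable:
            if attempt == retries - 1:
                raise  # exhausted retries; let the caller see the error
            sleep(backoff * 2 ** attempt)
```

Note that managed transactions already retry retriable failures internally, so an outer loop like this only matters if BookmarkTimeout escapes the driver's own retry logic.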

With a single file / worker process, node insertion runs at a few thousand nodes per second, which is nowhere near fast enough to get through all the parsed repos from GitHub, so I run 128 workers all submitting batch queries of ~1k nodes per query.
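The batch submission pattern looks roughly like this. It's a simplified sketch: the `SyntaxNode` label and the flat property shape are placeholders, not the actual WorldSyntaxTree schema:

```python
# One UNWIND query per batch, so the query plan is compiled once and reused.
BATCH_QUERY = """
UNWIND $rows AS row
CREATE (n:SyntaxNode)
SET n = row
"""

def chunked(rows, size=1000):
    """Yield successive batches of `size` node-property dicts."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def insert_all(session, rows, batch_size=1000):
    """Submit one managed write transaction per batch.

    `session` is a neo4j.Session opened against the target database;
    execute_write is the 5.x name (write_transaction in 4.x drivers).
    """
    for batch in chunked(rows, batch_size):
        session.execute_write(
            lambda tx, b=batch: tx.run(BATCH_QUERY, rows=b).consume()
        )
```

Each worker opens its own session, so batches from different workers land in independent transactions.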

When using just a few workers (8-16 total, writing ~10k nodes/s) I hit these problems at a much lower rate, and I've verified that jobs which normally fail will pass just fine when run with only a few workers.

Given the hardware I'm running this on (128 hardware threads, 512GB of RAM), I would expect to be able to run at least 64 workers, but some jobs are failing due to the BookmarkTimeout errors.

I set the Python driver to use a managed transaction with a timeout of 30 minutes per batch insertion, since inserting 1k nodes shouldn't ever take longer than that, right?

Earlier, when I tried to insert 10k nodes per batch, Neo4j would quickly run out of memory:

Mar 04 21:07:31 caladan neo4j[2061807]: ERROR StatusLogger An exception occurred processing Appender log
Mar 04 21:07:31 caladan neo4j[2061807]:  org.neo4j.logging.shaded.log4j.core.appender.AppenderLoggingException: java.lang.OutOfMemoryError
Mar 04 21:07:31 caladan neo4j[2061807]: Caused by: java.lang.OutOfMemoryError
Mar 04 21:07:31 caladan neo4j[2061807]:         at java.base/java.lang.AbstractStringBuilder.hugeCapacity(AbstractStringBuilder.java:214)

Now, with 1k nodes per batch, I get a different error:

Mar 04 22:34:59 caladan neo4j[2061807]: ERROR StatusLogger An exception occurred processing Appender log
Mar 04 22:34:59 caladan neo4j[2061807]:  java.lang.NegativeArraySizeException: -1508875789
Mar 04 22:34:59 caladan neo4j[2061807]:         at java.base/java.lang.StringCoding.encodeUTF8_UTF16(StringCoding.java:910)
Mar 04 22:34:59 caladan neo4j[2061807]:         at java.base/java.lang.StringCoding.encodeUTF8(StringCoding.java:885)
Mar 04 22:34:59 caladan neo4j[2061807]:         at java.base/java.lang.StringCoding.encode(StringCoding.java:415)
Mar 04 22:34:59 caladan neo4j[2061807]:         at java.base/java.lang.String.getBytes(String.java:941)
Mar 04 22:34:59 caladan neo4j[2061807]:         at org.neo4j.logging.shaded.log4j.core.layout.AbstractStringLayout.getBytes(AbstractStringLayout.java:218)

This doesn't seem like a UnicodeDecodeError, though, since I catch all of those in the Python program before they ever reach a Cypher query.

The largest a single batch could realistically get is ~1GB, assuming the worst case where every inserted node has a 1MB text property, which is incredibly rare. Even so, with 256GB of RAM dedicated to Neo4j, what should I adjust to avoid these problems without sacrificing insert-rate performance?
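For reference, the memory split I'm tuning is along these lines in neo4j.conf. The sizes here are illustrative rather than my exact values, and the setting names are the Neo4j 4.x ones (dbms.memory.transaction.global_max_size exists from 4.2 onward):

```
# Heap for query execution / transaction state
dbms.memory.heap.initial_size=64g
dbms.memory.heap.max_size=64g
# Page cache for the store files
dbms.memory.pagecache.size=160g
# Cap on combined memory used by all running transactions
dbms.memory.transaction.global_max_size=32g
```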

I've dropped the batch size to 100 and decreased the number of processes to just 8, and I'm still getting these errors, so the problem seems to be related to the total amount of data I'm inserting, not the rate.