I want to export/download the whole database from an online Neo4j 4.1.3 Enterprise instance. I would like to stream the data into my client application for processing, so I tried apoc.export.csv.all. I tested it in the browser, streaming only 5 small batches:
CALL apoc.export.csv.all(null, {stream: true, batchSize: 100})
YIELD data
RETURN data LIMIT 5
For small databases this succeeds, but for my large database (~200M nodes, ~250M relationships) the query fails with:
Failed to invoke procedure 'apoc.export.csv.all': Caused by: java.lang.RuntimeException: Error polling, timeout of 100 seconds reached.
Am I on the right track here, or should I do it differently?
EDIT:
I started by simply streaming query results (e.g. MATCH (n) RETURN n) into my client application. But this seems to consume an increasing amount of memory server-side: after about 500k nodes I get:
Neo4j.Driver.ClientException: The allocation of an extra 1.9 MiB would use more than the limit 1.0 GiB. Currently using 1022.9 MiB. dbms.memory.transaction.max_size threshold reached
at Neo4j.Driver.Internal.MessageHandling.ResponsePipelineError.EnsureThrownIf(Func`2 predicate)
at Neo4j.Driver.Internal.MessageHandling.ResponsePipelineError.EnsureThrown()
at Neo4j.Driver.Internal.Result.ResultCursorBuilder.NextRecordAsync()
at Neo4j.Driver.Internal.Result.ResultCursor.FetchAsync()
It seems that each returned node consumes roughly 2 KiB of memory server-side. Is this to be expected?
Neo4j has limits on the total size of a transaction (which you've already hit) and on how long a query can run before it times out. Both are configurable, but if you're trying to export the entire database to CSV in one go, you may not be able to avoid a single transaction that ends up being the size of the database. That's going to be tough unless you raise your query timeout and also have more memory available than you have data on disk.
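For reference, both limits live in neo4j.conf. A minimal sketch, with purely illustrative values (not recommendations):
# neo4j.conf -- illustrative values only
# Upper bound on the memory a single transaction may allocate
# (the limit from the error message above)
dbms.memory.transaction.max_size=4g
# Abort transactions running longer than this (default 0 = no timeout)
dbms.transaction.timeout=30m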
In theory, there is no need to keep all the data in memory for the transaction; it should just be an unbuffered stream to my client application. So the question is: is the data technically expected to be held in memory?
Are there other techniques for exporting all data into a standard format (like CSV or JSON)?
My current solution is to split the MATCH (n) RETURN n query into multiple queries, so that transaction memory is released on the server side between batches.
The query looks like this: MATCH (n) WHERE (id(n) % $batchCount) = $batchIndex RETURN n. The number of batches can be derived from the total node count divided by the desired batch size. That seems to work pretty well; a sketch of the client-side loop follows below.
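For illustration, here is a minimal sketch of that batching loop using the official Neo4j.Driver (.NET) package, which the stack trace above suggests is in use. The URI, credentials, batch size of ~100k nodes, and the ProcessNode placeholder are all assumptions, not part of the original solution:
using System;
using System.Threading.Tasks;
using Neo4j.Driver;

public static class BatchedExport
{
    // Placeholder for whatever client-side processing is needed.
    static void ProcessNode(INode node) => Console.WriteLine(node.Id);

    public static async Task RunAsync()
    {
        // Connection details are assumptions for this sketch.
        var driver = GraphDatabase.Driver("neo4j://localhost:7687",
            AuthTokens.Basic("neo4j", "password"));
        var session = driver.AsyncSession();
        try
        {
            // Derive the number of batches from the total node count.
            var countCursor = await session.RunAsync("MATCH (n) RETURN count(n) AS c");
            var total = (await countCursor.SingleAsync())["c"].As<long>();
            var batchCount = (int)Math.Ceiling(total / 100_000.0); // ~100k nodes per batch (assumed)

            for (var batchIndex = 0; batchIndex < batchCount; batchIndex++)
            {
                // Each batch is its own auto-commit transaction, so server-side
                // transaction memory is released between batches.
                var cursor = await session.RunAsync(
                    "MATCH (n) WHERE (id(n) % $batchCount) = $batchIndex RETURN n",
                    new { batchCount, batchIndex });

                await cursor.ForEachAsync(record => ProcessNode(record["n"].As<INode>()));
            }
        }
        finally
        {
            await session.CloseAsync();
            await driver.CloseAsync();
        }
    }
}
Note that each cursor is fully consumed before the next batch starts, so at most one batch is in flight at a time. Since id(n) can have gaps (from deleted nodes), batches may vary in size, but every node is still covered exactly once.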