I want to export/download the whole database from an online Neo4j 4.1.3 Enterprise instance. I would like to stream the data into my client application for processing, so I tried apoc.export.csv.all. I tested it in the browser, streaming only 5 small batches:
CALL apoc.export.csv.all(null, {stream: true, batchSize: 100})
YIELD data
RETURN data LIMIT 5
For small databases this succeeds, but for my large database (~200M nodes, ~250M relationships) the query fails with:
Failed to invoke procedure 'apoc.export.csv.all': Caused by: java.lang.RuntimeException: Error polling, timeout of 100 seconds reached.
Am I on the right track here, or should I do it differently?
EDIT:
I started by simply streaming query results (e.g. MATCH (n) RETURN n) into my client application. But this seems to consume an increasing amount of memory server-side: after about 500k nodes I get:
Neo4j.Driver.ClientException: The allocation of an extra 1.9 MiB would use more than the limit 1.0 GiB. Currently using 1022.9 MiB. dbms.memory.transaction.max_size threshold reached
at Neo4j.Driver.Internal.MessageHandling.ResponsePipelineError.EnsureThrownIf(Func`2 predicate)
at Neo4j.Driver.Internal.MessageHandling.ResponsePipelineError.EnsureThrown()
at Neo4j.Driver.Internal.Result.ResultCursorBuilder.NextRecordAsync()
at Neo4j.Driver.Internal.Result.ResultCursor.FetchAsync()
It seems that each returned node consumes roughly 2 KiB of memory server-side. Is this to be expected?
Neo4j has limits on the total size of a transaction (which you've already hit) and on how long a query can run before it times out. Both are configurable, but if you're trying to export the entire database to CSV in one go, you may not be able to avoid a single transaction that ends up being the size of the database. That's going to be tough unless you raise your query timeout and also have more memory available than you have data on disk.
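For reference, both limits live in neo4j.conf. A minimal sketch, with purely illustrative values (not recommendations):
# neo4j.conf -- illustrative values only
# Upper bound on the memory a single transaction may allocate
# (the limit from the error message above)
dbms.memory.transaction.max_size=4g
# Abort transactions running longer than this (default 0 = no timeout)
dbms.transaction.timeout=30m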
In theory, there is no need to keep all the data in memory for the transaction; it should just be an unbuffered stream to my client application. So the question is: is the data technically expected to be held in memory?
Are there other techniques for exporting all data into a standard format (like CSV or JSON)?
My current solution is to split the MATCH (n) RETURN n query into multiple queries, so that transaction memory is released on the server side between batches.
The query looks like this: MATCH (n) WHERE (id(n) % $batchCount) = $batchIndex RETURN n. The number of batches can be derived from the total node count divided by the desired batch size. That seems to work pretty well; a sketch of the client-side loop follows below.
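For illustration, here is a minimal sketch of that batching loop using the official Neo4j.Driver (.NET) package, which the stack trace above suggests is in use. The URI, credentials, batch size of ~100k nodes, and the ProcessNode placeholder are all assumptions, not part of the original solution:
using System;
using System.Threading.Tasks;
using Neo4j.Driver;

public static class BatchedExport
{
    // Placeholder for whatever client-side processing is needed.
    static void ProcessNode(INode node) => Console.WriteLine(node.Id);

    public static async Task RunAsync()
    {
        // Connection details are assumptions for this sketch.
        var driver = GraphDatabase.Driver("neo4j://localhost:7687",
            AuthTokens.Basic("neo4j", "password"));
        var session = driver.AsyncSession();
        try
        {
            // Derive the number of batches from the total node count.
            var countCursor = await session.RunAsync("MATCH (n) RETURN count(n) AS c");
            var total = (await countCursor.SingleAsync())["c"].As<long>();
            var batchCount = (int)Math.Ceiling(total / 100_000.0); // ~100k nodes per batch (assumed)

            for (var batchIndex = 0; batchIndex < batchCount; batchIndex++)
            {
                // Each batch is its own auto-commit transaction, so server-side
                // transaction memory is released between batches.
                var cursor = await session.RunAsync(
                    "MATCH (n) WHERE (id(n) % $batchCount) = $batchIndex RETURN n",
                    new { batchCount, batchIndex });

                await cursor.ForEachAsync(record => ProcessNode(record["n"].As<INode>()));
            }
        }
        finally
        {
            await session.CloseAsync();
            await driver.CloseAsync();
        }
    }
}
Note that each cursor is fully consumed before the next batch starts, so at most one batch is in flight at a time. Since id(n) can have gaps (from deleted nodes), batches may vary in size, but every node is still covered exactly once.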