cancel
Showing results for 
Search instead for 
Did you mean: 

Apoc.export.csv.all fails with huge database when streaming

m_hess
Node

Hi,

I want to export/download the whole database from an online neo4j 4.1.3 Enterprise. I would like to stream the data into my client application for processing. Therefore, I tried to use apoc.export.csv.all. I tested it in the browser, streaming only 5 small batches:

CALL apoc.export.csv.all(null, {stream: true, batchSize: 100})
YIELD data
RETURN data LIMIT 5

For small databases, this succeeds, for my large database (~200M nodes, ~250M relationships) the query fails with
Failed to invoke procedure 'apoc.export.csv.all': Caused by: java.lang.RuntimeException: Error polling, timeout of 100 seconds reached.

Am I on the right track, here, or should I do it differently?

EDIT:
I started by just streaming query results (e.g. MATCH (n) RETURN n) into my client application. But this seems to consume memory increasingly: After 500k nodes I get:

Neo4j.Driver.ClientException: The allocation of an extra 1.9 MiB would use more than the limit 1.0 GiB. Currently using 1022.9 MiB. dbms.memory.transaction.max_size threshold reached
   at Neo4j.Driver.Internal.MessageHandling.ResponsePipelineError.EnsureThrownIf(Func`2 predicate)
   at Neo4j.Driver.Internal.MessageHandling.ResponsePipelineError.EnsureThrown()
   at Neo4j.Driver.Internal.Result.ResultCursorBuilder.NextRecordAsync()
   at Neo4j.Driver.Internal.Result.ResultCursor.FetchAsync()

It seems that each returned node consumes roughly 2kb of data server-side. Is this to be expected?

Thanks in advance!
-- matt

5 REPLIES 5

david_allen
Neo4j
Neo4j

You might need to take a different approach here.

Neo4j has some limits on the total size of a transaction (which you've already seen) and there are also limits on how long a query can take before it times out. Both of these things are configurable, but if you're trying to export the entire database to CSV in one go, it might not be avoidable that you're doing a single TX that in the end is the size of the database. That's going to be tough unless you jack up your query timeout and also have more memory available than you have data on disk.

HI David,

thanks for your quick reply.

In theory, there is no need to keep all the data in memory for the transaction. It should just be an unbuffered stream to my client application. So question is, if technically the data is expected to be held in memory?

Are there other techniques for exporting all data into a standard format (like CSV or JSON)?

I'm not sure of all of the internal specifics of APOC CSV exports, to be honest.

Another option you might experiment with is something like this:

echo "MATCH (p:Person) RETURN p.name, p.age;" | cypher-shell -a localhost -u neo4j -p password --format plain > dump.csv

Hi David,

Where can I find the configuration parameter for apoc.export.csv.query() timeout?

I've only found dbms.transactions.timeout which default is 0 seconds, also dbms.lock.acquisition.timeout which is also set to 0 seconds as default.

Thanks

m_hess
Node

Hi,

I wanted to given some feedback.

My current solution is to split the MATCH (n) RETURN n query into multiple queries, so that transaction memory is released at server-side.
The query looks like this: MATCH (n) WHERE (id(n) % $batchCount) = $batchIndex RETURN n. The number of batches can be determined easily. That seems to work pretty well.

Nodes 2022
Nodes
NODES 2022, Neo4j Online Education Summit

On November 16 and 17 for 24 hours across all timezones, you’ll learn about best practices for beginners and experts alike.