We have a "very big" data challenge: we are using Neo4j Cypher to reduce a dataset from 10^10 records down to 10^6 records. The approach is to run 10^2 DB instances in batch, each holding about 10^8 nodes and relationships.
Each instance needs to export about 10^5 records produced by a query. While the import is impressively fast, the best export throughput we're getting is about 200 records per second. Clearly, exporting 10^5 records at such a trickle is not viable.
To be clear, the import speed is good: I can import 6M nodes, 33M relationships, and 19M properties in 200-240 seconds. In most cases that is enough to load about 10^8 records within minutes.
The queries themselves also perform well, starting to stream in under a second; no issue there.
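For contrast with the export numbers below, a back-of-the-envelope calculation of the import throughput from these figures (treating nodes, relationships, and properties together as "entities", and taking ~220 s as the midpoint of the 200-240 s range):

```python
# Import throughput implied by the figures above (my own rough arithmetic,
# not a measured number): 6M nodes + 33M relationships + 19M properties.
entities = 6_000_000 + 33_000_000 + 19_000_000
import_rate = entities / 220  # entities per second, at the ~220 s midpoint
print(f"{import_rate:,.0f} entities/s")  # on the order of 260k/s
```

So the import runs three orders of magnitude faster than the ~200 records/s export.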
The export throughput, however, is not acceptable. I have tested three approaches for the export:
Option 1: Using cypher shell pipes:
"cat query.txt | ./bin/cypher-shell -u ... -p ... --format plain > result.txt"
Option 2: Using python py2neo
Option 3: Using REST via the webapp.
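For reference, Option 3 uses the Neo4j 3.x transactional HTTP endpoint. A minimal sketch of the request I'm sending is below; the host, credentials, and query are placeholders, and the actual network call is commented out since it needs a live server:

```python
import json
import urllib.request

# Hypothetical export query; the real one returns ~10^5 rows.
query = "MATCH (n:Record) RETURN n.id"

# Payload shape for the Neo4j 3.x transactional endpoint:
# one statement, row-formatted results.
payload = json.dumps({
    "statements": [
        {"statement": query, "resultDataContents": ["row"]}
    ]
}).encode("utf-8")

req = urllib.request.Request(
    "http://localhost:7474/db/data/transaction/commit",  # placeholder host
    data=payload,
    headers={"Content-Type": "application/json",
             "Accept": "application/json"},
)
# resp = urllib.request.urlopen(req)  # streams the JSON result body
```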
Options 1 and 2 perform almost identically, peaking at about 200 records per second and flattening out above 20,000 records. Specifically, retrieving 22,280 records takes 111.67 seconds. Note that this query starts streaming within <50 ms; virtually all of the time is spent streaming!
With Option 3, the same 22,280 records take 115 seconds, implying that the REST API overhead is <4%.
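Working through the arithmetic on these measurements:

```python
# Export throughput implied by the timings above.
records = 22_280
cypher_shell_rate = records / 111.67   # Options 1 and 2: ~200 records/s
rest_rate = records / 115.0            # Option 3: slightly slower
rest_overhead = (115.0 - 111.67) / 111.67  # ~3%, i.e. under 4%
print(f"{cypher_shell_rate:.1f} rec/s, REST overhead {rest_overhead:.1%}")
```

So the REST layer is not the bottleneck; all three paths converge on the same ~200 records/s ceiling.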
I was expecting to achieve >10,000 records per second.
How do I make the export >50x faster?