Are there any benchmarking figures for the Neo4j Spark connector (DataFrame to DB)?

I am wondering if anybody can point me to benchmarking figures for Spark DataFrame writes to a Neo4j database using neo4j-spark-connector.

I am currently using the following versions on a 60-core / 60-executor cluster:

Neo4j 3.5
Spark 2.4.0

Using Neo4jDataFrame.mergeEdgeList(), I have tried batch sizes of 10k, 20k and 40k.

However, it seems to take an unreasonable amount of time.

100k records take about 35 minutes. For a million records, the job appeared to hang for more than 14 hours. There seems to be no progress in the Spark UI, and all tasks show 0/100.

What write rates should I expect to a Neo4j database using the Spark connector, and what is the best way to optimise writes of larger DataFrames (containing millions of records) to ensure faster loads?


Neo4j has a new approach to the Spark connector, which can be found here. It includes architectural guidance for getting the best performance.

It's hard to say exactly what performance each user will get, because it depends heavily on your data model and setup. But we have seen tens of thousands of node writes per second on moderate hardware, for nodes with around 10 properties, when written using the "normalized loading" approach that's documented on that page.
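
As a rough illustration of what "normalized loading" could look like from PySpark, here is a minimal sketch that writes one DataFrame per node label first and then one DataFrame per relationship type. The data source name and option keys are the ones I believe the newer connector uses (org.neo4j.spark.DataSource, "labels", "node.keys", "relationship.*", "batch.size"), so please check them against the documentation for your connector version. The helper names (node_options, relationship_options, write_to_neo4j) and the default batch size are purely hypothetical.

```python
# Sketch only: the format name and option keys are assumed from the newer
# Neo4j Connector for Apache Spark; verify them for the version you run.
NEO4J_FORMAT = "org.neo4j.spark.DataSource"


def node_options(url, label, key, batch_size=20000):
    """Options for merging one node label, matched on one key column."""
    return {
        "url": url,
        "labels": ":" + label,
        "node.keys": key,
        "batch.size": str(batch_size),
    }


def relationship_options(url, rel_type, source, target, batch_size=20000):
    """Options for writing one relationship type between existing nodes.

    source and target are (label, "dataframe_column:node_property") pairs.
    """
    src_label, src_keys = source
    tgt_label, tgt_keys = target
    return {
        "url": url,
        "relationship": rel_type,
        "relationship.save.strategy": "keys",
        "relationship.source.labels": ":" + src_label,
        "relationship.source.save.mode": "Match",  # nodes were written earlier
        "relationship.source.node.keys": src_keys,
        "relationship.target.labels": ":" + tgt_label,
        "relationship.target.save.mode": "Match",
        "relationship.target.node.keys": tgt_keys,
        "batch.size": str(batch_size),
    }


def write_to_neo4j(df, options, partitions=None):
    """Write a Spark DataFrame using the given connector options."""
    if partitions is not None:
        # Fewer partitions means fewer concurrent transactions on the same nodes.
        df = df.repartition(partitions)
    writer = df.write.format(NEO4J_FORMAT).mode("Overwrite")
    for name, value in options.items():
        writer = writer.option(name, value)
    writer.save()
```

With hypothetical DataFrames people_df and knows_df, you would first call write_to_neo4j(people_df, node_options(url, "Person", "id")) and then write_to_neo4j(knows_df, relationship_options(url, "KNOWS", ("Person", "src:id"), ("Person", "dst:id")), partitions=1). Writing relationships from a single partition is slower per task, but it avoids several executors locking the same nodes at once. It is also worth confirming that the key property has an index or uniqueness constraint in Neo4j, since a merge on an unindexed property has to scan every node with that label.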