Duplicate Edges Created in Neo4j with Bulk DataFrame (via Databricks PySpark Code)

Duplicate edges are created in Neo4j when writing a bulk DataFrame from Databricks PySpark using DataFrame.write().

It works correctly when the DataFrame is repartitioned to 1, but it then takes 3-4 hours to load the relationships for around 60M rows with a batch size of 10000.

Can you share more details about your data source, its structure, the row count, and your code?

I have one DataFrame with ~50M rows and 2 columns, Key1 and Key2, which define the relationship.

DataFrame example:

Key1 | Key2
123  | XYZ
123  | ABC
453  | PRQ
453  | XYZ
453  | LFR
876  | OPE
876  | ZQU
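
For reproduction, a DataFrame with the same shape as the sample above could be built with something like this (a minimal sketch; the SparkSession setup is assumed and is not from my actual job, where the DataFrame has ~50M rows):

from pyspark.sql import SparkSession

# Small hypothetical sample mirroring the table above, just for reproduction.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("123", "XYZ"),
        ("123", "ABC"),
        ("453", "PRQ"),
        ("453", "XYZ"),
        ("453", "LFR"),
        ("876", "OPE"),
        ("876", "ZQU"),
    ],
    ["Key1", "Key2"],
)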


(df
.write
.format("org.neo4j.spark.DataSource")
# Neo4j Bolt URL
.option("url", URL)
# relationship type to write
.option("relationship", "Relationship_Name")
# "keys" strategy: source/target nodes are identified by key columns of the DataFrame
.option("relationship.save.strategy", "keys")
.option("relationship.source.labels", ":Label1")
# match existing source nodes instead of creating them
.option("relationship.source.save.mode", "MATCH")
# DataFrame column used as the source node key
.option("relationship.source.node.keys", "Key1")
.option("relationship.target.labels", ":Label2")
# DataFrame column used as the target node key
.option("relationship.target.node.keys", "Key2")
# rows per transaction
.option("batch.size", "10000")
# append mode creates the relationships
.mode("append")
.save()
)

I am using the connector "neo4j-connector-apache-spark_2.12-5.3.1_for_spark_3.jar".

Problem: if there are ~50M records in the DataFrame, around 70M relationships get created. With df.repartition(1) the count comes out correct, but the load takes too long.
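
To make the comparison concrete, here is a minimal sketch (assuming the same df, URL, and options as above; the duplicate_pairs name is just for illustration) of a check that the input has no repeated Key1/Key2 pairs, plus the repartition(1) variant that produces the expected count:

from pyspark.sql import functions as F

# Sanity check on the input: count (Key1, Key2) pairs that occur more than once.
# If this prints 0, every pair is unique in the source DataFrame, so repeated
# relationships between the same two nodes in Neo4j were introduced during the write.
duplicate_pairs = (
    df.groupBy("Key1", "Key2")
      .count()
      .filter(F.col("count") > 1)
)
print(duplicate_pairs.count())

# Single-partition workaround: all rows are written by one task, which gives the
# expected relationship count but is slow (3-4 hours for ~60M rows in my runs).
(df
.repartition(1)
.write
.format("org.neo4j.spark.DataSource")
.option("url", URL)
.option("relationship", "Relationship_Name")
.option("relationship.save.strategy", "keys")
.option("relationship.source.labels", ":Label1")
.option("relationship.source.save.mode", "MATCH")
.option("relationship.source.node.keys", "Key1")
.option("relationship.target.labels", ":Label2")
.option("relationship.target.node.keys", "Key2")
.option("batch.size", "10000")
.mode("append")
.save()
)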