Duplicate Edges Created in Neo4j with Bulk DataFrame (via Databricks PySpark Code)

Duplicate edges are created in Neo4j when writing a bulk DataFrame from Databricks PySpark using DataFrame.write().

It works correctly when the DataFrame is repartitioned to 1, but it then takes 3-4 hours to load the relationships for around 60M rows with a batch size of 10000.

Can you share more details about your data source, its structure, the row count, and your code?

I have one DataFrame with ~50M rows and 2 columns, Key1 and Key2, which define the relationship.

DataFrame example:

Key1 | Key2
123  | XYZ
123  | ABC
453  | PRQ
453  | XYZ
453  | LFR
876  | OPE
876  | ZQU
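
For reproduction, a DataFrame with the same shape as the sample above could be built with something like this (a minimal sketch; the SparkSession setup is assumed and is not from my actual job, where the DataFrame has ~50M rows):

from pyspark.sql import SparkSession

# Small hypothetical sample mirroring the table above, just for reproduction.
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        ("123", "XYZ"),
        ("123", "ABC"),
        ("453", "PRQ"),
        ("453", "XYZ"),
        ("453", "LFR"),
        ("876", "OPE"),
        ("876", "ZQU"),
    ],
    ["Key1", "Key2"],
)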


(df
.write
.format("org.neo4j.spark.DataSource")
# Neo4j Bolt URL
.option("url", URL)
# relationship type to write
.option("relationship", "Relationship_Name")
# "keys" strategy: source/target nodes are identified by key columns of the DataFrame
.option("relationship.save.strategy", "keys")
.option("relationship.source.labels", ":Label1")
# match existing source nodes instead of creating them
.option("relationship.source.save.mode", "MATCH")
# DataFrame column used as the source node key
.option("relationship.source.node.keys", "Key1")
.option("relationship.target.labels", ":Label2")
# DataFrame column used as the target node key
.option("relationship.target.node.keys", "Key2")
# rows per transaction
.option("batch.size", "10000")
# append mode creates the relationships
.mode("append")
.save()
)

I am using the connector "neo4j-connector-apache-spark_2.12-5.3.1_for_spark_3.jar".

Problem: if there are ~50M records in the DataFrame, around 70M relationships get created. With df.repartition(1) the count comes out correct, but the load takes too long.
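
To make the comparison concrete, here is a minimal sketch (assuming the same df, URL, and options as above; the duplicate_pairs name is just for illustration) of a check that the input has no repeated Key1/Key2 pairs, plus the repartition(1) variant that produces the expected count:

from pyspark.sql import functions as F

# Sanity check on the input: count (Key1, Key2) pairs that occur more than once.
# If this prints 0, every pair is unique in the source DataFrame, so repeated
# relationships between the same two nodes in Neo4j were introduced during the write.
duplicate_pairs = (
    df.groupBy("Key1", "Key2")
      .count()
      .filter(F.col("count") > 1)
)
print(duplicate_pairs.count())

# Single-partition workaround: all rows are written by one task, which gives the
# expected relationship count but is slow (3-4 hours for ~60M rows in my runs).
(df
.repartition(1)
.write
.format("org.neo4j.spark.DataSource")
.option("url", URL)
.option("relationship", "Relationship_Name")
.option("relationship.save.strategy", "keys")
.option("relationship.source.labels", ":Label1")
.option("relationship.source.save.mode", "MATCH")
.option("relationship.source.node.keys", "Key1")
.option("relationship.target.labels", ":Label2")
.option("relationship.target.node.keys", "Key2")
.option("batch.size", "10000")
.mode("append")
.save()
)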