Hi, I'm creating a graph with approximately 200 million nodes and 500 million relationships. Node creation doesn't take long, but the problem lies in relationship creation. I'm using a PySpark DataFrame and writing the data through the Spark connector. This is my query for writing relationships:
WITH event
LIMIT 1
CALL apoc.periodic.iterate(
'CYPHER runtime=parallel
 UNWIND $batch AS event
 MATCH (a:EDISubscriberHL {TransactionID: event.Encounter2})
 MATCH (b:EDIClaim {PatientAccountNumber: event.Claims_ClaimInfo_PatientAccountNumber})
 WHERE NOT EXISTS((a)-[:HAS_CLAIM]->(b))
 RETURN event, a, b',
'CREATE (a)-[:HAS_CLAIM]->(b)',
{batchSize:1000, params:{batch: $events}, parallel:false}
)
YIELD batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams, updateStatistics
RETURN batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams, updateStatistics;
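For reference, here is roughly how that query string is wired into the write call further down (a minimal sketch: the constant name APOC_BATCH_SIZE is mine, and I'm assuming the connector prepends something like UNWIND $events AS event to the query and supplies each write batch as the $events parameter, which is why the WITH event LIMIT 1 trick is there):

# Minimal sketch: keep the apoc.periodic.iterate batch size in one place so it can be
# tuned alongside the connector's batch.size (5000 in the write call below).
# Assumption: the Spark connector passes each write batch to this query as $events.
APOC_BATCH_SIZE = 1000  # rows per inner transaction in apoc.periodic.iterate

query = """
WITH event
LIMIT 1
CALL apoc.periodic.iterate(
  'CYPHER runtime=parallel
   UNWIND $batch AS event
   MATCH (a:EDISubscriberHL {TransactionID: event.Encounter2})
   MATCH (b:EDIClaim {PatientAccountNumber: event.Claims_ClaimInfo_PatientAccountNumber})
   WHERE NOT EXISTS((a)-[:HAS_CLAIM]->(b))
   RETURN event, a, b',
  'CREATE (a)-[:HAS_CLAIM]->(b)',
  {batchSize: %d, params: {batch: $events}, parallel: false}
)
YIELD batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams, updateStatistics
RETURN batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams, updateStatistics
""" % APOC_BATCH_SIZE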
And this is the Spark connector write call, including the batch size I'm using:
sub_df3.write.format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", URL) \
    .option("authentication.basic.username", USER) \
    .option("authentication.basic.password", PASSWORD) \
    .option("database", DATABASE) \
    .option("batch.size", 5000) \
    .option("query", query) \
    .option("transaction.retries", 5) \
    .save()
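In case it's relevant to the batch-size question: the number of DataFrame partitions presumably also matters, since, as far as I understand, each partition writes over its own connection and sends its rows in batch.size chunks. A rough sketch of controlling that from the Spark side (the repartition count is just an illustrative value, not a recommendation):

# Rough sketch: cap the number of concurrent write transactions by repartitioning first.
# Assumption: each DataFrame partition writes through its own connection, so the
# partition count bounds how many transactions hit Neo4j at the same time.
NUM_WRITE_PARTITIONS = 8  # illustrative value

sub_df3.repartition(NUM_WRITE_PARTITIONS).write.format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", URL) \
    .option("authentication.basic.username", USER) \
    .option("authentication.basic.password", PASSWORD) \
    .option("database", DATABASE) \
    .option("batch.size", 5000) \
    .option("query", query) \
    .option("transaction.retries", 5) \
    .save()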
My question is: what batch size should I use in the apoc.periodic.iterate call (batchSize), and what should the Spark connector's batch.size option be?