Hi, I'm creating a graph with approximately 200 million nodes and 500 million relationships. Node creation doesn't take long, but the problem lies in relationship creation. I'm using a PySpark DataFrame and writing the data through the Spark connector. This is my query for writing relationships:
WITH event
LIMIT 1
CALL apoc.periodic.iterate(
'CYPHER runtime=parallel
 UNWIND $batch AS event
 MATCH (a:EDISubscriberHL {TransactionID: event.Encounter2})
 MATCH (b:EDIClaim {PatientAccountNumber: event.Claims_ClaimInfo_PatientAccountNumber})
 WHERE NOT EXISTS((a)-[:HAS_CLAIM]->(b))
 RETURN event, a, b',
'CREATE (a)-[:HAS_CLAIM]->(b)',
{batchSize:1000, params:{batch: $events}, parallel:false}
)
YIELD batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams, updateStatistics
RETURN batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams, updateStatistics;
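For reference, here is roughly how that query string is wired into the write call further down (a minimal sketch: the constant name APOC_BATCH_SIZE is mine, and I'm assuming the connector prepends something like UNWIND $events AS event to the query and supplies each write batch as the $events parameter, which is why the WITH event LIMIT 1 trick is there):

# Minimal sketch: keep the apoc.periodic.iterate batch size in one place so it can be
# tuned alongside the connector's batch.size (5000 in the write call below).
# Assumption: the Spark connector passes each write batch to this query as $events.
APOC_BATCH_SIZE = 1000  # rows per inner transaction in apoc.periodic.iterate

query = """
WITH event
LIMIT 1
CALL apoc.periodic.iterate(
  'CYPHER runtime=parallel
   UNWIND $batch AS event
   MATCH (a:EDISubscriberHL {TransactionID: event.Encounter2})
   MATCH (b:EDIClaim {PatientAccountNumber: event.Claims_ClaimInfo_PatientAccountNumber})
   WHERE NOT EXISTS((a)-[:HAS_CLAIM]->(b))
   RETURN event, a, b',
  'CREATE (a)-[:HAS_CLAIM]->(b)',
  {batchSize: %d, params: {batch: $events}, parallel: false}
)
YIELD batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams, updateStatistics
RETURN batches, total, timeTaken, committedOperations, failedOperations, failedBatches, retries, errorMessages, batch, operations, wasTerminated, failedParams, updateStatistics
""" % APOC_BATCH_SIZE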
And this is the Spark connector write call, including the batch size I'm using:
sub_df3.write.format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", URL) \
    .option("authentication.basic.username", USER) \
    .option("authentication.basic.password", PASSWORD) \
    .option("database", DATABASE) \
    .option("batch.size", 5000) \
    .option("query", query) \
    .option("transaction.retries", 5) \
    .save()
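In case it's relevant to the batch-size question: the number of DataFrame partitions presumably also matters, since, as far as I understand, each partition writes over its own connection and sends its rows in batch.size chunks. A rough sketch of controlling that from the Spark side (the repartition count is just an illustrative value, not a recommendation):

# Rough sketch: cap the number of concurrent write transactions by repartitioning first.
# Assumption: each DataFrame partition writes through its own connection, so the
# partition count bounds how many transactions hit Neo4j at the same time.
NUM_WRITE_PARTITIONS = 8  # illustrative value

sub_df3.repartition(NUM_WRITE_PARTITIONS).write.format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", URL) \
    .option("authentication.basic.username", USER) \
    .option("authentication.basic.password", PASSWORD) \
    .option("database", DATABASE) \
    .option("batch.size", 5000) \
    .option("query", query) \
    .option("transaction.retries", 5) \
    .save()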
My question is: what batch size should I use in the apoc.periodic.iterate call (batchSize), and what should the Spark connector's batch.size option be?