I'd like to ask a follow-up question to the one posed on Stack Overflow here.
@lyonwj notes in his answer that "We can batch multiple queries into one transaction for better performance... Typically we can batch ~20k database operations in a single transaction."
For convenience, I have pasted the example code below:
    tx = graph.begin()
    for index, row in df.iterrows():
        tx.evaluate('''
            MATCH (a:Label1 {property: $label1})
            MERGE (a)-[r:R_TYPE]->(b:Label2 {property: $label2})
        ''', parameters={'label1': row['label1'], 'label2': row['label2']})
    tx.commit()
Well, what if the Pandas dataframe had far more than 20,000 rows? Say, 10 million. I know that if we were using LOAD CSV directly from the cypher-shell, we would include USING PERIODIC COMMIT 20000 before the LOAD CSV clause to make it commit every 20,000 rows of the CSV.
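For reference, the cypher-shell version would look roughly like this (the file name and header names are placeholders I've made up to mirror the dataframe columns):

    USING PERIODIC COMMIT 20000
    LOAD CSV WITH HEADERS FROM 'file:///data.csv' AS line
    MATCH (a:Label1 {property: line.label1})
    MERGE (a)-[r:R_TYPE]->(b:Label2 {property: line.label2})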
What would be the equivalent of USING PERIODIC COMMIT 20000 when importing from a large dataframe with py2neo?
The py2neo docs mention an optional autocommit argument that makes each statement commit in its own transaction (almost the opposite of what I want), but I don't see anything about specifying PERIODIC COMMIT.
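For context, the autocommit behaviour I mean would look something like this (just a sketch; the URI and credentials are placeholders, and I'm assuming a py2neo version where Graph takes an auth tuple):

    from py2neo import Graph

    graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

    # Autocommit: each run() executes in its own transaction, so this
    # commits once per statement -- the per-row overhead I'm trying to avoid.
    graph.run("MERGE (b:Label2 {property: $label2})", label2="value")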
The closest thing to a workaround I can think of is to do a modulo check on the row index inside the iterrows loop, committing (and starting a fresh transaction) every 20,000 rows, and keeping a final commit() outside the loop for the remainder. So the modified code would look something like this:
    tx = graph.begin()
    for index, row in df.iterrows():
        tx.evaluate('''
            MATCH (a:Label1 {property: $label1})
            MERGE (a)-[r:R_TYPE]->(b:Label2 {property: $label2})
        ''', parameters={'label1': row['label1'], 'label2': row['label2']})
        if (index + 1) % 20000 == 0:
            tx.commit()           # flush this batch of 20,000
            tx = graph.begin()    # start a fresh transaction
    tx.commit()                   # commit the final partial batch
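(I'm assuming here that a committed py2neo transaction can't be reused, hence the fresh graph.begin() after each commit, and that the dataframe has a default integer index so that index counts rows; with an arbitrary index, enumerate(df.iterrows()) would be safer.)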
Would this be a viable workaround? Is there any other way?