Improving data writing efficiency in python

atesterguy · April 10, 2020, 8:02pm

Hello.

I am using python 3.6 and my neo4j v 4.0

I am currently writing large amounts of data to the neo4j database and it is taking way longer than I hope it to. I could really use some ideas on how to improve my current way of writing data to the database.

At the moment I receive big JSON files with anything between 300 - 3000 values that need to be inserted into neo4j. Some are nodes and some are relations.

Currently i'm using the python neo4j library to write to the database by running the following code:

with driver.session() as session:

    for nodeData in data1:

        session.write_transaction(create_data_function_node, nodeData, fileData)

Which essentially means im running thousands of individual MERGE queries against the database.

Currently my longest "write" time was about 270 seconds which is too much considering that I can't multi-thread writing to neo4j because as far as I know having 2 threads will just end up blocking each other (I did some mild testing on it and it did not seem to help the situation...)

What can I implement in order to try and write data in a more efficient manner ?

david_allen · April 10, 2020, 8:41pm

You should submit them in batches rather than 1 at a time, and experiment with batch size, but generally something like batches of 10 - 20k is a good starting point depending on your data. As a simplified example you'll have to adapt for your use, consider starting with this:

CREATE (:Person { name: "Bob" });
CREATE (:Person { name: "Angela" });
CREATE (:Person { name: "Sarah" });

And instead doing this:

WITH [{ name: "Bob" }, { name: "Sarah" }, { name: "Angela" }] as batch
UNWIND batch as individual
CREATE (p:Person)
SET p += individual;

Then make your batches 10-20k in size, and you're going to see a massive speedup. The reason why is simple: every time you run a transaction, there is overhead to ensure the transaction is properly committed and so forth. If you do records 1-by-1, you're paying that overhead over and over again.

atesterguy · April 10, 2020, 9:00pm

Thank you for your reply.
Just to make it clear I am a beginner with neo4j. I don't quite understand how to create an iterator that can do this for a data set.

If I understood you clearly the actual query should look something like this:

strCypher = """

WITH $batch as batch

UNWIND batch as individual

CREATE (p:Person)

SET p += individual;

"""

tx.run(strCypher, batch=batch)

I am note quite sure how to create the $batch how should i be formatted python wise ? a list of dictionaries ?

david_allen · April 10, 2020, 9:43pm

A list of dictionaries would exactly do the trick, and submitting them as a parameter as in your example is the right way to approach it.

atesterguy · April 10, 2020, 10:03pm

Great ! Thank you :)

atesterguy · April 11, 2020, 12:05pm

Currently the method you proposed uses CREATE but I must use MERGE in the node creation because I have a lot of duplicates in the incoming data.

When I tried changing the current cypher to MERGE it just creates 1 node ^^ because he does not take the properties into consideration.

How could I do batch query wit a MERGE ?

Also, Would it be okay to ask how to implement the same with relation creation wit MERGE as well ?

I have 4 WHERE clauses in my relation creation so the same logic does not seem to work.

MATCH (a),(b)

    WHERE a.data1 = $data1 AND b.data1 = $data1 

    AND a.data2 = $data2 AND b.data2 = $data2 

    MERGE (a)-[r:Relation { rel1: $rel1, rel2: $rel2}]->(b)

How do I batch the WHERE clause ?

david_allen · April 11, 2020, 11:36pm

You don't need to batch the WHERE clause. You should separate the matches though for performance, like what I have below, it will improve performance. As for the batching, it doesn't matter whether you use create or merge. I used create just as an example because I didn't know what your code was doing. Either way it will work.

Rewrite to this:

WITH $batch as batch
UNWIND batch as record
MATCH (a)
WHERE a.data1 = record.data1 AND a.data2 = record.data2
WITH a
MATCH (b)
WHERE b.data1 = $record.data1 AND b.data2 = record.data2
MERGE (a)-[r:Relation { /* Whatever here */ }]->(b)

Your overall performance will also go a lot better if you put labels on these nodes and don't merge them unlabeled like this.

atesterguy · April 12, 2020, 1:30am

Got it.

Thank you very much I seem to be getting the hang of this now !!

Just a note !

When I ran the query using the current syntax I got an error saying that record could not be found once it tried to

MATCH (b)

I then removed the

WITH a

line from the code and it worked !
I think the second WITH in the syntax essentially overwrites the previous

WITH $batch as batch

that was used to read from $batch.

Topic		Replies	Views
How to accelerate writing Neo4j Graph Platform	6	721	February 6, 2020
Improving performance/time of writing large batches of nodes and relationships Cypher performance , cypher , operations	1	1696	December 19, 2019
How to handle large data insertion in Neo4j Operations performance , cypher	1	2235	July 26, 2021
How to write neo4j in python with neo4j-spark-connector Neo4j Graph Platform	12	2143	November 12, 2020
Python neo4j driver write transaction Python	0	597	July 27, 2020

Get Certified in June!

Improving data writing efficiency in python

Related topics