Improving data writing efficiency in python

Hello.

I am using python 3.6 and my neo4j v 4.0

I am currently writing large amounts of data to the neo4j database and it is taking way longer than I hope it to. I could really use some ideas on how to improve my current way of writing data to the database.

At the moment I receive big JSON files with anything between 300 - 3000 values that need to be inserted into neo4j. Some are nodes and some are relations.

Currently i'm using the python neo4j library to write to the database by running the following code:

with driver.session() as session:

    for nodeData in data1:

        session.write_transaction(create_data_function_node, nodeData, fileData)

Which essentially means im running thousands of individual MERGE queries against the database.

Currently my longest "write" time was about 270 seconds which is too much considering that I can't multi-thread writing to neo4j because as far as I know having 2 threads will just end up blocking each other (I did some mild testing on it and it did not seem to help the situation...)

What can I implement in order to try and write data in a more efficient manner ?

You should submit them in batches rather than 1 at a time, and experiment with batch size, but generally something like batches of 10 - 20k is a good starting point depending on your data. As a simplified example you'll have to adapt for your use, consider starting with this:

CREATE (:Person { name: "Bob" });
CREATE (:Person { name: "Angela" });
CREATE (:Person { name: "Sarah" });

And instead doing this:

WITH [{ name: "Bob" }, { name: "Sarah" }, { name: "Angela" }] as batch
UNWIND batch as individual
CREATE (p:Person)
SET p += individual;

Then make your batches 10-20k in size, and you're going to see a massive speedup. The reason why is simple: every time you run a transaction, there is overhead to ensure the transaction is properly committed and so forth. If you do records 1-by-1, you're paying that overhead over and over again.

1 Like

Thank you for your reply.
Just to make it clear I am a beginner with neo4j. I don't quite understand how to create an iterator that can do this for a data set.

If I understood you clearly the actual query should look something like this:

strCypher = """

WITH $batch as batch

UNWIND batch as individual

CREATE (p:Person)

SET p += individual;

"""

tx.run(strCypher, batch=batch)

I am note quite sure how to create the $batch how should i be formatted python wise ? a list of dictionaries ?

A list of dictionaries would exactly do the trick, and submitting them as a parameter as in your example is the right way to approach it.

1 Like

Great ! Thank you :)

Currently the method you proposed uses CREATE but I must use MERGE in the node creation because I have a lot of duplicates in the incoming data.

When I tried changing the current cypher to MERGE it just creates 1 node ^^ because he does not take the properties into consideration.

How could I do batch query wit a MERGE ?

Also, Would it be okay to ask how to implement the same with relation creation wit MERGE as well ?

I have 4 WHERE clauses in my relation creation so the same logic does not seem to work.

MATCH (a),(b)

    WHERE a.data1 = $data1 AND b.data1 = $data1 

    AND a.data2 = $data2 AND b.data2 = $data2 

    MERGE (a)-[r:Relation { rel1: $rel1, rel2: $rel2}]->(b)

How do I batch the WHERE clause ?

You don't need to batch the WHERE clause. You should separate the matches though for performance, like what I have below, it will improve performance. As for the batching, it doesn't matter whether you use create or merge. I used create just as an example because I didn't know what your code was doing. Either way it will work.

Rewrite to this:

WITH $batch as batch
UNWIND batch as record
MATCH (a)
WHERE a.data1 = record.data1 AND a.data2 = record.data2
WITH a
MATCH (b)
WHERE b.data1 = $record.data1 AND b.data2 = record.data2
MERGE (a)-[r:Relation { /* Whatever here */ }]->(b)

Your overall performance will also go a lot better if you put labels on these nodes and don't merge them unlabeled like this.

1 Like

Got it.

Thank you very much I seem to be getting the hang of this now !!

Just a note !

When I ran the query using the current syntax I got an error saying that record could not be found once it tried to

MATCH (b)

I then removed the

WITH a

line from the code and it worked !
I think the second WITH in the syntax essentially overwrites the previous

WITH $batch as batch

that was used to read from $batch.