Should load csv or unwind be used in MERGE situation with two ec2 servers (complexity vs performance)?

richard.lin · August 12, 2020, 9:47pm

(Using neo4j official python driver) I am wondering if unwind or load csv should be used in a situation where there are two ec2 servers. One server contains a neo4j instance. The other server contains ETL scripts. Using load csv would be rather complex, as I am reading csv's from s3, cleaning them, making new csv's (re-uploading to s3), and then presigning url's to use load csv. If I were to use unwind then there would be no pre-signing or re-uploading of csv files (basic unwind cypher statement). Below are sample code segments. The load csv one requires a csv file to be referenced ($link). Performance wise, unwind seems to be slow even with indexes, but it is less complex than load csv.

# unwind 
def largeUnwind(driver, batch, label):
    with driver.session() as session:
        session.run("UNWIND $batch AS props MERGE (n:" + label +"{" + "name:props.name" + "})" + " SET n = props" , batch=batch, label=label)

# load csv
def largeInsert(driver, body, label):
    with driver.session() as session:
        session.run("LOAD CSV WITH HEADERS FROM $link AS line MERGE (s:" + label + "{name:" + "line.name}) SET s=line", link=body)

david_rosenblum · August 13, 2020, 8:47pm

Hi Richard,

Welcome to the forum.

I wonder if you adjust the size of what you passed to the unwind that you could see better performance. You can also try adding USING PERIODIC COMMIT 5000 (or some other number) to the load csv.

Can you tell me what your heap and page cache settings are?

Thank you

richard.lin · August 14, 2020, 1:34pm

Hi David,

Thanks for the reply. Below, is my current settings for heap and page cache. I have tried using periodic commit to load csv, and it seems to work faster than unwinding batches. Another issue is I am wondering if it would be better to use unwind vs load csv for the case of having my python script on one ec2 server and the neo4j database on the other ec2 server. The reason why is because there are limitations with load csv. It seems it can only be used on csv files with presigned urls or the files have to be on the same server as neo4j database.

dbms.memory.pagecache.size=512m

dbms.memory.heap.initial_size=3G
dbms.memory.heap.max_size=5G

david_rosenblum · August 14, 2020, 5:38pm

Are the two EC2 instances in the same region?
If the file is available via https:// from s3, the load csv can be on another server.
You'll note I specify the fields one by one so that I can alter the type. SET s=line will set properties always be a string.
e.g.
LOAD CSV WITH HEADERS FROM 'https://s3.eu-north-1.amazonaws.com/com.neo4j.gds.browser-guide/data/battles.csv' AS row
MERGE (b:Battle {name: row.name})
ON CREATE SET b.year = toInteger(row.year),
b.summer = row.summer,
b.major_death = row.major_death,
b.major_capture = row.major_capture,
b.note = row.note,
b.battle_type = row.battle_type,
b.attacker_size = toInteger(row.attacker_size),
b.defender_size = toInteger(row.defender_size);

A couple of suggestions.
Make initial and max heap the same to avoid the GC when the heap grows.
I make page cache the same size as the database on disk if possible.

Topic		Replies	Views
Query of load csv not completing even after 12 hours Cypher cypher	9	1381	June 19, 2019
Fastest way to load data in neo4j using python Cypher	5	9929	May 5, 2021
Load large CSV with LOAD CSV or python Neo4j Graph Platform migrated	2	1141	August 4, 2023
How to make load csv go faster? Cypher	17	4055	November 20, 2023
Load-CSV very slow with millions of nodes Import / Export load-csv , import , neo4j-import , csv , neo4j	10	11591	April 7, 2022

Demystifying Neo4j UX Research

Should load csv or unwind be used in MERGE situation with two ec2 servers (complexity vs performance)?

Related topics