How to make load csv go faster?

Greetings,

I have the following load csv cypher wrapped inside the apoc.periodic.iterate.

I tried to increase the batch size and I am getting locking errors. It is loading at very slow pace. What is a way to increase the loading speed and not get dead lock errors.

I tried increasing the batchsize and setting parallel to false but it is still very slow. I need to load three files like this each with 1.5 million records. It ran for almost two hours and all I could see was 500K records processed. I have a box with 24 cores and 96Gig memory. CPU is barely used.

Is there a way to make this go faster. ? Thanks a bunch in advance.

CALL apoc.periodic.iterate('

LOAD CSV WITH HEADERS FROM

"file:///<path>/filename.csv" AS line return line ','

MATCH (c:Company {name: line.COM_CODE})

MATCH (frt:FreightReadingTariff {name: toInteger(line.READ_TARIFF)})

MERGE (fb:FreightBasis {name: line.FBASIS})

CREATE (r1:Renege {ID: line.SERID + "_" + line.XXXX + "_" + line.YYYY + "_" + line.ABC + "_" + line.MNO + "_" + line.NNN,

serid: toInteger(line.SERID), xxxx: toInteger(line.XXXX), yyyy: line.YYYY, abc: line.ABC, mno: line.MNO, nnn: toInteger(line.NNN),

a1a: line.a1a, a2b: line.a2b, a2c: line.a2c, a3c: line.a3c, a4c: line.a4c, mmm: toInteger(line.MMM),

n12: toInteger(line.N12), mn1: line.MN1, eff1: line.eff1, disc1: line.disc1, ftp: line.ftp, stp: line.stp,

dowtype: line.dow, pcat: line.pcat, dcat: line.dcat, unav: line.unav, txtbl: toInteger( line.txtbl),

ch1: toInteger( line.ch1)})

MERGE (c)-[:company_to_renege]->(r1)

MERGE (fb)-[:freightbasis_to_renege]-(r1)

MERGE (frt)-[:freightreadingtariff_to_renege]-(r1)

',{batchSize:1, iterateList:true, parallel:true})

What version of Neo4j? version of APOC?
Do you have any indexes defined on
the labels involved in the query? Can you return the output of cypher statement

call db.indexes();

Version of Neo4j: 4.0.3
Version of APOC : 4.0.0.7
Yes, I have defined constraints on all the labels I am using.

|7|"constraint_117e43f5"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["Company"]|["name"]|"native-btree-1.0"|
|---|---|---|---|---|---|---|---|---|---|
|30|"constraint_b2b1ced2"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["FreightReadingTariff"]|["name"]|"native-btree-1.0"|
|21|"constraint_d2391f5c"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["Renege"]|["ID"]|"native-btree-1.0"|
|9|"constraint_e95fcfb3"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["FreightBasis"]|["name"]|"native-btree-1.0"|

was there a reason for wrapping the LOAD CSV inside a apoc.periodic.iterate and not just simple LOAD CSV?

Thanks @dana_canzano. I started with LOAD CSV but it way too slow as well so I thought putting it inside the apoc.periodic.iterate will increase the parallelism but not much luck.

I came across this tool called Kettle (Pentaho Data Integration) with the hopes that it allows me to load the data faster. It looks like this product is a hassle to even install. Wondering if anyone uses this with Neo4J plugins apart from just playing around?

how have you configured your conf/neo4j.conf and parameters

dbms.memory.heap.init_size
dbms.memory.heap.max_size
dbms.memory.pagecache.size

how much RAM is on the machine?

Have your prefaced the LOAD CSV statement with 'EXPLAIN` so as to understand the query plan generated?

In general it's not a good idea to executing loading queries that create relationships in parallel, since relationship creation requires locks on both nodes. In parallel in batches can lead to lock contention and deadlock.

A higher batch size without parallel tends to work fairly well.

You should of course make sure you have the right indexes created to support your initial MATCH and MERGE operations, indexes on:

:Company(name), :FrieghtReadingTariff(name), and :FreightBasis(name)

Also run an EXPLAIN of the query and ensure there are no Eager operators. While these are often needed for correct execution, they require manifestation of all results at that point in the plan, effectively disabling a periodic commit operation, which is another ingredient toward locking issues and potential deadlocks.

dbms.memory.heap.initial_size=28000m
dbms.memory.heap.max_size=28000m
dbms.memory.pagecache.size=45g

I think i ran a command which suggested these number and I am using them. It definitely got me out of the OutOfMemory exceptions I was getting earlier.

Total memory on the box is 96Gigs. I always see about 50Gigs of free memory.

I have been running EXPLAIN on the other queries but don't understand exactly how to interpret and make it better. Is there any video out there that can explain this EXPLAIN. :-).

Thanks again for your help.

Thank you @andrew_bowman .

I did create unique constraints on Company, FreightReadingTariff and FreightBasis as shown below.

CREATE CONSTRAINT ON (f:FreightBasis) ASSERT f.name IS UNIQUE;
CREATE CONSTRAINT ON (f: Company) ASSERT f.name IS UNIQUE;
CREATE CONSTRAINT ON (f:FrieghtReadingTariff) ASSERT f.name IS UNIQUE;

Is this what you mean by creating index. ?

I am thinking I should break this loading query into separate jobs so I can run the creation of nodes in batch and parallel and then create relationships in batch with no parallelism.

Thanks

hello guys ,
i am new to neo4j and i am using neo4j enterprise version on a server and i am running the query to load csv and form nodes but query is taking about 40 hours to run. my data size is about 1.7 million .
here is my query
is there any way it can run faster??

''':auto using periodic commit 2000
load csv with headers from "file:///data.csv" as line
merge (s:State{State:line.State})
merge (p:Product{ProductId:line.ProductId,Price:line.ProductPrice,ProductName:line.ProductName})
merge (c:Category{Category:line.Category})
merge (cr:CategoryRollUp{CategoryRollUp:line.CategoryRollUp})
merge (cu:Customer{CustomerId:line.CustomerId,Grouping:line.NewGrouping})
merge (cu)-[:InteractsWith{Date:line.TransactionDate,Month:line.Month,EventScore:line.EventScore,Event:line.Event,Quantity:line.QtySold}]->(p)
merge (s)-[:HasUser]->(cu)
merge (c)-[:HasProduct]->(p)
merge (cr)-[:HasCategory]->(c)"""

Hi @skmami,

Memory usually means RAM not HARD drive.

Could you please tell us how much data you have to load?
Are you loading via browser or trying Java/Python code?
Don't you see much help with Periodic Iterate?

Hello @shubham.

Welcome to Neo4J tool. It is the awesome product you will know very soon.

I ran into the same issue you are running into with loading the data which is taking very long time.

There is a technique you have to use to while loading the data to go faster.

I ran into this video on you YouTube which you might want to go through before you spend lot of time on trying to load the data.

It is 50 minutes videos but well worth the time.

There are some good tips on how to load faster.

Thanks

3 Likes

I have about 30gigs of data.
I am loading using LOAD CSV.

I am running into lock issues with periodic iterate.

Thanks

Please share query you are trying and also let us know free RAM you have?

@skmami thanks for recommending the video, it worked well.

Good to hear. Glad to help.

Hello,
I also have the same problem. Can you please share the Youtube video link also with me?