How to make load csv go faster?

skmami · April 22, 2020, 11:02pm

Greetings,

I have the following load csv cypher wrapped inside the apoc.periodic.iterate.

I tried to increase the batch size and I am getting locking errors. It is loading at very slow pace. What is a way to increase the loading speed and not get dead lock errors.

I tried increasing the batchsize and setting parallel to false but it is still very slow. I need to load three files like this each with 1.5 million records. It ran for almost two hours and all I could see was 500K records processed. I have a box with 24 cores and 96Gig memory. CPU is barely used.

Is there a way to make this go faster. ? Thanks a bunch in advance.

CALL apoc.periodic.iterate('

LOAD CSV WITH HEADERS FROM

"file:///<path>/filename.csv" AS line return line ','

MATCH (c:Company {name: line.COM_CODE})

MATCH (frt:FreightReadingTariff {name: toInteger(line.READ_TARIFF)})

MERGE (fb:FreightBasis {name: line.FBASIS})

CREATE (r1:Renege {ID: line.SERID + "_" + line.XXXX + "_" + line.YYYY + "_" + line.ABC + "_" + line.MNO + "_" + line.NNN,

serid: toInteger(line.SERID), xxxx: toInteger(line.XXXX), yyyy: line.YYYY, abc: line.ABC, mno: line.MNO, nnn: toInteger(line.NNN),

a1a: line.a1a, a2b: line.a2b, a2c: line.a2c, a3c: line.a3c, a4c: line.a4c, mmm: toInteger(line.MMM),

n12: toInteger(line.N12), mn1: line.MN1, eff1: line.eff1, disc1: line.disc1, ftp: line.ftp, stp: line.stp,

dowtype: line.dow, pcat: line.pcat, dcat: line.dcat, unav: line.unav, txtbl: toInteger( line.txtbl),

ch1: toInteger( line.ch1)})

MERGE (c)-[:company_to_renege]->(r1)

MERGE (fb)-[:freightbasis_to_renege]-(r1)

MERGE (frt)-[:freightreadingtariff_to_renege]-(r1)

',{batchSize:1, iterateList:true, parallel:true})

dana_canzano · April 23, 2020, 12:15am

What version of Neo4j? version of APOC?
Do you have any indexes defined on
the labels involved in the query? Can you return the output of cypher statement

call db.indexes();

skmami · April 23, 2020, 4:49am

Version of Neo4j: 4.0.3
Version of APOC : 4.0.0.7
Yes, I have defined constraints on all the labels I am using.

|7|"constraint_117e43f5"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["Company"]|["name"]|"native-btree-1.0"|
|---|---|---|---|---|---|---|---|---|---|
|30|"constraint_b2b1ced2"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["FreightReadingTariff"]|["name"]|"native-btree-1.0"|
|21|"constraint_d2391f5c"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["Renege"]|["ID"]|"native-btree-1.0"|
|9|"constraint_e95fcfb3"|"ONLINE"|100.0|"UNIQUE"|"BTREE"|"NODE"|["FreightBasis"]|["name"]|"native-btree-1.0"|

dana_canzano · April 23, 2020, 8:35pm

was there a reason for wrapping the LOAD CSV inside a apoc.periodic.iterate and not just simple LOAD CSV?

skmami · April 23, 2020, 9:01pm

Thanks @dana_canzano. I started with LOAD CSV but it way too slow as well so I thought putting it inside the apoc.periodic.iterate will increase the parallelism but not much luck.

skmami · April 23, 2020, 10:58pm

I came across this tool called Kettle (Pentaho Data Integration) with the hopes that it allows me to load the data faster. It looks like this product is a hassle to even install. Wondering if anyone uses this with Neo4J plugins apart from just playing around?

dana_canzano · April 23, 2020, 11:56pm

how have you configured your conf/neo4j.conf and parameters

dbms.memory.heap.init_size
dbms.memory.heap.max_size
dbms.memory.pagecache.size

how much RAM is on the machine?

Have your prefaced the LOAD CSV statement with 'EXPLAIN` so as to understand the query plan generated?

andrew_bowman · April 24, 2020, 12:01am

In general it's not a good idea to executing loading queries that create relationships in parallel, since relationship creation requires locks on both nodes. In parallel in batches can lead to lock contention and deadlock.

A higher batch size without parallel tends to work fairly well.

You should of course make sure you have the right indexes created to support your initial MATCH and MERGE operations, indexes on:

:Company(name), :FrieghtReadingTariff(name), and :FreightBasis(name)

Also run an EXPLAIN of the query and ensure there are no Eager operators. While these are often needed for correct execution, they require manifestation of all results at that point in the plan, effectively disabling a periodic commit operation, which is another ingredient toward locking issues and potential deadlocks.

skmami · April 24, 2020, 1:51am

dbms.memory.heap.initial_size=28000m
dbms.memory.heap.max_size=28000m
dbms.memory.pagecache.size=45g

I think i ran a command which suggested these number and I am using them. It definitely got me out of the OutOfMemory exceptions I was getting earlier.

Total memory on the box is 96Gigs. I always see about 50Gigs of free memory.

I have been running EXPLAIN on the other queries but don't understand exactly how to interpret and make it better. Is there any video out there that can explain this EXPLAIN. :-).

Thanks again for your help.

skmami · April 24, 2020, 1:58am

Thank you @andrew_bowman .

I did create unique constraints on Company, FreightReadingTariff and FreightBasis as shown below.

CREATE CONSTRAINT ON (f:FreightBasis) ASSERT f.name IS UNIQUE;
CREATE CONSTRAINT ON (f: Company) ASSERT f.name IS UNIQUE;
CREATE CONSTRAINT ON (f:FrieghtReadingTariff) ASSERT f.name IS UNIQUE;

Is this what you mean by creating index. ?

I am thinking I should break this loading query into separate jobs so I can run the creation of nodes in batch and parallel and then create relationships in batch with no parallelism.

Thanks

shubham · April 28, 2020, 10:14am

hello guys ,
i am new to neo4j and i am using neo4j enterprise version on a server and i am running the query to load csv and form nodes but query is taking about 40 hours to run. my data size is about 1.7 million .
here is my query
is there any way it can run faster??

''':auto using periodic commit 2000
load csv with headers from "file:///data.csv" as line
merge (s:State{State:line.State})
merge (p:Product{ProductId:line.ProductId,Price:line.ProductPrice,ProductName:line.ProductName})
merge (c:Category{Category:line.Category})
merge (cr:CategoryRollUp{CategoryRollUp:line.CategoryRollUp})
merge (cu:Customer{CustomerId:line.CustomerId,Grouping:line.NewGrouping})
merge (cu)-[:InteractsWith{Date:line.TransactionDate,Month:line.Month,EventScore:line.EventScore,Event:line.Event,Quantity:line.QtySold}]->(p)
merge (s)-[:HasUser]->(cu)
merge (c)-[:HasProduct]->(p)
merge (cr)-[:HasCategory]->(c)"""

intouch_vivek · April 28, 2020, 10:57am

Hi @skmami,

Memory usually means RAM not HARD drive.

Could you please tell us how much data you have to load?
Are you loading via browser or trying Java/Python code?
Don't you see much help with Periodic Iterate?

skmami · April 28, 2020, 3:47pm

Hello @shubham.

Welcome to Neo4J tool. It is the awesome product you will know very soon.

I ran into the same issue you are running into with loading the data which is taking very long time.

There is a technique you have to use to while loading the data to go faster.

I ran into this video on you YouTube which you might want to go through before you spend lot of time on trying to load the data.

It is 50 minutes videos but well worth the time.

There are some good tips on how to load faster.

Thanks

skmami · April 28, 2020, 3:49pm

I have about 30gigs of data.
I am loading using LOAD CSV.

I am running into lock issues with periodic iterate.

Thanks

intouch_vivek · April 28, 2020, 5:15pm

Please share query you are trying and also let us know free RAM you have?

shubham · April 30, 2020, 2:20pm

@skmami thanks for recommending the video, it worked well.

skmami · April 30, 2020, 8:54pm

Good to hear. Glad to help.

niloufar.farrokhzad · November 20, 2023, 2:32pm

Hello,
I also have the same problem. Can you please share the Youtube video link also with me?

Topic		Replies	Views
How to make load csv go faster? Neo4j Graph Platform migrated , cypher-tagged	2	191	December 23, 2022
How to make this load csv go faster? Procedures & APOC	5	444	May 15, 2020
How can I improve the performance of this query? Newbie Questions	5	1386	April 4, 2019
How to speed up uploading data from csv in graph db Cypher apoc , cypher , bolt , import	5	4472	August 29, 2019
APOC Function apoc.load.csv() Slowing Down After Importing ~20 Million Nodes Procedures & APOC	10	490	July 6, 2021

How to make load csv go faster?

Related topics