Query of load csv not completing even after 12 hours


(Bharat Ram Ammu) #1

Hi, I have been using Neo4j for quite a while now. I ran this query earlier before my computer crashed 7 days ago and somehow unable to run it now. I need to create a graph database out of a csv of bank transactions. The original dataset has around 5 million rows and has around 60 columns. This is the query I used, starting from 'Export CSV from real data' demo by Nicole White:

LOAD CSV WITH HEADERS FROM "file:///Transactions_with_risk_scores.csv" AS line
MERGE (transaction:Transaction {id:line.TRANSACTION_ID}) 
SET transaction.base_currency_amount =toInteger(line.AMOUNT_IN_BASE_CURRENCY),
transaction.base_currency = line.BASE_CURRENCY,
transaction.cd_code = line.CREDIT_DEBIT_CODE,
transaction.txn_type_code = line.TRANSACTION_TYPE_CODE,
transaction.instrument = line.INSTRUMENT,
transaction.region= line.REGION,
transaction.scope = line.SCOPE,
transaction.year = toInteger(date[2]),
transaction.month = toInteger(date[1]),
transaction.day = toInteger(date[0]);

I tried:
Using LIMIT 0 before running query as per Micheal Hunger's suggestion in a post about 'Loading Large datasets'.
Used single MERGE per statement (this is first merge and there are 4 other merges to be used) as suggested by Michael again in another post.
Tried CALL apoc.periodic.iterate and apoc.cypher.parallel but doesn't work with LOAD CSV (seem to work only with MERGE and CREATE queries without LOAD CSV).
Increased max heap size to 16G as my laptop is of 16GB RAM. Btw finding it difficult to write this post as I tried running again now with 'PROFILE ' and it is still running since an hour.

Help needed to load query of this 5 million rows dataset. Any help would highly be appreciated.Thanks in advance! I am using Neo4j 3.5.1 on PC.

(Michael Hunger) #2

Do you have duplicate entries in your file? if not you don't need the distinct
I would also move the filter up before the split.

Did you create the constraint for :Transaciton(id) upfront?

Check your statement with EXPLAIN so you see what it does.

I would use heap-size: 12G, page-cache 2G and

(Bharat Ram Ammu) #3

No, it doesn't . I will remove the DISTINCT. Do you think that would make a difference anyway?
I am sorry, what do you mean by moving the filter up?

Yes, I created constraint for Transaction node
The explain statement gives a plan as shown in figure with either 0 or 1 estimated rows. Can't understand it:

Oh, interesting. Why 12G? Does it give some cushion for a 16GB RAM?
Can't find page-cache in neo4j.conf file. Where can I set it?

(Bharat Ram Ammu) #4

The key was to NOT USE DISTINCT. Thanks again Michael, it works. Struggled with this for 4 days!

(Michael Hunger) #5

The OS also needs memory as does teh database for page-cache.

It's odd, so you have 5M lines? how many columns, and how long is a line (in chars)?
Or do you have an example line?

It's odd that the distinct affects the runtime so much, would be good to test and fix if it's an issue.

(Bharat Ram Ammu) #6

Hi Michael,

So I have 5 million rows and around 76 variables. Please find below the header and top 2 sample lines from the database. I checked that 1st and 2nd lines have 684 and 687 characters respectively and the header has 1557 characters. Hope this helps you help me. Thanks in advance!



//1st line

"1","66","66","66",101277,"RBI",2816221,"CORP","66-ANDER-KONTO","HILF",36,31000,"No Risk Assigned","ACTIVE","18-MAY-05",NA,"NONE","NO_PEP","RVV","N","02816221-937","EUR",3.63,"EUR","C","EUR",3.63,"SEC-DOM-EX","06-JUL-17","06-JUL-17","N","08-JUL-17","31000",NA,NA,NA,"AT","N",NA,NA,NA,"N","N","Gutschrift WP-KESt-Verrechnung BLZ 37449 VTNr.: 080005705 70961177/05.^07.2017 Depot 60.003.506 Basisgeschäft 75937477/05.07.2017","BDT","BDT",0,0,"BDT37449170705W1707057096117702",NA,NA,"SECURITIES","DOMESTIC","EXTERNAL",NA,NA,NA,294231040,"G-00",3348757869,"06-JUL-17",6,0,"Browsecat","Lakin and Sons",0,1,3525.3,0,0,0,0,1175.43333333333,1175.43333333333,-0.244258005859137,"VISA 13 digit"

//2nd line

"2","66","66","66",101278,"RBI",2816221,"CORP","66-ANDER-KONTO","HILF",36,31000,"No Risk Assigned","ACTIVE","18-MAY-05",NA,"NONE","NO_PEP","RVV","N","02816221-937","EUR",1186.64,"EUR","C","EUR",1186.64,"SEC-DOM-EX","06-JUL-17","06-JUL-17","N","08-JUL-17","31000",NA,NA,NA,"AT","N",NA,NA,NA,"N","N","Rücknahme Fonds 75937477-05.07.17 Depot       60.003.506 37449 AT0000A^00XN0        R-VIP 75 I (T) 8,000 Stk zu     148,330 EUR   05.07.17","BDT","BDT",0,0,"BDT37449170705W1707057593747700",NA,NA,"SECURITIES","DOMESTIC","EXTERNAL",NA,NA,NA,294231040,"G-00",3348757867,"06-JUL-17",6,0,"Skiba","Bosco-Cremin",0,1,3525.3,0,0,0,0,1175.43333333333,1175.43333333333,-0.244258005859137,"Voyager"


(Michael Hunger) #7

Thanks a lot for the details! How long did it run after the DISTINCT was removed?

(Bharat Ram Ammu) #8

Around 20 Minutes? Sorry I actually didn't use EXPLAIN , so dont have exact time