How to speed up uploading data from csv in graph db

kalyan_b_aninda · August 19, 2019, 9:08am

Hello, this is my first topic in the Neo4j community, and I am still learning Neo4j. I am trying to load data into a Neo4j graph database from CSV files, and I have written a Python script for that. Some of my CSV files are large (3.2 GB or more) and contain roughly 50 million rows or more. I did a bulk import first and it worked well, but I now need to load data into an existing database, so I used LOAD CSV. Since my data is very large, I used the APOC library (version 3.5.0.4) for its parallel features. My current Cypher query is:

CALL apoc.periodic.iterate('
                     load csv with headers from "file:///relcashoutTest.csv" AS row return row ','
                     MATCH (a:CUSTOMER {WALLETID: row.CUSTOMER})
                     MATCH (c:AGENT{WALLETID: row.AGENT})
                     MERGE(a)-[r:CASHOUT]->(c)
                     return count(*)
                     ',{batchSize:1000, iterateList:true, parallel:true})

This query is for a single CASHOUT relationship, but I have others; in my Python script I build them dynamically. The good news is that node creation works properly and takes around 105 seconds. The problem is building the relationships between nodes. My Amazon instance has 32 CPU cores and 240 GB of RAM. I have observed that the parallelism works fine at first, but after some time it no longer uses all cores; in my case it stays between 2 and 7 cores. I printed some statistics: creating 10 relationships takes 39 seconds. Yesterday I ran the relationship query above for 8 hours and did not get any output. I am not sure whether constraints and indexes would help, because of the read/write trade-off. Kindly help me solve this problem. My Python script with this query works fine for small data sets. Thank you in advance. My Neo4j version is 3.5.8.

soham.dhodapkar · August 19, 2019, 10:13pm

Hi,
Good choice to go with apoc.
Can you try increasing the batchSize to maybe 10K?
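Alternatively, try a plain LOAD CSV query (without apoc.periodic.iterate) with this clause in front of it: USING PERIODIC COMMIT 10000. The clause is only valid directly before LOAD CSV, so it cannot be placed before the CALL statement.
Here is the documentation for this clause: "Planner hints and the USING keyword" and "LOAD CSV", both in the Cypher Manual.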

andrew_bowman · August 19, 2019, 11:46pm

Maybe avoid parallel when merging relationships, that can be a recipe for lock contention, as relationship creation requires locks on the start and end nodes. If the same nodes appear multiple times in the CSV then there could be contention and deadlock issues between concurrently executing batches.

Also make sure you have indexes on :CUSTOMER(WALLETID) and :AGENT(WALLETID)
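
As a minimal sketch of the index advice above, the following Python helper builds the schema statements in Neo4j 3.5 syntax (the version mentioned in the question). The function name `schema_statements` and the `unique` switch are hypothetical, not something from the thread; a uniqueness constraint only fits if each WALLETID occurs at most once per label, which is an assumption about the data.

```python
def schema_statements(labels, prop="WALLETID", unique=False):
    """Build Neo4j 3.5-style schema statements, one per label.

    Hypothetical helper: with unique=False it emits plain indexes,
    with unique=True it emits uniqueness constraints (which are
    backed by an index as well).
    """
    statements = []
    for label in labels:
        if unique:
            # Assumes every node with this label has a distinct value.
            statements.append(
                f"CREATE CONSTRAINT ON (n:{label}) ASSERT n.{prop} IS UNIQUE"
            )
        else:
            statements.append(f"CREATE INDEX ON :{label}({prop})")
    return statements


if __name__ == "__main__":
    # Print the statements so they can be run once, before the import.
    for stmt in schema_statements(["CUSTOMER", "AGENT"]):
        print(stmt)
```

Each statement would be run once before the relationship import, so that both MATCH clauses can look nodes up through the index instead of scanning every node with the label.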

kalyan_b_aninda · August 21, 2019, 11:39am

I tried increasing the batch size to 10000, but the result is the same.

michael.hunger · August 25, 2019, 6:29pm

Remove the parallel and increase the batch size to 50k or such.
You can also remove the RETURN count(*)
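
Since the original poster builds these queries dynamically in Python, the suggestions above could be sketched as a small query builder. This is only an illustration: the function name `build_relationship_query` and its parameters are hypothetical, and the values are interpolated directly into the query text, so it assumes trusted label, column, and file names.

```python
def build_relationship_query(csv_file, rel_type, start_label, start_col,
                             end_label, end_col, key="WALLETID",
                             batch_size=50000):
    """Build an apoc.periodic.iterate call for one relationship type.

    Hypothetical helper: parallel is switched off and the batch size
    defaults to 50000, following the advice in this thread. The inner
    statement has no RETURN count(*).
    """
    # First statement: stream the CSV rows.
    inner = f'LOAD CSV WITH HEADERS FROM "file:///{csv_file}" AS row RETURN row'
    # Second statement: look up both end nodes, then merge the relationship.
    action = (
        f"MATCH (a:{start_label} {{{key}: row.{start_col}}}) "
        f"MATCH (c:{end_label} {{{key}: row.{end_col}}}) "
        f"MERGE (a)-[:{rel_type}]->(c)"
    )
    return (
        f"CALL apoc.periodic.iterate('{inner}', '{action}', "
        f"{{batchSize: {batch_size}, iterateList: true, parallel: false}})"
    )


if __name__ == "__main__":
    print(build_relationship_query(
        "relcashoutTest.csv", "CASHOUT",
        "CUSTOMER", "CUSTOMER", "AGENT", "AGENT",
    ))
```

The printed string can be passed to whatever driver call the script already uses; other relationship types only need different arguments.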

What is your heap/page-cache configuration for Neo4j?

Do you have the constraints?

Can you share:

EXPLAIN LOAD CSV WITH HEADERS FROM "file:///relcashoutTest.csv" AS row
MATCH (a:CUSTOMER {WALLETID: row.CUSTOMER})
MATCH (c:AGENT {WALLETID: row.AGENT})
MERGE (a)-[r:CASHOUT]->(c)
RETURN count(*)

kalyan_b_aninda · August 29, 2019, 4:43am

@Michael thanks for your reply. I have put a constraint on the ID, and performance has improved significantly.
