Uploading large amounts of data into Neo4j Community Edition


I am working with Neo4j Community Edition running on EC2 (r5.16xlarge instance type). I am trying to upload data from S3 buckets.

I have a number of CSV files (each with 1M records) and I am trying to upload the data into Neo4j. I used LOAD CSV initially, and now I am using apoc.load.csv after checking out a few topics on the community forum. Even this process is taking a lot of time to upload the data. My query looks something like the one below.

CALL apoc.periodic.iterate(
'CALL apoc.load.csv($file_path) YIELD map AS row RETURN row',
'MERGE ....
MERGE ....',
{batchSize:10000, parallel:true});

As seen above, I have a lot of MERGE operations in the query. Even uploading 10K records takes more than a minute, and I need to upload millions of records every minute. On the forum, someone suggested that I try neo4j-admin import, but for my use case I need to mutate the graph with new data every hour.
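For illustration, here is a sketch of the same batched pattern with explicit quoting. The :Person label, id property, and KNOWS relationship type are hypothetical stand-ins, not from the original query. Note that parallel:true can cause lock contention or deadlocks when concurrent batches MERGE relationships touching the same nodes, so relationship-heavy loads are often run with parallel:false.

```cypher
// Sketch only: :Person, id, source_id, target_id, and KNOWS are hypothetical names.
CALL apoc.periodic.iterate(
  'CALL apoc.load.csv($file_path) YIELD map AS row RETURN row',
  'MERGE (a:Person {id: row.source_id})
   MERGE (b:Person {id: row.target_id})
   MERGE (a)-[:KNOWS]->(b)',
  // parallel:false avoids deadlocks when batches MERGE overlapping nodes
  {batchSize: 10000, parallel: false, params: {file_path: $file_path}}
);
```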

I tried changing the EC2 instance type to one with more memory and CPU, but with no success. Please suggest how I should go about this.

Thank you!

Do you have an index created on the properties you are using in the MERGE operation?

The MERGE operations are on nodes and relationships, not on their properties. I’m using MERGE to create the nodes and relationships. Is there a better way to do this?

The nodes and relationships shouldn’t be duplicated. This is why I’m using MERGE.


I understand that you do MERGE on nodes. But you should have a property that differentiates two nodes, right? Ideally this would be the property on which you don't want to create duplicate nodes, and it is why you use MERGE rather than CREATE. What I am saying is that you need indexes on these node properties for MERGE to run efficiently.
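As a sketch, a uniqueness constraint on the merge key both guarantees no duplicates and creates a backing index that MERGE can use. The :Person label and id property here are hypothetical; adjust them to your model. (This is Neo4j 4.4+ syntax; older versions use ASSERT instead of REQUIRE.)

```cypher
// Hypothetical label/property; a constraint also creates a backing index.
CREATE CONSTRAINT person_id_unique IF NOT EXISTS
FOR (p:Person) REQUIRE p.id IS UNIQUE;
```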

The property of each node is the ID that differentiates one node from another. Can you please suggest how to create indices on this property while loading the data, to make it faster? Thanks.

Please refer to this link: index creation using Cypher.
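For example, a plain index on the ID property can be created once before running the load, so every MERGE lookup becomes an index seek instead of a label scan. The :Person label and id property are hypothetical stand-ins for your own model.

```cypher
// Create the index once, before the import; hypothetical names.
CREATE INDEX person_id_index IF NOT EXISTS
FOR (p:Person) ON (p.id);
```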