We want to create more than 20 million nodes in Neo4j. While pushing the node data into Neo4j we observed that database performance was quite slow, so we tried the following:
We used the Py2neo client library and toolkit for working with Neo4j.
- We used the graph.create() API to create nodes without any transaction-control APIs.
- We built a single query to create all the nodes and ran it with graph.run(), but this also took a lot of time.
- We then used the transaction API: graph.begin() returns a transaction object, we inserted the nodes with create(), and finally called the commit API on the transaction object. We were able to commit when the file contained a small number of nodes, but for large numbers of nodes the manual commit failed. To work around this we used the auto-commit APIs provided by py2neo and closed the transaction manually at the end of the code, but performance was still not as good as expected.
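For reference, a batched variant of the approaches above can be sketched as follows. The Person label, the row/property shape, the batch size, and the connected graph object are all assumptions for illustration; the chunking helper is plain Python, and each graph.run() call here runs as its own transaction:

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items from `items`."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

# One parameterized Cypher statement per batch instead of one giant
# query or one CREATE per node (label and property shape are assumed).
UNWIND_QUERY = """
UNWIND $rows AS row
CREATE (n:Person)
SET n = row
"""

def insert_in_batches(graph, rows, batch_size=10_000):
    # `graph` is assumed to be a connected py2neo Graph instance;
    # sending bounded batches keeps transaction memory under control.
    for batch in chunked(rows, batch_size):
        graph.run(UNWIND_QUERY, rows=batch)
```

The idea is that each UNWIND statement creates thousands of nodes in a single round trip, which is usually far cheaper than one transaction (or one query) per node.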
What is the best practice for inserting millions of nodes at a time into a Neo4j graph database?
Does importing the node data from a CSV file affect the speed of the Neo4j graph database?
To import CSV files into Neo4j we currently have to keep them in Neo4j's local import directory; can we instead load CSV files stored on our own local drive?
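To make the CSV question concrete, the kind of import we are asking about would look roughly like the Cypher below (the file name, label, and batch size are placeholders; the batched-transaction syntax assumes a recent Neo4j version):

```python
# Sketch of a LOAD CSV import, kept as a Cypher string so it can be
# sent via py2neo or the Neo4j browser. 'nodes.csv' and the Person
# label are hypothetical placeholders.
LOAD_CSV_QUERY = """
LOAD CSV WITH HEADERS FROM 'file:///nodes.csv' AS row
CALL {
    WITH row
    CREATE (n:Person)
    SET n = row
} IN TRANSACTIONS OF 10000 ROWS
"""
```

As we understand it, file:/// URLs are by default resolved only inside the server's configured import directory, so loading from an arbitrary local drive would require changing the server configuration or serving the file over http(s):// instead; confirmation of this would be appreciated.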
Thank you in advance.