Using distributed computing to write to Neo4j

sanna_aizad · June 9, 2022, 10:56am

I have 3.2+ million nodes I want to write to neo4j and almost twice as many relationships. I am using Cypher sessions and batches to create these nodes and relationships but it is taking *a lot* of time.

What I think would be a good solution is to distribute these nodes and relationships using Apache Spark (or GraphX) and write to neo4j from there. But is this something which can be done? Ideally I would like to use neo4j as my storage solution, but is it possible to make multiple connections to Neo4j at a given time to write into?

busymo16 · June 17, 2022, 6:14am

Hi @sanna_aizad

The number of nodes and relationships you want to create is not large, and creating them should not take more than a couple of minutes, depending on your configuration. If you can share more details about your heap memory and page cache size, as well as your model and what your query looks like, I can help you more.

Things that can be done to improve writing to the db:

Index the matching/merging property
Use apoc (maybe apoc.periodic.iterate('query', 'query', {batchSize: 10000}))

For nodes, apoc can run in parallel by adding parallel:true, but for relationships you will probably run into locks, which would slow down the write operations. You can use a more sophisticated strategy if you want, but I do not see any benefit other than complicating your life.
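The suggestions above can be sketched in Python as follows. This is only a minimal sketch under stated assumptions: the `Person` label, the `KNOWS` relationship type, the CSV file names and their columns are all hypothetical, and `session` is assumed to be a Neo4j Python driver session (anything with a compatible `run(query, parameters)` method works). The index syntax assumes a recent Neo4j version, and APOC must be installed on the server.

```python
# Hypothetical data model: Person nodes keyed by `id`, KNOWS relationships.
INDEX_QUERY = "CREATE INDEX person_id IF NOT EXISTS FOR (p:Person) ON (p.id)"

# Hypothetical source files; adjust to wherever the data actually lives.
NODE_SOURCE = "LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row RETURN row"
NODE_ACTION = "CREATE (:Person {id: row.id, name: row.name})"

REL_SOURCE = "LOAD CSV WITH HEADERS FROM 'file:///knows.csv' AS row RETURN row"
REL_ACTION = (
    "MATCH (a:Person {id: row.from_id}), (b:Person {id: row.to_id}) "
    "CREATE (a)-[:KNOWS]->(b)"
)


def build_iterate_call(source_query, action_query, batch_size=10000, parallel=False):
    """Return (cypher, params) for a CALL apoc.periodic.iterate(...) statement."""
    cypher = (
        "CALL apoc.periodic.iterate($source, $action, "
        "{batchSize: $batchSize, parallel: $parallel}) "
        "YIELD batches, total, errorMessages "
        "RETURN batches, total, errorMessages"
    )
    params = {
        "source": source_query,
        "action": action_query,
        "batchSize": batch_size,
        "parallel": parallel,
    }
    return cypher, params


def load_nodes_then_relationships(session):
    """Index first, then nodes in parallel, then relationships sequentially."""
    # Index the property that the relationship MATCH looks up.
    session.run(INDEX_QUERY)

    # Node creation does not contend for locks, so parallel batches are fine.
    node_cypher, node_params = build_iterate_call(NODE_SOURCE, NODE_ACTION, parallel=True)
    session.run(node_cypher, node_params)

    # Relationship creation locks both end nodes, so keep it sequential.
    rel_cypher, rel_params = build_iterate_call(REL_SOURCE, REL_ACTION, parallel=False)
    session.run(rel_cypher, rel_params)
```

With the official driver this would be called as `load_nodes_then_relationships(session)` inside a `with driver.session() as session:` block. `session.run` is used on purpose, since `apoc.periodic.iterate` manages its own transactions per batch.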

sanna_aizad · June 20, 2022, 12:52pm

I meant to write 3 billion nodes. Sorry.

busymo16 · June 22, 2022, 2:13pm

Hi @sanna_aizad,

You can try creating the nodes first with apoc (maybe apoc.periodic.iterate('query', 'query', {batchSize: 10000, parallel: true})). Later on, you can create the relationships, but with parallel:false, since relationships are prone to locks and creating them in parallel might slow the process down. This might still need some time to process all the data. If you have indexes, use CREATE instead of MERGE, and have a large enough page cache, it should work fine. Another strategy is to partition the graph load into disjoint parts, so that parts which are independent of each other can be loaded at the same time, and then to create the remaining relationships between the parts at the end. Other than that, there is not much left to do on the db side.
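The partitioning idea can be sketched in plain Python like this. It is an illustration only: the function name, the `(start_id, end_id)` tuple shape and the `partition_of` callable are assumptions, and how to pick a good partition key depends entirely on your data. Relationships whose two end nodes fall in the same part never touch nodes of another part, so the parts can be written over separate connections without waiting on each other's locks. The relationships that cross parts are collected for a final sequential pass.

```python
from collections import defaultdict


def split_relationships(relationships, partition_of):
    """Split (start_id, end_id) pairs into per-partition lists and a cross list.

    partition_of is a hypothetical callable mapping a node id to a partition
    label; relationships inside one partition share no nodes with the others.
    """
    internal = defaultdict(list)
    cross = []
    for start_id, end_id in relationships:
        part = partition_of(start_id)
        if part == partition_of(end_id):
            # Both end nodes live in the same part: safe to load with that part.
            internal[part].append((start_id, end_id))
        else:
            # Spans two parts: defer to the final sequential pass.
            cross.append((start_id, end_id))
    return dict(internal), cross


# Toy example: the id prefix ("eu", "us") stands in for a real partition key.
rels = [("eu-1", "eu-2"), ("us-1", "us-2"), ("eu-1", "us-1")]
internal, cross = split_relationships(rels, lambda node_id: node_id.split("-")[0])
# internal == {"eu": [("eu-1", "eu-2")], "us": [("us-1", "us-2")]}
# cross == [("eu-1", "us-1")]
```

This only pays off when the partition key keeps most relationships inside a part. A random or hash-based split would push most of them into the cross list, which then has to be written sequentially anyway.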

I hope that my answer gave you some hints on what you might do.
