
Using distributed computing to write to Neo4j

Node Link

I have more than 3.2 million nodes to write to Neo4j and almost twice as many relationships. I am using Cypher sessions and batches to create these nodes and relationships, but it is taking *a lot* of time.

I think a good solution would be to distribute these nodes and relationships using Apache Spark (or GraphX) and write to Neo4j from there. Can this be done? Ideally I would like to use Neo4j as my storage solution, but is it possible to open multiple connections to Neo4j at the same time and write through all of them?



Hi @sanna_aizad 

The number of nodes and relationships that you want to create is not large, and it should not take more than a couple of minutes to create them, depending on your configuration. If you can share more details about the heap memory and page cache size, as well as your model and what your query looks like, I will be able to help you more.

Things that can be done to improve writing to the db:

  1. Index the matching/merging property
  2. Use APOC (for example apoc.periodic.iterate('query', 'query', {batchSize: 10000}))

For nodes, the APOC call can run in parallel by adding parallel:true, but for relationships you will probably run into locks, which would slow down the write operations. If you want to use a more sophisticated strategy, you can do so, but I do not see any benefit other than complicating your life.
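The two suggestions above can be sketched roughly as follows, assuming the official neo4j Python driver, an installed APOC plugin, and Neo4j 4.x index syntax. The Person label, the id and name properties, the KNOWS relationship, and the CSV file names are all placeholders that are not from this thread; replace them with your own model.

```python
# Assumed names throughout: Person, id, name, KNOWS, people.csv and knows.csv
# are placeholders for your own model and files.
INDEX_QUERY = "CREATE INDEX FOR (p:Person) ON (p.id)"  # Neo4j 4.x+ syntax

# The two inner queries are passed as parameters, so no string escaping is needed.
ITERATE_CALL = (
    "CALL apoc.periodic.iterate($source, $action, "
    "{batchSize: $batchSize, parallel: $parallel})"
)


def iterate_params(source, action, batch_size=10000, parallel=False):
    """Build the parameter map for one apoc.periodic.iterate call."""
    return {"source": source, "action": action,
            "batchSize": batch_size, "parallel": parallel}


# Nodes: parallel batches, as suggested above.
NODE_STEP = iterate_params(
    "LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row RETURN row",
    "CREATE (:Person {id: row.id, name: row.name})",
    parallel=True,
)
# Relationships: sequential batches to avoid lock contention on shared nodes.
REL_STEP = iterate_params(
    "LOAD CSV WITH HEADERS FROM 'file:///knows.csv' AS row RETURN row",
    "MATCH (a:Person {id: row.src}), (b:Person {id: row.dst}) "
    "CREATE (a)-[:KNOWS]->(b)",
    parallel=False,
)


def run_load(uri, user, password):
    """Create the index, then load nodes, then load relationships."""
    from neo4j import GraphDatabase  # official Python driver, imported lazily

    driver = GraphDatabase.driver(uri, auth=(user, password))
    try:
        with driver.session() as session:
            session.run(INDEX_QUERY).consume()
            # Wait until the index is online before it is used for matching.
            session.run("CALL db.awaitIndexes()").consume()
            session.run(ITERATE_CALL, NODE_STEP).consume()
            session.run(ITERATE_CALL, REL_STEP).consume()
    finally:
        driver.close()
```

Calling run_load("bolt://localhost:7687", "neo4j", "password") would run the three steps in order; the URI and credentials here are also just examples.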

Node Link

I meant to write 3 billion nodes. Sorry.

Hi @sanna_aizad,

You can try creating the nodes first using APOC (for example apoc.periodic.iterate('query', 'query', {batchSize: 10000, parallel: true})). Afterwards, you can create the relationships, but with parallel:false, because relationship writes are prone to locks and running them in parallel might slow down the process. This will still need some time to process all the data. If you have indexes, use CREATE instead of MERGE, and have a large enough page cache, it should work fine.

Another strategy is to partition the graph load into disjoint parts: load the parts that are independent of each other together, and then create the remaining relationships between the parts at the end. Other than that, there is not much left to do on the database side.
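As a rough illustration of the partitioning idea, here is a minimal in-memory sketch. The function name split_relationships and the part_of callback are hypothetical; for billions of rows the same split would have to run in a distributed job (for example in Spark) rather than in a single Python process.

```python
from collections import defaultdict


def split_relationships(rels, part_of):
    """Split (start_id, end_id) pairs into per-partition and crossing groups.

    part_of is a caller-supplied function mapping a node id to a partition key.
    Relationships whose endpoints share a partition touch a node set disjoint
    from every other partition, so those groups can be loaded side by side.
    Crossing relationships are kept apart to be created last.
    """
    internal = defaultdict(list)
    crossing = []
    for start, end in rels:
        p_start, p_end = part_of(start), part_of(end)
        if p_start == p_end:
            internal[p_start].append((start, end))
        else:
            crossing.append((start, end))
    return dict(internal), crossing


# Example with a toy partitioning by parity of the node id.
internal, crossing = split_relationships(
    [(1, 3), (2, 4), (1, 2), (5, 7)],
    part_of=lambda node_id: node_id % 2,
)
```

Each list in internal could then be written over its own connection, and crossing written in one sequential pass at the end. How well this works depends on choosing a part_of that keeps most relationships inside a partition.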

I hope that my answer gave you some hints on what you might do.