Neo4j spark connector slow ingestion

zak1234
Node
  • Databricks notebook, 500 GB RAM machine, Spark 3
  • Using Neo4j Connector for Apache Spark 4.1.2
  • Neo4j on an 8 vCPU, 32 GiB memory VM
  • Data as Delta files (Parquet)

I tried to ingest my edges and nodes from Delta files into a Neo4j database using the Spark connector, but it gets slower and slower: the first 4 million edges took 1 hour, and it keeps slowing down.

I ingested 130 million nodes in 6 hours. I see that other people ingest their billions of nodes and edges in 1-2 hours; what did I do wrong here?

5 REPLIES

I think you'll need a bit more memory on the neo4j machine.
Did you create the constraints so that the db can look up data efficiently during ingest for creating the connections?

What do you mean by creating constraints? I used:

"schema.optimization.type": "NODE_CONSTRAINTS"


santand84
Node

There are a ton of reasons that can contribute to slowing down the process:

  • Neo4j hardware issues:
    • is the disk fast enough?
    • is there enough RAM?
  • If you reuse the same Spark DataFrame over time and don't cache it, Spark recomputes it each time; the ingestion looks slow, but that's because the same data is being recomputed over and over
  • The batch size is too small or too big
  • The DataFrame has too few or too many partitions
  • If you're using your own Cypher query to ingest the data, is it optimized?
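To make the batch-size/partitioning trade-off concrete, here is a rough sizing helper (the helper name, the 500k rows-per-partition target, and the 5,000 default batch size are illustrative assumptions, not values from the thread):

```python
import math

def plan_ingest(total_rows, rows_per_partition=500_000, batch_size=5_000):
    """Rough sizing sketch: how many Spark partitions to repartition to,
    and how many transactions (batches) each partition will issue.
    More partitions = more concurrent writers hitting Neo4j; bigger
    batches = fewer, heavier transactions needing more heap."""
    partitions = max(1, math.ceil(total_rows / rows_per_partition))
    batches_per_partition = math.ceil(rows_per_partition / batch_size)
    return partitions, batches_per_partition

# For the 130M nodes mentioned above:
# plan_ingest(130_000_000) -> (260, 100)
```

The point is only that both knobs interact: a 32 GiB Neo4j instance can be overwhelmed either by too many concurrent partitions or by transactions that are too large for its heap.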

The first thing to check is the query.log, in order to understand which queries are slow.

@santand84 called it out. We have a graph that is ~32M nodes / 1.7B edges that we load from Apache Spark. We've had to work through quite a number of performance issues on the loading side, mostly tuning the batch size and the partitioning/executor count.

The bigger issue we run into with large loads, where there is significant overlap in relationship/node coverage, is lock contention on nodes from parallel/concurrent transactions.

santand84
Node

@brianmartin the best practice for batch importing the data with Spark is:

  • insert the nodes in parallel, partitioning the data by the node key column (otherwise you'll hit locking issues and can't leverage the parallelism); bear in mind that a high number of partitions can overwhelm the database, so it's not just about throwing enormous parallelism at the ingestion
  • insert all the relationships sequentially, as there is currently no way to truly avoid deadlocks
  • as you said, the batch size is also important, and it depends on the amount of RAM your Neo4j instance has
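To see why partitioning by the node key column avoids lock contention, here is a minimal pure-Python sketch of what hash partitioning guarantees (mimicking something like `df.repartition(n, col("id"))`; the row shape and names are invented):

```python
from collections import defaultdict

def partition_by_key(rows, key, num_partitions):
    """Assign each row to a partition by hashing its node-key column.
    All rows sharing a key land in the same partition, so no two
    concurrent writers ever try to lock the same node."""
    parts = defaultdict(list)
    for row in rows:
        parts[hash(row[key]) % num_partitions].append(row)
    return parts

# 100 rows over 10 distinct node keys, split into 4 partitions:
rows = [{"id": i % 10, "name": f"n{i}"} for i in range(100)]
parts = partition_by_key(rows, "id", 4)
```

Each distinct `id` appears in exactly one partition, which is the property that lets the node writes run in parallel safely; relationships touch two keys at once, which is why they can't be partitioned this way and are written sequentially.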