How to improve relationship insertion performance in a Neo4j database

Hi everyone,
I'm relatively new to Neo4j and currently facing a performance issue while working on a personal project. I'm using Neo4j to insert a large number of interconnected entities. Initially, I used the MERGE keyword to insert both nodes and relationships. This worked fine locally, but in production, where I'm dealing with over a million entities, performance drops significantly; understandably so, as the wait times seem to grow exponentially with the size of the graph.
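
For context, the original logic looked roughly like this (a simplified sketch rather than my exact code; the Entity label and the fields are placeholders):

    # Simplified sketch of the original approach: one MERGE and one round trip
    # per record, with no supporting index (hypothetical "Entity" label).
    # driver: a neo4j.Driver instance; entities: a list of dicts.
    with driver.session() as session:
        for entity in entities:
            session.run(
                "MERGE (n:Entity {id: $id}) SET n += $props",
                id=entity["id"],
                props=entity,
            )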

To improve performance, I switched to batching entities and relationships, and I also added uniqueness constraints on the "id" field of each entity type. This gave me excellent results locally. However, once again in production with a much larger dataset, inserting relationships still takes a considerable amount of time.

Is this expected behavior? And are there any recommended ways to further improve performance?
I've attached some Python code snippets related to the insertion logic for reference.

Thanks in advance for the help!

    def migrate_entities(self, entity_list=None):
        """
        Migrate entities from sqlite database to neo4j using database structure
        :param entity_list: list of entity concerned by migration
        :return:
        """
        relation_list = []
        if entity_list is None:
            table_list = self.mdb.get_all_table_name()
            entity_list = [e for e in table_list if "_and_" not in e]
            relation_list = [e for e in table_list if "_and_" in e]

        # Create uniqueness constraints for all entity types first on the field "id"
        for entity in entity_list:
            normalized_entity = entity.replace("-", "_")
            logger.info("Creating uniqueness constraint for %s", normalized_entity)
            self.neodb.create_unique_constraint(normalized_entity, "id")

        # Process entities
        for entity in entity_list:
            logger.info("Migrating entity %s", entity)
            self.migrate_table_sqlite_node4j(entity)
            self.entities.append(entity)

        # Process relationships
        for relation in relation_list:
            logger.info("Migrating relationship %s", relation)
            self.migrate_table_sqlite_relation_neo4j(relation)

        # Mark as completed
        if self.checkpoint_manager:
            self.checkpoint_manager.mark_completed()
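
For reference, create_unique_constraint essentially runs a statement like this (simplified; assuming Neo4j 5.x syntax):

    def create_unique_constraint(self, label: str, prop: str) -> None:
        # IF NOT EXISTS makes reruns idempotent, and the uniqueness
        # constraint also creates the backing index on (label, prop).
        with self.driver.session() as session:
            session.run(
                f"CREATE CONSTRAINT {label}_{prop}_unique IF NOT EXISTS "
                f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
            )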

And regarding relationship insertion:

    def create_relationships_batch(self, relationships: List[Dict[str, Any]]) -> None:
        """
        Create multiple relationships in batches
        :param relationships:
            list of dictionaries containing from_id, to_id, type and optional properties
        :return: None
        """
        if not relationships:
            return

        for i in range(0, len(relationships), self.batch_size):
            batch = relationships[i:i + self.batch_size]
            with self.driver.session() as session:
                query = """
                        UNWIND $batch AS rel
                        MATCH (a {id: rel.from_id})
                        MATCH (b {id: rel.to_id})
                        CALL apoc.merge.relationship(a, rel.type, {}, {}, b) YIELD rel AS createdRel
                        RETURN count(*)
                        """
                logger.info("Creating batch of %d relationships", len(batch))
                session.run(query, batch=batch)

• You are going back and forth between your client and the server.
• You are calculating count(*) after every batch and sending it back to the client as well.
• If it is exponential in queries, it is also exponential in trips to the client (are those trips even necessary?).

You shouldn't need the APOC procedure any longer, as a recent version of Neo4j allows you to set node labels and relationship types with a variable.

I would also use a CALL subquery IN TRANSACTIONS to batch the operations server-side.
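
Something along these lines (a rough sketch: the Entity label is a placeholder for your real labels, and the $(rel.type) dynamic-type syntax needs Neo4j 5.26 or later):

    def create_relationships(self, relationships):
        # Single request to the server; batching happens server-side via
        # CALL ... IN TRANSACTIONS instead of in a client-side Python loop.
        # Assumes each dict carries from_id, to_id, type and a 'properties' map.
        query = """
        UNWIND $rels AS rel
        CALL {
            WITH rel
            MATCH (a:Entity {id: rel.from_id})
            MATCH (b:Entity {id: rel.to_id})
            CREATE (a)-[r:$(rel.type)]->(b)
            SET r += rel.properties
        } IN TRANSACTIONS OF 1000 ROWS
        """
        # CALL ... IN TRANSACTIONS must run in an implicit (auto-commit)
        # transaction, which session.run provides.
        with self.driver.session() as session:
            session.run(query, rels=relationships)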

Thank you for your reply,
I'm not really sure what you're telling me here. I've looked at the documentation and implemented this but I'm getting quite similar execution times. Do you have any other examples/advice?

    def create_relationships_batch(self, batch: List[Dict[str, Any]], batch_size: int = 500) -> None:
        """
        Create multiple relationships from a batch using native Cypher (no APOC),
        grouped by relationship type to allow dynamic typing without APOC.
        """
        from collections import defaultdict

        # Group the relationships by type so the type can be inlined in the query string
        grouped = defaultdict(list)
        for item in batch:
            grouped[item["type"]].append(item)

        with self.driver.session() as session:
            for rel_type, rels in grouped.items():
                for i in range(0, len(rels), batch_size):
                    sub_batch = rels[i:i + batch_size]
                    query = f"""
                    UNWIND $batch AS rel
                    CALL {{
                        WITH rel
                        MATCH (a {{id: rel.from_id}})
                        MATCH (b {{id: rel.to_id}})
                        CREATE (a)-[r:{rel_type}]->(b)
                        SET r += rel.properties
                    }}
                    """
                    processed_batch = []
                    for rel in sub_batch:
                        processed_batch.append({
                            "from_id": rel["from_id"],
                            "to_id": rel["to_id"],
                            "properties": {k: v for k, v in rel.items() if k not in {"from_id", "to_id", "type"}}
                        })

                    session.run(query, batch=processed_batch)

For each relationship in the $batch array, the query finds the two nodes concerned (from_id and to_id), then creates a relationship of the indicated type (rel.type) between them, without duplicates, using APOC. It then returns the number of relationships processed; I recount that number to make sure that all the batches sent were actually executed.

You're running a script "outside" your database, probably on a separate server in a networked environment.

Which means that your script goes back and forth between the client and the server, and networks are orders of magnitude slower than in-memory access (e.g. DDR5 memory moves data at roughly 64 GB/s, while even a premium 10 Gbps network tops out around 1.25 GB/s if nothing else is happening, and that's before latency and protocol overhead).

I would suggest that you think of your solution a different way, for example:

You could create temporary nodes with lists, and then do a

MATCH -> UNWIND THE RESULT -> UNWIND THE PROPERTY -> CREATE RELATIONSHIP 

(and, if you want, update the temporary node with a label and the count ... so you can see 'progress'). That way everything happens within the server without going back to a client.

Not really, I think I didn't provide enough context in my first post.
I take a dataset in STIX 2.1 format, which I clean up and format in order to add it to a Neo4j database. The specification for this standard defines ‘relationship’ objects that I use to define the relationships between entities. So there's no back and forth between a server and a client, but rather a single conversion.

You go back and forth on each iteration of this loop, not within the query itself:

    for i in range(0, len(rels), batch_size):

Do you have some example code that's a bit more detailed?

I don't know what you will need on your nodes ... but anyway:

    // create some sample data in triples ...
    CREATE (t1:TEMP {
      ids1: [1, 2, 3, 4, 5],
      ids2: [10, 20, 30, 40, 50],
      ids3: [100, 200, 300, 400, 500]
    }),
    (t2:TEMP {
      ids1: [6, 7, 8],
      ids2: [60, 70, 80],
      ids3: [600, 700, 800]
    });

    // then, as a separate statement, process them batch by batch:
    MATCH (t:TEMP)
    WITH t LIMIT 2  // <-- adjust batch size as needed
    UNWIND range(0, size(t.ids1)-1) AS index
    WITH t, index,
         t.ids1[index] AS id1,
         t.ids2[index] AS id2,
         t.ids3[index] AS id3
    WITH t, collect({id1: id1, id2: id2, id3: id3}) AS mappedData
    WITH mappedData
    UNWIND mappedData AS entry
    RETURN entry  // <-- here you replace this RETURN with your CREATE
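    // For example, the RETURN above could become something like this
    // (hypothetical :Entity nodes matched by id, hypothetical RELATED_TO type):
    //   MATCH (a:Entity {id: entry.id1})
    //   MATCH (b:Entity {id: entry.id2})
    //   CREATE (a)-[:RELATED_TO {weight: entry.id3}]->(b)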

Note also that your MATCHes are extremely inefficient, which is killing your batching performance (unless your database is very, very small). You can run EXPLAIN on the query to look at the query plan and see how the query will be executed.

    MATCH (a {id: rel.from_id})
    MATCH (b {id: rel.to_id})

There are no labels present in these MATCH patterns, so the planner is forced to do an AllNodesScan (evaluating every single node in your database) to find each of the two nodes. And this all-nodes lookup happens not just once for a and once for b, but for every row that is processed. So if you are processing 100 rows, with a lookup of a and b per row, that is 200 separate AllNodesScans being performed, and this will only get more expensive as you add new nodes to your database.

At a minimum, for performance, you should know the label of the node you are looking up. That will at least improve the AllNodesScan to a NodeByLabelScan, so each lookup only has to evaluate every node with the given label. That still isn't performant for batching, though.

Ideally, you not only want appropriate labels in your MATCH patterns, but also an index on the label in question and the id property used by queries like these. That way, each lookup performs a NodeIndexSeek (or MultiNodeIndexSeek) and finds exactly the nodes you are looking for per row.
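
For instance, a minimal sketch (the Entity label is an assumption, substitute your real labels; a uniqueness constraint on :Entity(id) would already provide the index):

    # With a label in the pattern and an index on :Entity(id), the plan should
    # show a NodeIndexSeek instead of an AllNodesScan.
    with driver.session() as session:
        summary = session.run(
            "EXPLAIN "
            "MATCH (a:Entity {id: $from_id}) "
            "MATCH (b:Entity {id: $to_id}) "
            "RETURN a, b",
            from_id=1,
            to_id=2,
        ).consume()
        print(summary.plan)  # EXPLAIN returns the plan without executing the query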

@andrew_bowman - if he created an index on id, would it make it slightly more efficient?

@andrew_bowman ... duh, ignore me ... I didn't read the last paragraph :smiley:

This isn’t what I was referring to. Look at this. It takes care of the batching.

Definitely need an index too.