How to improve relationship insertion performance in a Neo4j database

Hi everyone,
I'm relatively new to Neo4j and currently facing a performance issue while working on a personal project. I'm using Neo4j to insert a large number of interconnected entities. Initially, I used the MERGE keyword to insert both nodes and relationships. This worked fine locally, but in production, where I'm dealing with over a million entities, performance drops significantly; understandably so, as the wait times seem to grow exponentially with the size of the graph.
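
For context, the original logic looked roughly like this (a simplified sketch rather than my exact code; the Entity label and the fields are placeholders):

    # Simplified sketch of the original approach: one MERGE and one round trip
    # per record, with no supporting index (hypothetical "Entity" label).
    # driver: a neo4j.Driver instance; entities: a list of dicts.
    with driver.session() as session:
        for entity in entities:
            session.run(
                "MERGE (n:Entity {id: $id}) SET n += $props",
                id=entity["id"],
                props=entity,
            )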

To improve performance, I switched to batching entities and relationships, and I also added uniqueness constraints on the "id" field of each entity type. This gave me excellent results locally. However, once again in production with a much larger dataset, inserting relationships still takes a considerable amount of time.

Is this expected behavior? And are there any recommended ways to further improve performance?
I've attached some Python code snippets related to the insertion logic for reference.

Thanks in advance for the help!

    def migrate_entities(self, entity_list=None):
        """
        Migrate entities from sqlite database to neo4j using database structure
        :param entity_list: list of entity concerned by migration
        :return:
        """
        relation_list = []
        if entity_list is None:
            table_list = self.mdb.get_all_table_name()
            entity_list = [e for e in table_list if "_and_" not in e]
            relation_list = [e for e in table_list if "_and_" in e]

        # Create uniqueness constraints for all entity types first on the field "id"
        for entity in entity_list:
            normalized_entity = entity.replace("-", "_")
            logger.info("Creating uniqueness constraint for %s", normalized_entity)
            self.neodb.create_unique_constraint(normalized_entity, "id")

        # Process entities
        for entity in entity_list:
            logger.info("Migrating entity %s", entity)
            self.migrate_table_sqlite_node4j(entity)
            self.entities.append(entity)

        # Process relationships
        for relation in relation_list:
            logger.info("Migrating relationship %s", relation)
            self.migrate_table_sqlite_relation_neo4j(relation)

        # Mark as completed
        if self.checkpoint_manager:
            self.checkpoint_manager.mark_completed()
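
For reference, create_unique_constraint essentially runs a statement like this (simplified; assuming Neo4j 5.x syntax):

    def create_unique_constraint(self, label: str, prop: str) -> None:
        # IF NOT EXISTS makes reruns idempotent, and the uniqueness
        # constraint also creates the backing index on (label, prop).
        with self.driver.session() as session:
            session.run(
                f"CREATE CONSTRAINT {label}_{prop}_unique IF NOT EXISTS "
                f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
            )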

And regarding relationship insertion:

    def create_relationships_batch(self, relationships: List[Dict[str, Any]]) -> None:
        """
        Create multiple relationships in batches
        :param relationships:
            list of dictionaries containing from_id, to_id, type and optional properties
        :return: None
        """
        if not relationships:
            return

        for i in range(0, len(relationships), self.batch_size):
            batch = relationships[i:i + self.batch_size]
            with self.driver.session() as session:
                query = """
                        UNWIND $batch AS rel
                        MATCH (a {id: rel.from_id})
                        MATCH (b {id: rel.to_id})
                        CALL apoc.merge.relationship(a, rel.type, {}, {}, b) YIELD rel AS createdRel
                        RETURN count(*)
                        """
                logger.info("Creating batch of %d relationships", len(batch))
                session.run(query, batch=batch)

• You are going back and forth between your client and the server.
• You are calculating count(*) after every batch and sending it back to the client as well.
• If it is exponential in queries, it is also exponential in trips to the client (are those trips even necessary?).

You shouldn't need the APOC procedure any longer, as a recent version of Neo4j allows you to set node labels and relationship types with a variable.

I would also use a CALL subquery IN TRANSACTIONS to batch the operations server-side.
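
Something along these lines (a rough sketch: the Entity label is a placeholder for your real labels, and the $(rel.type) dynamic-type syntax needs Neo4j 5.26 or later):

    def create_relationships(self, relationships):
        # Single request to the server; batching happens server-side via
        # CALL ... IN TRANSACTIONS instead of in a client-side Python loop.
        # Assumes each dict carries from_id, to_id, type and a 'properties' map.
        query = """
        UNWIND $rels AS rel
        CALL {
            WITH rel
            MATCH (a:Entity {id: rel.from_id})
            MATCH (b:Entity {id: rel.to_id})
            CREATE (a)-[r:$(rel.type)]->(b)
            SET r += rel.properties
        } IN TRANSACTIONS OF 1000 ROWS
        """
        # CALL ... IN TRANSACTIONS must run in an implicit (auto-commit)
        # transaction, which session.run provides.
        with self.driver.session() as session:
            session.run(query, rels=relationships)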

Thank you for your reply,
I'm not really sure what you're telling me here. I've looked at the documentation and implemented this but I'm getting quite similar execution times. Do you have any other examples/advice?

    def create_relationships_batch(self, batch: List[Dict[str, Any]], batch_size: int = 500) -> None:
        """
        Create multiple relationships from a batch using native Cypher (no APOC),
        grouped by relationship type to allow dynamic typing without APOC.
        """
        from collections import defaultdict

        # Group the relationships by type so the type can be inlined in the query string
        grouped = defaultdict(list)
        for item in batch:
            grouped[item["type"]].append(item)

        with self.driver.session() as session:
            for rel_type, rels in grouped.items():
                for i in range(0, len(rels), batch_size):
                    sub_batch = rels[i:i + batch_size]
                    query = f"""
                    UNWIND $batch AS rel
                    CALL {{
                        WITH rel
                        MATCH (a {{id: rel.from_id}})
                        MATCH (b {{id: rel.to_id}})
                        CREATE (a)-[r:{rel_type}]->(b)
                        SET r += rel.properties
                    }}
                    """
                    processed_batch = []
                    for rel in sub_batch:
                        processed_batch.append({
                            "from_id": rel["from_id"],
                            "to_id": rel["to_id"],
                            "properties": {k: v for k, v in rel.items() if k not in {"from_id", "to_id", "type"}}
                        })

                    session.run(query, batch=processed_batch)

For each relationship in the $batch array, the query finds the two nodes concerned (from_id and to_id), then creates a relationship of the indicated type (rel.type) between them, without duplicates, using APOC. It then returns the number of relationships processed; I recount that number to make sure that all the batches sent were actually executed.

You're running a script "outside" your database, probably on a separate server in a networked environment.

Which means that your script goes back and forth between the client and the server, and networks are orders of magnitude slower than in-memory access (e.g. DDR5 memory moves data at roughly 64 GB/s, while even a premium 10 Gbps network tops out around 1.25 GB/s if nothing else is happening, and that's before latency and protocol overhead).

I would suggest that you think of your solution a different way, for example:

You could create temporary nodes with lists, and then do a

MATCH -> UNWIND THE RESULT -> UNWIND THE PROPERTY -> CREATE RELATIONSHIP 

(and, if you want, update the temporary node with a label and the count ... so you can see 'progress'). That way everything happens within the server without going back to a client.

Not really, I think I didn't provide enough context in my first post.
I take a dataset in STIX 2.1 format, which I clean up and format in order to add it to a Neo4j database. The specification for this standard defines ‘relationship’ objects that I use to define the relationships between entities. So there's no back and forth between a server and a client, but rather a single conversion.

You go back and forth on each iteration of this loop, not within the query itself:

    for i in range(0, len(rels), batch_size):

Do you have some example code that's a bit more detailed?

I don't know what you will need on your nodes ... but anyway:

    // create some sample data in triples ...
    CREATE (t1:TEMP {
      ids1: [1, 2, 3, 4, 5],
      ids2: [10, 20, 30, 40, 50],
      ids3: [100, 200, 300, 400, 500]
    }),
    (t2:TEMP {
      ids1: [6, 7, 8],
      ids2: [60, 70, 80],
      ids3: [600, 700, 800]
    });

    // then, as a separate statement, process them batch by batch:
    MATCH (t:TEMP)
    WITH t LIMIT 2  // <-- adjust batch size as needed
    UNWIND range(0, size(t.ids1)-1) AS index
    WITH t, index,
         t.ids1[index] AS id1,
         t.ids2[index] AS id2,
         t.ids3[index] AS id3
    WITH t, collect({id1: id1, id2: id2, id3: id3}) AS mappedData
    WITH mappedData
    UNWIND mappedData AS entry
    RETURN entry  // <-- here you replace this RETURN with your CREATE
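    // For example, the RETURN above could become something like this
    // (hypothetical :Entity nodes matched by id, hypothetical RELATED_TO type):
    //   MATCH (a:Entity {id: entry.id1})
    //   MATCH (b:Entity {id: entry.id2})
    //   CREATE (a)-[:RELATED_TO {weight: entry.id3}]->(b)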

Note also that your MATCHes are extremely inefficient, which is killing your batching performance (unless your database is very, very small). You can run EXPLAIN on the query to look at the query plan and see how the query will be executed.

    MATCH (a {id: rel.from_id})
    MATCH (b {id: rel.to_id})

There are no labels present in these MATCH patterns, so the planner is forced to do an AllNodesScan (evaluating every single node in your database) to find each of the two nodes. And this all-nodes lookup happens not just once for a and once for b, but for every row that is processed. So if you are processing 100 rows, with a lookup of a and b per row, that is 200 separate AllNodesScans being performed, and this will only get more expensive as you add new nodes to your database.

At a minimum, for performance, you should know the label of the node you are looking up. That will at least improve the AllNodesScan to a NodeByLabelScan, so each lookup only has to evaluate every node with the given label. That still isn't performant for batching, though.

Ideally, you not only want appropriate labels in your MATCH patterns, but also an index on the label in question and the id property used by queries like these. That way, each lookup performs a NodeIndexSeek (or MultiNodeIndexSeek) and finds exactly the nodes you are looking for per row.
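
For instance, a minimal sketch (the Entity label is an assumption, substitute your real labels; a uniqueness constraint on :Entity(id) would already provide the index):

    # With a label in the pattern and an index on :Entity(id), the plan should
    # show a NodeIndexSeek instead of an AllNodesScan.
    with driver.session() as session:
        summary = session.run(
            "EXPLAIN "
            "MATCH (a:Entity {id: $from_id}) "
            "MATCH (b:Entity {id: $to_id}) "
            "RETURN a, b",
            from_id=1,
            to_id=2,
        ).consume()
        print(summary.plan)  # EXPLAIN returns the plan without executing the query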

@andrew_bowman - if he created an index on id, would it make it slightly more efficient?

@andrew_bowman ... duh, ignore me ... I didn't read the last paragraph :smiley:

This isn’t what I was referring to. Look at this. It takes care of the batching.

Definitely need an index too.