Hi everyone,
I'm relatively new to Neo4j and am facing a performance issue on a personal project. I'm using Neo4j to insert a large number of interconnected entities. Initially, I used the MERGE keyword to insert both nodes and relationships. This worked fine locally, but in production, where I'm dealing with over a million entities, performance drops significantly. That is understandable to a point, but the wait times seem to grow much faster than the size of the data.
To improve performance, I switched to batching entities and relationships, and I also added uniqueness constraints on the "id" field of each entity type. This gave me excellent results locally. However, once again in production with a much larger dataset, inserting relationships still takes a considerable amount of time.
Is this expected behavior? And are there any recommended ways to further improve performance?
I've attached some Python code snippets related to the insertion logic for reference.
Thanks in advance for the help!
def migrate_entities(self, entity_list=None):
    """
    Migrate entities from sqlite database to neo4j using database structure
    :param entity_list: list of entity concerned by migration
    :return:
    """
    relation_list = []
    if entity_list is None:
        table_list = self.mdb.get_all_table_name()
        entity_list = [e for e in table_list if "_and_" not in e]
        relation_list = [e for e in table_list if "_and_" in e]
    # Create uniqueness constraints for all entity types first on the field "id"
    for entity in entity_list:
        normalized_entity = entity.replace("-", "_")
        logger.info("Creating uniqueness constraint for %s", normalized_entity)
        self.neodb.create_unique_constraint(normalized_entity, "id")
    # Process entities
    for entity in entity_list:
        logger.info("Migrating entity %s", entity)
        self.migrate_table_sqlite_node4j(entity)
        self.entities.append(entity)
    # Process relationships
    for relation in relation_list:
        logger.info("Migrating relationship %s", relation)
        self.migrate_table_sqlite_relation_neo4j(relation)
    # Mark as completed
    if self.checkpoint_manager:
        self.checkpoint_manager.mark_completed()
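
To make the constraint step concrete, here is a minimal sketch of the kind of Cypher statement a helper like create_unique_constraint would need to send. The body of that helper is not shown in this post, so the function name `build_unique_constraint_query` and the constraint naming scheme are hypothetical, and the `FOR ... REQUIRE ... IS UNIQUE` syntax assumes Neo4j 4.4 or later.

```python
import re

# Hypothetical helper for illustration; the real create_unique_constraint is not shown.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_unique_constraint_query(label: str, prop: str = "id") -> str:
    """Return a Cypher statement making `prop` unique for nodes labeled `label`."""
    # Labels and property keys cannot be passed as query parameters, so they
    # are validated first and then interpolated into the statement text.
    for name in (label, prop):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    # Assumed syntax: Neo4j 4.4+ (older versions use ON ... ASSERT instead).
    return (
        f"CREATE CONSTRAINT {label}_{prop}_unique IF NOT EXISTS "
        f"FOR (n:{label}) REQUIRE n.{prop} IS UNIQUE"
    )
```

The constraint, and the index behind it, belongs to one label, so it can presumably only help queries that name that label.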
And regarding relationship insertion:
def create_relationships_batch(self, relationships: List[Dict[str, Any]]) -> None:
    """
    Create multiple relationships in batches
    :param relationships:
        list of dictionaries containing from_id, to_id, rel_type and optional properties
    :return: None
    """
    if not relationships:
        return
    for i in range(0, len(relationships), self.batch_size):
        batch = relationships[i:i + self.batch_size]
        with self.driver.session() as session:
            query = """
            UNWIND $batch AS rel
            MATCH (a {id: rel.from_id})
            MATCH (b {id: rel.to_id})
            CALL apoc.merge.relationship(a, rel.type, {}, {}, b) YIELD rel AS createdRel
            RETURN count(*)
            """
            logger.info("Creating batch of %d relationships", len(batch))
            session.run(query, batch=batch)
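
One variant worth sketching: the MATCH clauses above carry no label, and a per-label uniqueness constraint can only back a lookup that names the label, so the query above likely scans all nodes for each row. The sketch below groups relationships by labels and type and builds one labeled query per group, swapping `apoc.merge.relationship` for a plain `MERGE`. It is only a sketch under assumptions: the `from_label` and `to_label` keys are hypothetical (the dictionaries above only carry ids and a type), and both function names are made up for illustration.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List, Tuple

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

GroupKey = Tuple[str, str, str]  # (from_label, rel_type, to_label)


def group_relationships(
    relationships: List[Dict[str, Any]],
) -> Dict[GroupKey, List[Dict[str, Any]]]:
    """Group rows so that each group shares its labels and relationship type."""
    groups: Dict[GroupKey, List[Dict[str, Any]]] = defaultdict(list)
    for rel in relationships:
        # from_label / to_label are assumed keys, not part of the original data.
        key = (rel["from_label"], rel["type"], rel["to_label"])
        groups[key].append({"from_id": rel["from_id"], "to_id": rel["to_id"]})
    return dict(groups)


def build_relationship_query(from_label: str, rel_type: str, to_label: str) -> str:
    """Build a batch query whose MATCH clauses name the node labels."""
    # Labels and types cannot be parameters, so validate before interpolating.
    for name in (from_label, rel_type, to_label):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        "UNWIND $batch AS rel\n"
        f"MATCH (a:{from_label} {{id: rel.from_id}})\n"
        f"MATCH (b:{to_label} {{id: rel.to_id}})\n"
        f"MERGE (a)-[r:{rel_type}]->(b)\n"
        "RETURN count(r)"
    )
```

Each group's rows could then be sliced by batch_size and sent with session.run(query, batch=chunk), as in the loop above. Running PROFILE on a single batch should show whether the plan uses an index seek or a scan over all nodes.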