Large-scale edge deduplication

philip.graff · July 15, 2025, 5:41pm

I have a graph with a few billion edges of a particular type. Let's say that they all follow the pattern of:
(a:Label1)-[r:REFERS_TO {Reference_Type: "x"}]->(b:Label2)

The Reference_Type can take one of a few possible values and an edge may exist between a and b for each value, but only one edge should exist for a given value of that property.

Ingestion took a long time to run, and afterwards I found some duplicates caused by unexpected issues in the data that were not properly accounted for. What is an efficient way to find and remove duplicate edges?

I am using Neo4j 5.23.0 and have APOC available.

I was thinking something along the lines of this, where I run for each possible value of the Reference_Type:

MATCH (a:Label1)-[r1:REFERS_TO {Reference_Type: "x"}]->(b:Label2)
MATCH (a)-[r2:REFERS_TO {Reference_Type: "x"}]->(b)
WHERE id(r1) < id(r2)
DELETE r2

Is there a better way to do this? I could insert a WITH r2 LIMIT 10000 to batch or call in transactions, but I'm open to any suggestions.

I think the following would be even worse as it has to consider even the unique edges and filter them later.

MATCH (a:Label1)-[r:REFERS_TO {Reference_Type: "x"}]->(b:Label2)
WITH a, b, collect(r) AS rels
WHERE size(rels) > 1
// and then delete all but one of the edges in rels
FOREACH (dup IN tail(rels) | DELETE dup)

Thanks!

joshcornejo · July 16, 2025, 9:20am

NOTE - to avoid inserting duplicates, you probably could have used MERGE instead of CREATE when building the graph
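
As a minimal sketch of that MERGE-based ingest step (the `id` property and the `$sourceId`, `$targetId` and `$refType` parameters are hypothetical names used only for illustration):

```cypher
// Hypothetical ingest step: `id`, $sourceId, $targetId and $refType are
// placeholder names, not taken from the original data model.
MATCH (a:Label1 {id: $sourceId})
MATCH (b:Label2 {id: $targetId})
// With both endpoints already bound, MERGE creates the edge only if no
// REFERS_TO edge with this Reference_Type exists between a and b.
MERGE (a)-[:REFERS_TO {Reference_Type: $refType}]->(b)
```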

// collect all the distinct reference types into a list
MATCH ()-[r:REFERS_TO]->()
WITH collect(DISTINCT r.Reference_Type) AS refTypes
UNWIND refTypes AS currentRefType
// for each element of that list, group the edges by (a, b)
MATCH (a:Label1)-[rel:REFERS_TO]->(b:Label2)
WHERE rel.Reference_Type = currentRefType
WITH a, b, currentRefType, collect(rel) AS rels
WHERE size(rels) > 1
// keep the first edge of each group and delete the rest
FOREACH (r_to_delete IN tail(rels) | DELETE r_to_delete)

I'll leave you to do the batching :)

philip.graff · July 16, 2025, 7:06pm

Thank you. We've resolved the ingest issue and although we were using MERGE, there were other unexpected data issues that led to the extra edges.

With the Cypher you provided, there are so many edges that the batching is the important part. Doing the grouping produces so many results that I run out of available memory. Would wrapping this all in CALL { ... } IN TRANSACTIONS address that?

joshcornejo · July 16, 2025, 7:09pm

If you are running out of memory, I would think you probably need to batch the first half and then possibly run the second half inside a CALL { ... } subquery, but you're in a trial-and-error case now.
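
One way to read "batch the first half" is to drive the query from the source nodes, so that each grouping covers only one node's outgoing edges instead of the whole graph. This is a rough sketch, not a tested solution. It assumes Neo4j 5.23 or later (for the `CALL (a) { ... }` scope clause), that no single Label1 node has more outgoing REFERS_TO edges than fit in memory, and an arbitrary batch size of 10000:

```cypher
// Outer scan streams the source nodes; nothing is aggregated here.
MATCH (a:Label1)
CALL (a) {
  // Group only this node's outgoing edges by target and Reference_Type.
  MATCH (a)-[r:REFERS_TO]->(b:Label2)
  WITH b, r.Reference_Type AS refType, collect(r) AS rels
  WHERE size(rels) > 1
  // Keep the first edge of each group and delete the rest.
  FOREACH (dup IN tail(rels) | DELETE dup)
} IN TRANSACTIONS OF 10000 ROWS
```

Here each "row" is one source node, so a transaction commits after 10000 Label1 nodes have been processed, regardless of how many edges were deleted for them.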

glilienfield · July 18, 2025, 12:19pm

See if this works. It deletes all but one relationship for each Reference_Type value between each pair of nodes.

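// Group parallel edges by (a, b, Reference_Type); keep the first, collect the rest
MATCH (a:Label1)-[r:REFERS_TO]->(b:Label2)
WITH a, b, r.Reference_Type AS refType, tail(collect(r)) AS relsToDelete
WHERE size(relsToDelete) > 0
UNWIND relsToDelete AS relationship
// Delete in batches; CALL ... IN TRANSACTIONS needs an implicit (auto-commit)
// transaction, e.g. prefix the query with :auto in Neo4j Browser
CALL (relationship) {
    DELETE relationship
} IN TRANSACTIONS OF 10000 ROWS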
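
After a cleanup pass, a read-only check can confirm whether any duplicate groups remain. This is a sketch under the same assumptions as the thread (labels Label1 and Label2, relationship type REFERS_TO, Neo4j 5.23 or later for the `CALL (a) { ... }` syntax). It counts per source node so that no grouping spans the whole graph:

```cypher
MATCH (a:Label1)
CALL (a) {
  // Count edges per (target, Reference_Type) for this source node only.
  MATCH (a)-[r:REFERS_TO]->(b:Label2)
  WITH b, r.Reference_Type AS refType, count(r) AS edgeCount
  WHERE edgeCount > 1
  // Always yields one row per source node (0 when there are no duplicates).
  RETURN count(*) AS duplicateGroups
}
RETURN sum(duplicateGroups) AS remainingDuplicateGroups
```

A result of 0 means every (a, b, Reference_Type) combination has at most one edge.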
