Large-scale edge deduplication

I have a graph with a few billion edges of a particular type. Let's say that they all follow the pattern of:
(a:Label1)-[r:REFERS_TO {Reference_Type: "x"}]->(b:Label2)

The Reference_Type can take one of a few possible values and an edge may exist between a and b for each value, but only one edge should exist for a given value of that property.

Ingestion took a long time to run, and afterwards I found some duplicates caused by unexpected issues in the data that were not properly accounted for. What is an efficient way to find and remove duplicate edges?

I am using Neo4j 5.23.0 and have APOC available.

I was thinking something along the lines of this, where I run for each possible value of the Reference_Type:

MATCH (a:Label1)-[r1:REFERS_TO {Reference_Type: "x"}]->(b:Label2)
MATCH (a)-[r2:REFERS_TO {Reference_Type: "x"}]->(b)
WHERE id(r1) < id(r2)
DELETE r2

Is there a better way to do this? I could insert a WITH r2 LIMIT 10000 to batch or call in transactions, but I'm open to any suggestions.
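For illustration, the CALL ... IN TRANSACTIONS variant of the query above could look roughly like this. It is only a sketch: the batch size of 10000 is arbitrary, and the CALL (r2) { ... } scope syntax assumes Neo4j 5.23 or later. DISTINCT is there because an edge with two or more lower-id duplicates shows up as r2 more than once.

```cypher
// Sketch only: delete every duplicate except the one with the lowest id.
MATCH (a:Label1)-[r1:REFERS_TO {Reference_Type: "x"}]->(b:Label2)
MATCH (a)-[r2:REFERS_TO {Reference_Type: "x"}]->(b)
WHERE id(r1) < id(r2)
// An edge with several lower-id siblings matches more than once; pass it on once.
WITH DISTINCT r2
CALL (r2) {
  DELETE r2
} IN TRANSACTIONS OF 10000 ROWS
```

The edge with the lowest id in each group never appears as r2, so it is the one that survives.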

I think the following would be even worse as it has to consider even the unique edges and filter them later.

MATCH (a:Label1)-[r:REFERS_TO {Reference_Type: "x"}]->(b:Label2)
WITH a, b, collect(r) AS rels
WHERE size(rels) > 1
// and then delete all but one of the edges in rels
FOREACH (rel IN tail(rels) | DELETE rel)

Thanks!

NOTE: to avoid inserting duplicates, you probably could have used MERGE instead of CREATE when building the graph.
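As a rough sketch of what that looks like: the `id` lookup property and the `$aId`, `$bId` and `$refType` parameters below are made up for illustration and are not from the original post.

```cypher
// Hypothetical lookup keys and parameters; adapt to the real model.
MATCH (a:Label1 {id: $aId})
MATCH (b:Label2 {id: $bId})
// With both end nodes bound, MERGE creates the relationship only if
// no REFERS_TO with this Reference_Type already exists between them.
MERGE (a)-[:REFERS_TO {Reference_Type: $refType}]->(b)
```

MERGE matches on the whole pattern, so any extra property included in the MERGE pattern that differs between rows will still produce a second edge.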

// collect all the reference types into a list
MATCH ()-[r:REFERS_TO]->()
  WITH collect(DISTINCT r.Reference_Type) AS refTypes
  UNWIND refTypes AS currentRefType
// for each element of that list
  MATCH (a:Label1)-[rel:REFERS_TO]->(b:Label2)
    WHERE rel.Reference_Type = currentRefType
    WITH a, b, currentRefType, COLLECT(rel) AS rels
    WHERE size(rels) > 1
    // keep the first relationship of each group, delete the rest
    FOREACH (r_to_delete IN tail(rels) | DELETE r_to_delete)

I'll leave you to do the batching :)

Thank you. We've resolved the ingest issue and although we were using MERGE, there were other unexpected data issues that led to the extra edges.

With the Cypher you provided, there are so many edges that the batching is the important part. The grouping produces so many results that I run out of available memory. Would wrapping this all in CALL { ... } IN TRANSACTIONS address that?

If you are running out of memory, I would think you probably need to batch the first half and then possibly run the second half inside a CALL ... but you're in a trial-and-error case now.
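One way to keep the grouping small is to batch by source node rather than by relationship, so that each collect() only sees the edges of a single Label1 node. A minimal sketch, untested against this data: it assumes the CALL (a) { ... } scope syntax from Neo4j 5.23, and the batch size of 1000 source nodes is arbitrary.

```cypher
// Outer query streams source nodes; grouping happens per node inside the subquery.
MATCH (a:Label1)
CALL (a) {
  MATCH (a)-[r:REFERS_TO]->(b:Label2)
  // Group this node's edges by target and Reference_Type.
  WITH b, r.Reference_Type AS refType, collect(r) AS rels
  WHERE size(rels) > 1
  // Keep the first edge of each group, delete the others.
  UNWIND tail(rels) AS dup
  DELETE dup
} IN TRANSACTIONS OF 1000 ROWS
```

In Neo4j Browser this needs the :auto prefix, since CALL ... IN TRANSACTIONS only runs in an implicit transaction. Because every group of duplicates shares the same a, no group is split across two batches.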

See if this works. For each pair of nodes, it deletes all but one relationship for each Reference_Type value.

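// Run in an implicit (auto-commit) transaction, e.g. with the :auto prefix in Neo4j Browser
MATCH (a:Label1)-[r:REFERS_TO]->(b:Label2)
// group by node pair and Reference_Type; tail() drops the one relationship to keep
WITH a, b, r.Reference_Type AS refType, tail(collect(r)) AS relsToDelete
WHERE size(relsToDelete) > 0
UNWIND relsToDelete AS relationship
CALL (relationship) {
    DELETE relationship
} IN TRANSACTIONS OF 10000 ROWS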