Delete duplicate nodes if they have a relationship to the same node

Hi everyone!

I have a Neo4j 4.4.6 Community DB with 87M+ nodes (labels: Date, Event, Source) and 92M+ relationships (types: DATE, SOURCE). I'm looking for a way to query nodes with the label 'Event' that

a) have the exact same property 'title' (I have a full text index) and
b) have a 'DATE' relationship to the same 'Date' node

to then delete all but one of the duplicate 'Event' nodes.

I found different approaches tackling the duplicate node problem, but none that takes relationships into account. Also, during my failed attempts I sometimes ran into memory issues, but I don't know whether the hardware is actually the limiting factor or whether it was just bad Cypher. The DB is hosted on an AWS t3.xlarge.

This is the farthest my attempts went (without the 'DATE' relationship part):

MATCH (a:Event)
CALL {
  WITH a
  MATCH (b:Event)
  WHERE a.title = b.title
  WITH collect(b) AS nodes
  CALL db.index.fulltext.queryNodes("event", a.title) YIELD node
  RETURN a.title, collect(node) + nodes AS nodes
}
RETURN DISTINCT title, nodes

But I did not get around the syntax errors. Any help is much appreciated!

Edit for clarification: This would be two duplicate 'Event' (green) nodes:

Besides the same date, do duplicate events also have to have the same source?

Whatever solution is developed, I don't think it is going to be very efficient with this many nodes/relationships. Are you doing this to remediate an existing defect? If not, I would also work on eliminating the source of the duplicates before they are created (assuming that is possible in your environment).

I went ahead and assumed you wanted to delete events that had the same source, date, and title. The following query seems to work for this.

match (d:Date)<-[:DATE]-(e:Event)-[:SOURCE]->(s:Source)
with id(d) as d, id(s) as s, collect({title: e.title, event: e}) as events, collect(distinct e.title) as titles
unwind titles as title
with events, tail([i in events where i.title = title]) as events_to_delete
unwind events_to_delete as delete_event
detach delete delete_event.event

The following is the data before. Notice two sets of nodes have the same date, source, and title ('b' and 'c').

After:

I have no idea how this will perform in your environment. I believe the collect methods will have to be performed eagerly, so I would imagine all the data will have to be loaded into memory in order to group and collect the nodes.
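
If you want to check that before running anything on the full DB, you can prefix the query with EXPLAIN and look for Eager / EagerAggregation operators in the plan; EXPLAIN only compiles the plan and touches no data. A sketch:

explain
match (d:Date)<-[:DATE]-(e:Event)-[:SOURCE]->(s:Source)
with id(d) as d, id(s) as s, collect({title: e.title, event: e}) as events, collect(distinct e.title) as titles
unwind titles as title
with events, tail([i in events where i.title = title]) as events_to_delete
unwind events_to_delete as delete_event
detach delete delete_event.event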

Just a note, I suggest you test on a subset of data before you try your entire dataset. You can do so by adding constraints to the date or source nodes, so the query is restricted to just that small set.

I realized I could also group on the event title to greatly simplify the code: with the date id, source id, and title together forming the grouping key, tail(collect(e)) keeps the first event of each group and collects the rest for deletion. The following seems to work too:

match (d:Date)<-[:DATE]-(e:Event)-[:SOURCE]->(s:Source)
with id(d) as d, id(s) as s, e.title as title, tail(collect(e)) as events_to_delete
unwind events_to_delete as delete_event
detach delete delete_event

Thanks a lot, @glilienfield!

First, to answer your questions: Yes, the duplicate events would have the same source, so you assumed correctly! Also, the source of the duplicates was an error in the script populating the database from a specific data source. That has since been fixed, but the 'damage' was done. As for the efficiency of the query, I'm looking to run this once now while the script is stopped, and maybe a second time, just to be sure, when the whole source has been transferred.

So I'm not too concerned with the runtime of the query, but the memory (or lack thereof) could become a problem.

I will do some testing as you suggested. Thanks again!

I can think of a couple options:

  1. Partition the query by source: is there a fixed number of known sources? We can alter the query for specific sources and then run them sequentially, so the memory requirements and execution times are smaller.
  2. Partition the query by date ranges. Same justification as in 1; see the sketch after this list.
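
For option 2, a rough sketch of what a date partition could look like, assuming the Date nodes carry something like a 'year' property (the property name is just a placeholder for whatever your model actually stores); you would run it once per year or range:

match (d:Date)
where d.year = 2020   // hypothetical property; substitute whatever identifies a slice of your Date nodes
match (d)<-[:DATE]-(e:Event)-[:SOURCE]->(s:Source)
with id(d) as d, id(s) as s, e.title as title, tail(collect(e)) as events_to_delete
unwind events_to_delete as delete_event
detach delete delete_event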

I just reread your post. You mentioned the source of the error was one data source. You can alter the query to be limited to that source or sources. Below is an example.

match (s:Source)
where s.id = 1000
match (d:Date)<-[:DATE]-(e:Event)-[:SOURCE]->(s)
with id(d) as d, id(s) as s, e.title as title, tail(collect(e)) as events_to_delete
unwind events_to_delete as delete_event
detach delete delete_event

Sorry, I think my use of 'source' was unclear. The label 'Source' I used for the nodes is the actual source, i.e. the article, text, etc. that the 'Event' is described in. When I talked about 'a specific data source', I meant a large file that is processed to populate the DB, not a source in the sense of the label 'Source'.

I hope this clears it up and sorry for the confusion!

As for the query: I tested it on a 200K+ node local copy of the DB, where it ran in a few ms. So I went on to run it on the full DB (I replaced detach delete delete_event with return delete_event just to be safe), where it has now been running for around 4 hours.

As for partitioning the query: Could the count of 'Source' nodes be a number to split by? I mean, do the nodes actually have an order, like nodes 0–50,000, then the next 50K–100K, etc.?

Each node does have a unique id, but it can be reused. We could use it here if no new Source nodes are being added during query execution. If that can't be controlled, do you have a self-assigned unique id you give to each Source node?

Assuming the use of a Source node's Neo4j-assigned id, the following query seems to work. It breaks the updates into id ranges whose width is set by the 'interval' value. The number of ranges is set by the upper bound passed to the 'range' function. Set it to a number such that the upper bound times the interval is larger than your greatest Source node id. You can determine that with the following query:

match(s:Source)
return max(id(s))

I wrapped the query in a 'call { } in transactions' subquery so it commits the results in batches, thus not requiring as much memory.

:auto 
call {
    with 10000 as interval, range(0, 150) as indexes
    unwind indexes as index
    with index, interval*index as min, interval*(index+1) as max
    match (s:Source)
    where min <= id(s) < max
    match (d:Date)<-[:DATE]-(e:Event)-[:SOURCE]->(s)
    with id(d) as d, id(s) as s, e.title as title, tail(collect(e)) as events_to_delete
    unwind events_to_delete as delete_event
    detach delete delete_event
} in transactions of 10000 rows

I am assuming you are running this in Neo4j Browser. I don't believe 'call ... in transactions' works in cypher-shell, since cypher-shell uses explicit transactions and 'call ... in transactions' only works with implicit transactions. We had a thread a few weeks ago on this.

TEST THOROUGHLY FIRST... I like your approach of replacing detach delete with a return. Note you can return id(delete_event) or some other property to reduce the output clutter.

The 'unbatched' query ran for around 17 hours, only to throw a memory error.

Currently there are no new sources being added, so your approach works!

I altered the query again to return the duplicates instead of deleting them, but because of a syntax error I ended up doing it this way:

:auto 
call {
    with 10000 as interval, range(0, 125) as indexes
    unwind indexes as index
    with index, interval*index as min, interval*(index+1) as max
    match (s:Source)
    where min <= id(s) < max
    match (d:Date)<-[:DATE]-(e:Event)-[:SOURCE]->(s)
    with id(d) as d, id(s) as s, e.title as title, tail(collect(e)) as events_to_delete
    unwind events_to_delete as delete_event
    return delete_event
} in transactions of 10000 rows
return delete_event

I ran this exact query with 125 max range, because my thinking was: this will stop after 1.25M nodes and thus be completed earlier. Is this correct? It ran for approx. 1.5 hours and returned exactly 38k duplicates, which I think is an oddly round number at this scale.

The max id for Source is 87176930 so it would actually have to be 8800 max range to check all Source nodes, right?

Your implementation of 'return' is correct. You need to return from the call subquery, and then return it in the outer query.

A max of 8800 would work.

Keep in mind that the Source node ids are not sequential, as node ids are shared across all nodes.
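
If you want to see how many Source nodes actually fall inside the id range a given run covers, a quick read-only check like this should tell you (1,250,000 = 125 batches times the 10,000 interval):

match (s:Source)
where 0 <= id(s) < 1250000
return count(s)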

Instead of returning 'delete_event', which is the entire node, maybe return only id(delete_event). This will reduce the amount of data returned, which may make it run a little faster during your trials.
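
Putting those pieces together (range up to 8800, a return inside the subquery plus one in the outer query, and only the id being returned), the dry run could look something like this:

:auto 
call {
    with 10000 as interval, range(0, 8800) as indexes
    unwind indexes as index
    with index, interval*index as min, interval*(index+1) as max
    match (s:Source)
    where min <= id(s) < max
    match (d:Date)<-[:DATE]-(e:Event)-[:SOURCE]->(s)
    with id(d) as d, id(s) as s, e.title as title, tail(collect(e)) as events_to_delete
    unwind events_to_delete as delete_event
    return id(delete_event) as event_id
} in transactions of 10000 rows
return event_id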

Have you looked at apoc.periodic.iterate? Using this procedure has radically decreased query run time for us, particularly when performing "global" queries that update the graph.
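
For reference, a sketch of how the same deduplication could be driven by apoc.periodic.iterate, assuming APOC is installed. The grouping in the driving statement still has to be computed up front, but the deletes are committed in batches of 'batchSize' rows; parallel stays false because the deletes touch related nodes:

call apoc.periodic.iterate(
  "match (d:Date)<-[:DATE]-(e:Event)-[:SOURCE]->(s:Source)
   with id(d) as d, id(s) as s, e.title as title, tail(collect(e)) as events_to_delete
   unwind events_to_delete as delete_event
   return delete_event",
  "detach delete delete_event",
  {batchSize: 10000, parallel: false})
yield batches, total
return batches, total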