Data structure question (De-duplication, sort of...)

henry.macafee · September 24, 2018, 8:25pm

Greetings all!

I have been struggling with what might be considered deduplication, except it's a matter of deleting similar nodes, rather than precise duplicates. Let me try and explain:

I have (:Author) nodes with [:WROTE] relationships to (:Book) nodes. Each (:Book) node has a unique ID property, as well as a varying number of relationships to (:Topic) nodes. However, some Books are duplicated across several nodes. The duplicates share: 1. the Author node which :WROTE them, and 2. the 'title' property.

What I wish to do is keep a single node per book, that is, one per unique 'title' property linked to the Author who wrote it, choosing the node with the MAXIMUM number of relationships to (:Topic) nodes. Essentially, I want to thin the database by purging "duplicates" with fewer Topic links. Is this possible? Easy?

Thank you,
Henry

mike_r_black · September 24, 2018, 8:47pm

To make sure I understand your data model correctly, this is your model:

(:Author)-[:WROTE]->(:Book)-[]->(:Topic)

Then you have some duplicates on the book nodes. There's a unique id for the books, but the title would be the natural key, and that's how you're determining whether a book has a duplicate? You want to merge the duplicates, assuming the book with the most relationships is the one you want to keep?

Have you looked at the APOC merge procedures?

Do you really want to keep only the book node with the most relationships, or merge all the relationships onto a single node? I would think the latter, because then you can combine all the work that was done to assign topics to books onto the single book node. If the former, then write your query to collect the duplicates and unwind through them to delete.
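
For the "keep only the best node" option, a minimal sketch of the collect-and-unwind approach might look like the query below. It assumes duplicates are defined as Books with the same 'title' written by the same Author, and it matches any outgoing relationship to a (:Topic) because the relationship type is not given in this thread. Ties between equally well-linked duplicates are broken arbitrarily. Try it on a copy of the data first.

```cypher
// Count Topic links per Book (Books with no Topics get a count of 0)
MATCH (a:Author)-[:WROTE]->(b:Book)
OPTIONAL MATCH (b)-->(t:Topic)
WITH a, b, count(DISTINCT t) AS topicCount
// Sort so the best-linked Book comes first in each collected list
ORDER BY topicCount DESC
WITH a, b.title AS title, collect(b) AS books
// Only groups that actually contain duplicates
WHERE size(books) > 1
// Keep the head of the list, delete the rest along with their relationships
UNWIND tail(books) AS duplicate
DETACH DELETE duplicate
```

Replacing the last two lines with `RETURN title, size(books) AS copies` gives a dry run that lists the affected titles without deleting anything.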

henry.macafee · September 24, 2018, 8:51pm

You have the structure correct, yes. I think merging nodes would be a better solution, yes. Would that not create duplicate relationships? I'm not too familiar with the APOC merge procedures.

mike_r_black · September 24, 2018, 9:18pm

Yes, merging nodes will repoint the relationships from the node going away to the node that is staying. But once everything is consolidated, you can then do a merge-relationship cleanup. Here's an older Stack Overflow post with some sample code.
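
As a rough sketch of the merge option, assuming the APOC library is installed and that duplicates are Books with the same 'title' under the same Author, something like the following could work. `apoc.refactor.mergeNodes` keeps the first node in the list, so the list is ordered with the best-linked Book first. The `properties: 'discard'` setting is an assumption here, chosen so the surviving node keeps its own property values (including its unique ID). The `mergeRels: true` option asks APOC to combine relationships of the same type between the same nodes, which may cover much of the cleanup step; check this against your APOC version.

```cypher
// Count Topic links per Book so the best-linked one can be kept
MATCH (a:Author)-[:WROTE]->(b:Book)
OPTIONAL MATCH (b)-->(t:Topic)
WITH a, b, count(DISTINCT t) AS topicCount
ORDER BY topicCount DESC
// Group duplicates by Author and title; first element is the node that stays
WITH a, b.title AS title, collect(b) AS books
WHERE size(books) > 1
// Merge the rest into the first node, repointing their relationships
CALL apoc.refactor.mergeNodes(books, {properties: 'discard', mergeRels: true})
YIELD node
RETURN node.title AS mergedTitle
```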
