Data structure question (De-duplication, sort of...)


(Henry Macafee) #1

Greetings all!

I have been struggling with what might be considered deduplication, except it's a matter of deleting similar nodes, rather than precise duplicates. Let me try and explain:

I have (:Author) nodes, with [:WROTE] relationships to (:Book) nodes. Each (:Book) node has a unique ID property, as well as a varying number of relationships to (:Topic) nodes. However, I have duplicate nodes for some Books. The duplicates share: 1. the Author node which :WROTE them, and 2. the 'title' property, in many cases across several nodes.

What I wish to do is keep a single (:Book) node per unique 'title', linked to the Author who wrote it, choosing the node with the MAXIMUM number of relationships to (:Topic) nodes. Essentially, this would thin the database by purging "duplicates" with fewer Topic links. Is this possible? Easy?

Thank you,
Henry


(Mike R Black) #2

To make sure I understand your data model correctly, this is your model:

(:Author)-[:WROTE]->(:Book)-[]->(:Topic)

Then you have some duplicates on the book nodes. Each book has a unique id, but the title is the natural key, and that is how you determine whether a book has a duplicate? And you want to merge the duplicates, assuming the book with the most relationships is the one you want to keep?

Have you looked at the APOC merge procedures?

Do you really want to keep only the book node with the most relationships, or merge all the relationships onto a single node? I would think the latter, because then you combine all the work that was done assigning topics to books onto a single book node. If the former, write your query to collect the duplicates, then unwind through them and delete.
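
A minimal sketch of the "collect, unwind, delete" option, assuming duplicates are Book nodes that share both the same Author and the same 'title', and that ties in topic count can be broken arbitrarily. The relationship from Book to Topic is left untyped because its type isn't given here. This is untested against your data, so try it on a copy first:

```cypher
// Count Topic links per book; OPTIONAL MATCH keeps books with zero topics.
MATCH (a:Author)-[:WROTE]->(b:Book)
OPTIONAL MATCH (b)-->(t:Topic)
WITH a, b, count(t) AS topicCount
// Sort so the best-connected book comes first in each collected group.
ORDER BY topicCount DESC
WITH a, b.title AS title, collect(b) AS books
WHERE size(books) > 1
// Keep books[0]; everything after it is a duplicate with fewer (or equal) links.
UNWIND books[1..] AS duplicate
DETACH DELETE duplicate
```

DETACH DELETE removes each duplicate together with its relationships, so any Topic links that exist only on the deleted nodes are lost. That is the main reason to prefer merging instead.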


(Henry Macafee) #3

You have the structure correct, yes. I think merging nodes would be a better solution, yes. Would that not create duplicate relationships? I'm not too familiar with the APOC merge procedures.


(Mike R Black) #4

Yes, merging nodes will repoint the relationships from the node going away to the node that is staying, so you can end up with duplicate relationships. But once everything is consolidated you can then do a merge-relationship cleanup. Here's an older Stack Overflow post with some sample code.
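
For the merge route, something along these lines might work. It is a sketch rather than the code from that post. It assumes the APOC library is installed, that duplicates share both Author and 'title', and that the surviving node should keep its own properties (including its unique ID). Check the config options against the APOC docs for your version:

```cypher
// Group duplicate books per author and title, best-connected book first.
MATCH (a:Author)-[:WROTE]->(b:Book)
OPTIONAL MATCH (b)-->(t:Topic)
WITH a, b, count(t) AS topicCount
ORDER BY topicCount DESC
WITH a, b.title AS title, collect(b) AS books
WHERE size(books) > 1
// The first node in the list survives; the others are merged into it.
// properties: 'discard' keeps the surviving node's own property values.
// mergeRels: true collapses relationships with the same type and endpoints.
CALL apoc.refactor.mergeNodes(books, {properties: 'discard', mergeRels: true})
YIELD node
RETURN a, title, node
```

With mergeRels set to true, the repointed [:WROTE] and Topic relationships that would otherwise be duplicated should be collapsed during the merge, which may make a separate cleanup pass unnecessary.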