Matching near-duplicates?

justin · April 25, 2021, 9:16am

Hi all. I'm doing some data cleaning and one issue I've run into is multiple nodes that are nearly identical. For example, there are instances of multiple different Person nodes with the same property values for name, employer, and title, but slightly different values for their LinkedIn address.

I would ideally like to write a query that returns sets of "duplicate" nodes based on shared property values (name, employer, and title), so I can create some evaluation rules and delete near-duplicates in subsequent queries. Any suggestions would be greatly appreciated for how I might go about matching and returning the near-duplicates. Thank you!

geethainfy · April 30, 2021, 6:53pm

MATCH (p:Person)
WITH p.name AS name, collect(p) AS nodes
WHERE size(nodes) > 1
RETURN [n IN nodes | n.name] AS names, size(nodes) AS nodeCount

The WITH p.name grouping key can be extended to include employer and title as well.
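As a rough sketch of that extension, the query below groups on all three properties from the original question. The property names employer and title come from the question; linkedin is an assumed name for the LinkedIn address property and should be replaced with whatever the data actually uses.

```cypher
// Group Person nodes by the three shared properties.
// Each returned row is one set of candidate duplicates.
MATCH (p:Person)
WITH p.name AS name, p.employer AS employer, p.title AS title, collect(p) AS nodes
WHERE size(nodes) > 1
RETURN name, employer, title,
       [n IN nodes | n.linkedin] AS linkedinValues,  // `linkedin` is an assumed property name
       size(nodes) AS duplicateCount
ORDER BY duplicateCount DESC
```

Note that nodes missing one of the three properties are grouped together under null for that key, so it may be worth filtering those out first.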

ameyasoft · May 3, 2021, 6:54am

Use the APOC library to compute a similarity score. I created some sample data, and the results are below.

MERGE (a:Employee {name: "John", employer: "ABC", title: "Engineer", movie_genre: "Action"})

MERGE (a1:Employee {name: "John", employer: "ABC", title: "Engineer", movie_genre: "Thriller"})

MERGE (a2:Employee {name: "John", employer: "ABC", title: "Engineer", movie_genre: "Drama"})

MERGE (b:Employee {name: "John", employer: "ABC", title: "Manager", movie_genre: "Action"})

MERGE (b1:Employee {name: "John", employer: "ABC", title: "Neo4j Architect", movie_genre: "Thriller"})

MERGE (b2:Employee {name: "John", employer: "ABC", title: "Developer", movie_genre: "Drama"})

Similarity algorithms:

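// Compare each pair of Employee nodes once; id(b) > id(a) skips self-pairs and mirrored pairs
MATCH (a:Employee)
MATCH (b:Employee) WHERE id(b) > id(a)

// Normalize the concatenated name, employer, and title of each node
WITH apoc.text.clean(a.name + a.employer + a.title) AS norm1, apoc.text.clean(b.name + b.employer + b.title) AS norm2, a, b

// Scale the Jaro-Winkler score to an integer
WITH toInteger(apoc.text.jaroWinklerDistance(norm1, norm2) * 100) AS similarity, a, b, norm1, norm2
WITH id(a) AS ID1, id(b) AS ID2, similarity, norm1, norm2
RETURN ID1, norm1, ID2, norm2, similarity ORDER BY similarity DESC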

Result:

A similarity strength of 100 means a 100% match.
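To narrow the output to likely duplicates only, one option is to filter on the score before returning. The sketch below reuses the same APOC calls as the query above; the cutoff of 90 is an arbitrary assumption to tune against real data, and it relies on the reading given here that higher scores mean closer matches, which is worth verifying against the installed APOC version.

```cypher
// Keep only pairs whose similarity score meets an assumed cutoff.
MATCH (a:Employee)
MATCH (b:Employee) WHERE id(b) > id(a)
WITH a, b,
     apoc.text.clean(a.name + a.employer + a.title) AS norm1,
     apoc.text.clean(b.name + b.employer + b.title) AS norm2
WITH a, b, toInteger(apoc.text.jaroWinklerDistance(norm1, norm2) * 100) AS similarity
WHERE similarity >= 90   // assumed threshold, not from the original post
RETURN id(a) AS ID1, id(b) AS ID2, similarity
ORDER BY similarity DESC
```

Because every pair of nodes is compared, this can get slow on large labels; restricting the MATCH to a subset first may help.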
