Neo4j Cypher query to quickly find nodes with similar text property value

Let us assume there are millions of nodes with a certain label, say Person. All Person nodes have a property called fullName. I want to return the top 5 matching nodes for each node by comparing each Person's fullName with the others. For example, Person A has fullName 'Michaels', Person B 'Michael', Person C 'Michel', and so on. Using an APOC text function, I can return the top matching names based on their similarity score. I want the top matching nodes for every Person node (millions of nodes). I tried to write a Cypher query for this, but it takes so long that it never returns results. It would be very helpful if this could be done efficiently. Thanks

You can index the fullName with a fulltext index and then use fuzzy matching on that.

Something along the lines of:

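// Assumes a fulltext index named "name-index" already exists on :Person(fullName).
// In Lucene syntax an unquoted term~1 is a fuzzy match (edit distance 1); a quoted "phrase"~1 is a proximity match.
// Names containing spaces or Lucene special characters would need escaping first.
MATCH (p:Person)
CALL db.index.fulltext.queryNodes("name-index", p.fullName + '~1', {limit: 5}) YIELD node, score
RETURN count(*)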

Otherwise you can store a phonetic encoding of the name and aggregate/search on that.
With the APOC text functions you get text similarities, but you basically have to compute a cross product.
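
The phonetic-encoding idea can be sketched outside the database. The Python below is only an illustration: it assumes American Soundex as the encoding (the reply does not name one), and `soundex` and `block_by_code` are hypothetical helper names. Names are grouped by their code so that an expensive similarity function only has to run inside each group rather than across all pairs.

```python
from collections import defaultdict

# Soundex digit groups: letters in the same group sound alike.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
SOUNDEX_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code, ignoring non-letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = SOUNDEX_CODES.get(first, "")
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # a vowel resets prev, so a repeated code after it is kept
    return (first + "".join(digits) + "000")[:4]


def block_by_code(names):
    """Group names by Soundex code; only names in the same group get compared."""
    buckets = defaultdict(list)
    for name in names:
        buckets[soundex(name)].append(name)
    return dict(buckets)


names = ["michael123", "michael678", "michel124", "shawn456"]
buckets = block_by_code(names)
# michael123, michael678 and michel124 all map to M240; shawn456 maps to S500.
```

In the graph, the code would be stored as an indexed property on each Person node and used the same way. One caveat of this particular encoding: 'Michaels' maps to M242 rather than M240, so phonetic blocks can miss some near matches.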

Thanks for the response, Michael! However, for my use case, I wanna find similar matches even if the entire text is not matched. For example, let's have 4 nodes,
Person A has fullName - 'michael123'
Person B - 'michael678'
Person C - 'michel124'
Person D - 'shawn456'

In this case, if I query using Person A's fullName 'michael123', I won't get nodes B and C, which also have similar names. Ideally, I would want A to be matched with both B and C, with higher and lower scores respectively. I don't want to use APOC text similarity because it is too slow, so it would be helpful if this could be solved another way.

Try this:

match (a:Person {fullName: "michael123"})
match (b:Person) where b.fullName <> a.fullName

with a, b, apoc.text.clean(a.fullName) as norm1, apoc.text.clean(b.fullName) as norm2
with toInteger(apoc.text.jaroWinklerDistance(norm1, norm2) * 100) as similarity, a, b
with a, b,similarity where similarity >= 80 
return a.fullName as aname, b.fullName as bname, similarity

Result:

With person fullName = "michael678" instead of "michael1678" the similarity drops down to 88.

Hi @ameyasoft
Thank you for your response. This approach works well for a small amount of data.

But it takes far too long on a large number of nodes (millions of nodes).
That's why I mentioned that I didn't want to use the APOC text similarity functions.
It would be great if there were another way to quickly find the top matching nodes for each node among millions.
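
To put a rough number on why the all-pairs comparison does not scale, here is a small back-of-the-envelope calculation in Python. The node count of one million and the rate of one million comparisons per second are both assumptions chosen for illustration, not measurements.

```python
def unordered_pairs(n: int) -> int:
    """Number of distinct pairs among n nodes: n * (n - 1) / 2."""
    return n * (n - 1) // 2


n = 1_000_000                     # assumed number of Person nodes
pairs = unordered_pairs(n)        # 499_999_500_000 comparisons

comparisons_per_second = 1_000_000  # hypothetical throughput
days = pairs / comparisons_per_second / 86_400  # about 5.8 days
```

Any approach that avoids this cost has to narrow the candidates for each node first (through an index, a blocking key, or a nearest-neighbour search) and only then score the survivors.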

What about training a node embedding based on the fullName property, and then finding the top N nodes by embedding similarity?

How will I train embeddings on nodes having properties holding string values? (As any GDS algorithm would work only with numbers)

Also, would node embeddings work for disconnected nodes? (as the emails are not connected.)

What about using a transformer to get a vector representation of the fullName string, then storing that as a node property? Are your Person nodes disconnected?
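
The "string to vector, then nearest neighbours" idea can be illustrated without any model. The Python sketch below swaps the transformer for plain character-trigram count vectors, which is a much simpler technique and only a stand-in. `trigram_vector`, `cosine` and `top_matches` are hypothetical helper names, and the brute-force loop would need to be replaced by an approximate nearest-neighbour search to work on millions of nodes.

```python
import math
from collections import Counter


def trigram_vector(text: str, n: int = 3) -> Counter:
    """Sparse vector of character n-gram counts, padded so edges count too."""
    pad = "#" * (n - 1)
    padded = f"{pad}{text.lower()}{pad}"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * v[gram] for gram, count in u.items() if gram in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_matches(query: str, names, k: int = 5):
    """Return the k names most similar to query as (score, name), best first."""
    q = trigram_vector(query)
    scored = [(cosine(q, trigram_vector(name)), name) for name in names if name != query]
    return sorted(scored, reverse=True)[:k]


names = ["michael123", "michael678", "michel124", "shawn456"]
ranked = top_matches("michael123", names, k=2)
# ranked lists michael678 first and michel124 second.
```

On the four example names from earlier in the thread, this ranks 'michael678' above 'michel124' for the query 'michael123', which is the ordering asked for. Because each vector depends only on the node's own string, it does not matter whether the Person nodes are connected.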