Neo4j Cypher query to quickly find nodes with similar text property value

awesomeanonymously88 · October 5, 2021, 7:41am

Let us assume there are millions of nodes with a certain label, say Person. Every Person node has a property called fullName. For each node, I want to return the top 5 matching nodes by comparing that Person's fullName with all the others. For example, person A has fullName 'Michaels', person B 'Michael', person C 'Michel', and so on. Using an APOC text function, I can return the top matching names based on a similarity score. I want these top matches for every Person node (millions of nodes). I tried to write a Cypher query for this, but it takes so long that it never returns results. It would be very helpful to have an efficient and quick way to do this. Thanks

michael.hunger · October 5, 2021, 11:14am

You can index the fullName with a fulltext index and then use fuzzy matching on that.

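Something along the lines of:

// Assumes a fulltext index named "name-index" on :Person(fullName)
MATCH (p:Person)
// Quoted text followed by ~1 is a Lucene phrase query with slop 1; per-term fuzzy matching is written term~1 without quotes
CALL db.index.fulltext.queryNodes("name-index", '"' + p.fullName + '"~1', {limit: 5}) YIELD node, score
RETURN count(*)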

Otherwise you can store a phonetic version of the name and aggregate or search on that.
With the APOC text functions you get text similarities, but you are basically computing a cross product of all nodes.
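
To illustrate the phonetic idea outside the database, here is a minimal Python sketch that computes a classic Soundex key and groups names by it, so that detailed comparison only happens inside each group instead of across all pairs. The function names `soundex` and `bucket_by_key` are hypothetical helpers for this illustration, and Soundex is just one possible phonetic encoding; it is not necessarily what the poster had in mind.

```python
from collections import defaultdict

# Standard American Soundex digit groups.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key; characters outside a-z are ignored."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (key + "000")[:4]


def bucket_by_key(names):
    """Group names by phonetic key so only names sharing a key get compared."""
    buckets = defaultdict(list)
    for n in names:
        buckets[soundex(n)].append(n)
    return dict(buckets)


buckets = bucket_by_key(["Michaels", "Michael", "Michel", "Shawn"])
for key, members in buckets.items():
    print(key, members)
```

In this sketch 'Michael' and 'Michel' share a key while 'Michaels' lands in a different group, which shows how coarse a single phonetic key can be. Stored as an indexed property, such a key would let a query look up candidates by equality rather than scanning every pair.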

awesomeanonymously88 · October 6, 2021, 8:47pm

Thanks for the response, Michael! However, for my use case, I want to find similar matches even when the entire text does not match. For example, take 4 nodes:
Person A has fullName - 'michael123'
Person B - 'michael678'
Person C - 'michel124'
Person D - 'shawn456'

In this case, if I query using Person A's fullName 'michael123', I won't get nodes B and C, which also have similar names. Ideally, I would want A to be matched with both B and C, with a higher and a lower score respectively. I don't want to use APOC text similarity because it is too slow, so it would be helpful if this could be solved another way.

ameyasoft · October 6, 2021, 10:02pm

Try this:

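// Anchor node whose name is compared against all others
MATCH (a:Person {fullName: "michael123"})
MATCH (b:Person) WHERE b.fullName <> a.fullName
// Normalize both names (apoc.text.clean lowercases and strips non-alphanumeric characters)
WITH a, b, apoc.text.clean(a.fullName) AS norm1, apoc.text.clean(b.fullName) AS norm2
// Scale the score to an integer percentage and keep pairs scoring 80 or more
WITH a, b, toInteger(apoc.text.jaroWinklerDistance(norm1, norm2) * 100) AS similarity
WHERE similarity >= 80
RETURN a.fullName AS aname, b.fullName AS bname, similarity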

Result:

With person fullName = "michael678" instead of "michael1678" the similarity drops down to 88.

awesomeanonymously88 · October 9, 2021, 12:05am

Hi @ameyasoft
Thank you for your response. This approach works well for a small amount of data.

But it takes far too long when run on a large number of nodes (millions).
That's why I mentioned that I didn't want to use the APOC text similarity functions.
It would be great if there were another way to quickly find the top matching nodes for each of millions of nodes.
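
One general way to avoid comparing every pair is to index character trigrams and only score names that share at least one trigram. The following is a minimal Python sketch of that idea, run outside Neo4j on the four example names from earlier in the thread; `trigrams`, `build_index`, and `top_matches` are hypothetical helper names, and Jaccard overlap is an assumed scoring choice, not something proposed by the posters.

```python
from collections import Counter, defaultdict


def trigrams(text: str) -> set:
    """Set of overlapping 3-character substrings of the lowercased text."""
    t = text.lower()
    return {t[i:i + 3] for i in range(len(t) - 2)}


def build_index(names):
    """Inverted index: trigram -> set of names containing it."""
    index = defaultdict(set)
    for name in names:
        for gram in trigrams(name):
            index[gram].add(name)
    return index


def top_matches(name, index, k=5):
    """Rank names sharing at least one trigram with `name` by Jaccard score."""
    grams = trigrams(name)
    shared = Counter()
    for gram in grams:
        for other in index.get(gram, ()):
            if other != name:
                shared[other] += 1
    scored = []
    for other, common in shared.items():
        union = len(grams) + len(trigrams(other)) - common
        scored.append((other, common / union))
    # Highest score first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


names = ["michael123", "michael678", "michel124", "shawn456"]
index = build_index(names)
for name in names:
    print(name, top_matches(name, index))
```

Here 'michael123' is matched with 'michael678' first and 'michel124' second, while 'shawn456' shares no trigram with the others and is never scored against them. The work per name depends on how many names share its trigrams rather than on the total number of names, although very common trigrams can still produce large candidate sets.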

lingvisa · October 9, 2021, 4:51pm

What about training a node embedding based on the node's fullName property, and then finding the top N nodes by node similarity?

awesomeanonymously88 · October 10, 2021, 10:55am

How will I train embeddings on nodes having properties holding string values? (As any GDS algorithm would work only with numbers)

Also, would node embeddings work for disconnected nodes? (as the emails are not connected.)

lingvisa · October 10, 2021, 3:20pm

What about using a transformer to get a vector representation of your fullName string, and then storing that as a node property? Are your Person nodes disconnected?
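
As a rough illustration of turning a string into a numeric vector, here is a minimal Python sketch that does not use a transformer. It swaps in a much simpler stand-in, hashing character trigrams into a fixed-length count vector, and compares vectors with cosine similarity. `name_vector`, `cosine`, and the dimension of 256 are assumptions made for this example only.

```python
import math
import zlib

DIM = 256  # hypothetical vector size chosen for this sketch


def name_vector(text: str, dim: int = DIM) -> list:
    """Hash character trigrams into a fixed-length count vector."""
    t = text.lower()
    vec = [0.0] * dim
    for i in range(len(t) - 2):
        # crc32 is deterministic across runs, unlike Python's built-in hash()
        bucket = zlib.crc32(t[i:i + 3].encode("utf-8")) % dim
        vec[bucket] += 1.0
    return vec


def cosine(a, b) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


query = name_vector("michael123")
for other in ["michael678", "michel124", "shawn456"]:
    print(other, round(cosine(query, name_vector(other)), 3))
```

Because each vector depends only on the string itself, this works for disconnected nodes. A list of floats like this could be stored as a node property and handed to a nearest-neighbour search; a learned embedding would replace `name_vector` while the comparison step stays the same.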

michael.h.schoenfiel · November 30, 2021, 3:42pm

I would like to know a solution to this as well. One approach I took (but it doesn't really solve this exactly) was to create a new FullName node (which could also be done as a set of Metaphone / Soundex nodes -- not sure if that works with numbers or not) and link my accounts to that node... then could maybe collapse a path between those that share a phonetic encoding as similar name?
