Similarity query: comparing a string to a property on a node

peggyw
Node

I have a sequence string 'TTCTTGAAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCT'

I have nodes with the label Sequence and property seqFull which contains a large DNA String.

I want to return the nodes and their similarity scores where the score is greater than 0.75 (75%), that is, where the input string is similar to a substring within the larger string stored on a node in Neo4j.

I am not looking for an exact match with CONTAINS, but for something like CONTAINS that matches at 75% similarity or greater.

3 REPLIES

ameyasoft
Graph Maven
You can use apoc.text.jaroWinklerDistance to get the similarity score; in my experience it gives much better similarity results.
I use it in a production database for a different purpose. It requires the APOC library.

Here is an example with two sequence strings that I got from the internet:

WITH "gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattg" AS seq1,
     "gaaccgccaatagacaacatatgtaacatatttaggatatacctcgaaaataataaaccg" AS seq2
RETURN toInteger(apoc.text.jaroWinklerDistance(seq1, seq2) * 100) AS similarity

Result:
similarity: 78

The same two sequences with a space after every ten bases:

WITH "gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg" AS seq1,
     "gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg" AS seq2
RETURN toInteger(apoc.text.jaroWinklerDistance(seq1, seq2) * 100) AS similarity

Result:
similarity: 80
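
To apply the same function to the original question, a query along these lines might work. This is a minimal sketch, not a tested solution: the `$query` parameter name is an assumption, the Sequence label, seqFull property and 0.75 threshold come from the question, and it assumes that higher values mean more similar, as in the results above. It is worth checking what your APOC version returns for two identical strings before relying on the threshold. Jaro-Winkler also compares the two strings as wholes, so a short input compared with a much longer seqFull may score lower than the matching region alone would.

```cypher
// Sketch only: $query is a hypothetical parameter holding the input sequence.
MATCH (s:Sequence)
WITH s, apoc.text.jaroWinklerDistance($query, s.seqFull) AS score
// Assumes a higher score means more similar, as in the results above.
WHERE score > 0.75
RETURN s, score
ORDER BY score DESC
```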

Thank you. Sorry it took me so long to respond; I got on a new project. This is exactly what I am looking for.

ameyasoft
Graph Maven

Thanks for your appreciation. In my previous career I worked on biomembranes and surfactant-oil miscibility. Through these studies I developed a lot of environmentally friendly solutions. THOSE WERE THE DAYS!! LIFE GOES ON..!

Now I am purely into Neo4j!
Let me know if you need any help; I am very happy to help.

Thanks