I have a sequence string 'TTCTTGAAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCT'
I have nodes with the label Sequence and property seqFull which contains a large DNA String.
I want to return the nodes together with a similarity score, keeping only those where the score is greater than 0.75 (75%), that is, where the input string is similar to some substring of the larger string stored on a node in Neo4j.
I am not looking for an exact match of the kind CONTAINS gives. I want something like CONTAINS that also matches when the similarity is 75% or greater.
You can use apoc.text.jaroWinklerDistance to get the similarity; in my experience it gives much better results.
I am using this in a production database for a different purpose. It requires the APOC library.
Here is an example with two sequence strings that I got from the internet:
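Before relying on the score, it may be worth confirming that APOC is installed and checking how the function behaves on identical strings in your version. This is only a minimal sketch, and the test string 'GATTACA' is an illustrative value, not something from the question. I believe some APOC releases return a distance (0 for identical strings) rather than a similarity (1 for identical strings), so treat the orientation of the scale as something to verify rather than assume.

```cypher
// Run each statement separately.
// Confirm that APOC is available; this fails if the library is not installed.
RETURN apoc.version() AS apocVersion;

// Compare a string with itself to see which end of the 0..1 scale means "identical".
RETURN apoc.text.jaroWinklerDistance('GATTACA', 'GATTACA') AS identicalScore;
```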
WITH "gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattg" AS seq1,
     "gaaccgccaatagacaacatatgtaacatatttaggatatacctcgaaaataataaaccg" AS seq2
RETURN toInteger(apoc.text.jaroWinklerDistance(seq1, seq2) * 100) AS similarity
Result:
similarity: 78
The same two sequences, split into blocks of ten bases separated by spaces:
WITH "gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg" AS seq1,
     "gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg" AS seq2
RETURN toInteger(apoc.text.jaroWinklerDistance(seq1, seq2) * 100) AS similarity
Result:
similarity: 80
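To apply this to the original question, one possible approach is to slide a window the length of the input string along seqFull and keep the best score per node. The sketch below assumes the Sequence label and seqFull property from the question, and assumes that the function returns a similarity (higher means more similar), as in the examples above. If your APOC version returns a distance instead, the max and the threshold would need to be inverted. It scores every window, so it may be slow on very long sequences.

```cypher
WITH 'TTCTTGAAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCT' AS query
MATCH (s:Sequence)
// One row per window start; a seqFull shorter than the query yields an empty range and no rows.
UNWIND range(0, size(s.seqFull) - size(query)) AS start
// Upper-case the window in case the stored sequences use a different case than the query.
WITH s, query, toUpper(substring(s.seqFull, start, size(query))) AS window
// Keep the best-scoring window per node (assumes higher = more similar).
WITH s, max(apoc.text.jaroWinklerDistance(query, window)) AS similarity
WHERE similarity > 0.75
RETURN s, similarity
ORDER BY similarity DESC;
```

The window has the same length as the query, so insertions or deletions inside the match are only tolerated to the extent that Jaro-Winkler itself tolerates them.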
Thanks for your appreciation. Earlier in my career I worked on biomembranes and surfactant-oil miscibility. Through those studies I developed a lot of environmentally friendly solutions. Those were the days! Life goes on!
Now I am purely into Neo4j!
Let me know if you need any help; I am very happy to help.