I have a sequence string 'TTCTTGAAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCT'
I have nodes with the label Sequence and property seqFull which contains a large DNA String.
I want to return the nodes together with a similarity score, keeping only nodes whose score is greater than 0.75 (75%), that is, nodes where the input string is similar to some stretch of the larger string stored on the node in Neo4j.
I am not looking for an exact match with CONTAINS, but for something like CONTAINS that accepts matches at 75% similarity or greater.
You can use apoc.text.jaroWinklerDistance to get the similarity, and it gives a much better similarity score. I use it in a production database for a different purpose. It requires the APOC library. Here is an example with two sequence strings that I got from the internet:

WITH "gatcctccatatacaacggtatctccacctcaggtttagatctcaacaacggaaccattg" AS seq1,
     "gaaccgccaatagacaacatatgtaacatatttaggatatacctcgaaaataataaaccg" AS seq2
RETURN toInteger(apoc.text.jaroWinklerDistance(seq1, seq2) * 100) AS similarity

Result: similarity: 78

The same two sequences written in space-separated groups of ten:

WITH "gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg" AS seq1,
     "gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg" AS seq2
RETURN toInteger(apoc.text.jaroWinklerDistance(seq1, seq2) * 100) AS similarity

Result: similarity: 80
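Applied to the nodes in the question, the same call might look roughly like the sketch below. It assumes the Sequence label and seqFull property from the question, and it assumes, as the results above suggest, that the function returns a value between 0 and 1 where higher means more alike. That last point is worth checking on your APOC version by scoring a string against itself, since the function is named as a distance.

```cypher
// Sketch only: assumes higher return values mean "more similar" on your APOC version.
WITH 'TTCTTGAAGACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCT' AS query
MATCH (s:Sequence)
// Score the query against the whole stored sequence.
WITH s, apoc.text.jaroWinklerDistance(query, s.seqFull) AS similarity
// Keep only nodes above the 75% threshold from the question.
WHERE similarity > 0.75
RETURN s, similarity
ORDER BY similarity DESC
```

Note that this scores the query against the entire seqFull value. If seqFull is much longer than the query, the score may stay low even when a closely matching region exists, so comparing against pieces cut with substring(s.seqFull, start, size(query)) may be closer to what was asked, at a higher cost.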
Thanks for your appreciation. In my previous career I worked on biomembranes and surfactant-oil miscibility. Through those studies, I developed a lot of environmentally friendly solutions. THOSE WERE THE DAYS!! LIFE GOES ON..!
Now I am purely into Neo4j!
Let me know if you need any help; I am very happy to help.