Calculate Jaccard similarity

abhik1368
Node

I have a file with node IDs and a property called MACCS containing 0s and 1s, and I want to calculate the Jaccard similarity. What is an efficient way to do it? I have attached the file link here. This is the query I am using to load the file, where gph_conn is the connection:

gph_conn.query("""
// USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM 'file:///D:/Github/dbtest.csv' AS row
UNWIND split(row.MACCS, ',') AS i
CREATE (m:Mol {DrugBank_ID: row.DrugBank_ID, MACCS: toInteger(i)})
""")
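One thing to watch in the load query above: UNWIND emits one row per element of the split list, so the CREATE runs once per bit and produces one :Mol node per bit rather than one node per molecule. The sketch below, in plain Python with no database involved, shows the alternative shape of one record per molecule holding the positions of its 1 bits. It assumes the MACCS column is a comma-separated string of 0s and 1s (as the split on ',' suggests); the helper name `maccs_on_bits` and the sample rows are made up for illustration.

```python
def maccs_on_bits(maccs, sep=","):
    """Return the positions of the bits set to 1 in a 0/1 fingerprint string."""
    bits = [int(tok) for tok in maccs.split(sep) if tok.strip()]
    return [pos for pos, bit in enumerate(bits) if bit == 1]


# Hypothetical rows standing in for dbtest.csv (IDs and bits are made up).
rows = [
    {"DrugBank_ID": "MOL_A", "MACCS": "0,1,1,0,1"},
    {"DrugBank_ID": "MOL_B", "MACCS": "1,1,0,0,1"},
]

# One record per molecule, instead of one record per bit.
molecules = {row["DrugBank_ID"]: maccs_on_bits(row["MACCS"]) for row in rows}
print(molecules)
```

Storing on-bit positions rather than the raw 0/1 values matters for set-based similarity: two raw bit vectors both contain only the values 0 and 1, so comparing them as sets of values says nothing about which bits they share.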

Then I want to call gds.similarity.jaccard to compute the similarity between one node and all of the other nodes. The query below doesn't work because of the format of the

MATCH (n1:Mol {DrugBank_ID: 'DB00146'})
WITH n1, collect(n1:MACCS) AS fp1
MATCH (n2:Mol)
WITH n2, collect(n2:MACCS) as fp2
RETURN n1,n2,
gds.similarity.jaccard(toIntegerList(n1.ECFP4), toIntegerList(n2.ECFP4)) AS jaccard;

The query above should return similarity values. Is there a way to calculate the similarity faster with indexes? I want to do this for 10 million rows.

2 REPLIES

abhik1368
Node
The correct query is below:
MATCH (n1:Mol {DrugBank_ID: 'DB00146'})
WITH n1, collect(n1:MACCS) AS fp1
MATCH (n2:Mol)
WITH n2, collect(n2:MACCS) as fp2
RETURN n1,n2,
gds.similarity.jaccard(toIntegerList(n1.MACCS), toIntegerList(n2.MACCS)) AS jaccard;

There doesn’t seem to be a need to collect fp1 and fp2, since they are not used and they should be empty.
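Two further points about that query may be worth checking. First, the second WITH carries only n2 and fp2 forward, so n1 is no longer in scope at the RETURN. Second, as far as the GDS documentation describes it, gds.similarity.jaccard compares its two lists as collections of values, so it would need the positions of the 1 bits rather than the raw 0/1 vector. The sketch below is a plain Python illustration of the one-against-all comparison under that assumption, not the GDS implementation; the function names and the sample fingerprints are hypothetical.

```python
def jaccard(a, b):
    """Jaccard (Tanimoto) similarity of two collections of on-bit positions."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two all-zero fingerprints
    return len(set_a & set_b) / len(union)


def rank_against(query_id, fingerprints):
    """Compare one molecule with every other one, most similar first."""
    query = fingerprints[query_id]
    scores = [
        (other_id, jaccard(query, fp))
        for other_id, fp in fingerprints.items()
        if other_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical fingerprints: molecule ID -> positions of the 1 bits.
fingerprints = {
    "MOL_A": [1, 2, 4],
    "MOL_B": [0, 1, 4],
    "MOL_C": [1, 2, 4, 7],
}
print(rank_against("MOL_A", fingerprints))
```

On the index question: an index on DrugBank_ID would most likely only speed up finding the query molecule. A one-against-all comparison like this still has to visit every other fingerprint once.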