I have a file with node IDs and a property called MACCS made up of 0s and 1s. I want to calculate Jaccard similarity. What is an efficient way to do it? I have linked the file here. I want to load the file; the query I am using is below, where gph_conn is the connection.
gph_conn.query("""
// USING PERIODIC COMMIT 100
LOAD CSV WITH HEADERS FROM 'file:///D:/Github/dbtest.csv' AS row
UNWIND SPLIT(row.MACCS, ',') AS i
CREATE (m:Mol {DrugBank_ID: row.DrugBank_ID,
MACCS:toInteger(i)
}
)
""")
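One thing to note about the load query above: `UNWIND` emits one row per bit, so the `CREATE` runs once per bit and each `Mol` node ends up holding a single integer rather than the whole fingerprint. A minimal sketch of the shape that is probably wanted (one list per `DrugBank_ID`) is below. It is plain Python, the sample row is made up, and the `params` dict is only an assumption about how the values might be passed on as query parameters.

```python
def parse_maccs(raw: str) -> list[int]:
    """Turn a comma-separated bit string such as "0,1,1,0" into [0, 1, 1, 0]."""
    return [int(bit) for bit in raw.split(",") if bit.strip() != ""]


def on_bits(bits: list[int]) -> list[int]:
    """Return the positions of the bits that are set to 1."""
    return [idx for idx, bit in enumerate(bits) if bit == 1]


# Hypothetical CSV row, shaped like the columns used in the LOAD CSV query.
row = {"DrugBank_ID": "DB00146", "MACCS": "0,1,1,0,1"}

fingerprint = parse_maccs(row["MACCS"])

# One node per molecule: the whole fingerprint is a single list property.
params = {"id": row["DrugBank_ID"], "maccs": fingerprint}
```

Storing the positions of the set bits (`on_bits`) instead of the raw 0/1 vector is an alternative worth considering, since set-based similarity compares list members rather than positions.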
Then I want to call gds.similarity.jaccard to compute the similarity between one node and all of the other nodes. The query below doesn't work because of the format of the
MATCH (n1:Mol {DrugBank_ID: 'DB00146'})
WITH n1, collect(n1:MACCS) AS fp1
MATCH (n2:Mol)
WITH n2, collect(n2:MACCS) as fp2
RETURN n1,n2,
gds.similarity.jaccard(toIntegerList(n1.ECFP4), toIntegerList(n2.ECFP4)) AS jaccard;
The query above should return similarity values. Is there a way to calculate similarity faster with indexes? I want to do this for 10 million rows.
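To pin down the expected values, here is a minimal reference sketch of Jaccard similarity on bit fingerprints, comparing one query molecule against the rest. It is plain Python and not the GDS implementation; the fingerprints and the IDs `X` and `Y` are made up. It assumes the similarity is taken over the positions of the set bits, because a set-based Jaccard over raw 0/1 values would only ever compare the members 0 and 1. Note also that in the query above `n1:MACCS` is a label check rather than the property access `n1.MACCS`, `n1` is no longer in scope after the second `WITH`, and the `RETURN` refers to `ECFP4` while the load step only sets `MACCS`.

```python
def jaccard(bits_a: list[int], bits_b: list[int]) -> float:
    """Jaccard similarity of two 0/1 fingerprints, based on their set-bit positions."""
    set_a = {idx for idx, bit in enumerate(bits_a) if bit}
    set_b = {idx for idx, bit in enumerate(bits_b) if bit}
    union = set_a | set_b
    if not union:
        # Both fingerprints are all zeros; treat the similarity as 0.0 (an assumption).
        return 0.0
    return len(set_a & set_b) / len(union)


# Made-up fingerprints for illustration only.
query_fp = [0, 1, 1, 0, 1]
others = {
    "X": [0, 1, 0, 0, 1],
    "Y": [1, 0, 0, 1, 0],
}

# Score every other molecule against the query and rank by similarity, highest first.
scores = {mol_id: jaccard(query_fp, fp) for mol_id, fp in others.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

This is only useful as a correctness check on a small sample. A pure-Python loop like this would not be a practical way to score 10 million rows.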