I am a Neo4j newbie and I am working on entity resolution in my graph.
I have person nodes with a first name, last name and date of birth. I am looking to create an [:IS_SIMILAR] relationship between person nodes that represent the same person.
I have 1M person nodes.
I am trying to use a cosine similarity function to resolve the entities. I have created the embeddings but my cosine similarity function is taking a very long time to run.
The code is provided below:
MATCH (p1:Person)
MATCH (p2:Person)
WHERE p1 <> p2
WITH p1 AS person1, p1.embedding AS p1Data, p2 AS person2, p2.embedding AS p2Data
WITH person1, person2, gds.similarity.cosine(p1Data, p2Data) AS cosineSimilarity
WHERE cosineSimilarity > 0.8
MERGE (person1) -[s:IS_SIMILAR]- (person2)
RETURN count(s)
I know this will create a Cartesian product and try to evaluate the similarity of every pair of nodes. I would be very grateful if you could let me know how I can optimise this, as it is taking hours to run.
Here is a more optimized solution using the APOC plugin:
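CALL apoc.periodic.iterate(
  // Outer statement: streams candidate pairs, visiting each unordered pair once.
  "
  MATCH (p1:Person)
  MATCH (p2:Person)
  WHERE id(p1) < id(p2)
  WITH p1, p2, gds.similarity.cosine(p1.embedding, p2.embedding) AS cosineSimilarity
  WHERE cosineSimilarity > 0.8
  RETURN p1, p2, cosineSimilarity
  ",
  // Inner statement: runs once per streamed row, committed in batches.
  "
  MERGE (p1)-[s:IS_SIMILAR]->(p2)
  SET s.similarity = cosineSimilarity
  ",
  // parallel is false because concurrent batches that write relationships
  // to the same nodes compete for locks and can deadlock.
  {batchSize: 10000, parallel: false}
);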
If you want to use KNN, you will first have to project your graph into memory and then run the algorithm on the projection. The in-memory approach should normally be faster.
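A minimal sketch of the KNN route, assuming GDS 2.x and that the embedding property is stored as a list of floats. The graph name 'persons' and the topK value of 10 are illustrative choices, not taken from the thread. KNN is approximate and keeps at most topK neighbours per node, so it will not reproduce the exhaustive comparison exactly, and it is worth checking the scale of the similarity score it reports before reusing the 0.8 threshold.

```cypher
// Project Person nodes with their embedding into an in-memory graph.
// 'persons' is a hypothetical graph name; '*' projects any relationships
// that exist between the projected nodes.
CALL gds.graph.project(
  'persons',
  'Person',
  '*',
  {nodeProperties: ['embedding']}
);

// Run KNN on the projection and write the results back as IS_SIMILAR
// relationships. topK: 10 is an assumed value; tune it for your data.
CALL gds.knn.write('persons', {
  nodeProperties: ['embedding'],
  topK: 10,
  similarityCutoff: 0.8,
  writeRelationshipType: 'IS_SIMILAR',
  writeProperty: 'similarity'
})
YIELD nodesCompared, relationshipsWritten
RETURN nodesCompared, relationshipsWritten;

// Free the memory held by the projection once the results are written.
CALL gds.graph.drop('persons');
```

Because KNN only compares each node against a limited set of candidate neighbours, it avoids the full Cartesian product, which is where the speed-up comes from.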
I was going to mention it, but @Cobra already has it in his solution. Try using ‘id(p1) < id(p2)’ instead of ‘p1 <> p2’, as the latter evaluates each pair of person nodes twice. This is because in the Cartesian product, the pair (a, b) also appears as (b, a), so your query computes twice as many similarities as necessary. Because your MERGE pattern is undirected, the second evaluation should match the relationship created by the first rather than create a duplicate, but the comparison work is still doubled. Using the node id inequality keeps only one of the two orderings.
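To see the effect of the two conditions without touching any data, here is a small sketch that uses the integers 1 to 4 as stand-ins for node ids. It counts how many pairs survive each filter.

```cypher
// Build the Cartesian product of four stand-in ids and count the pairs
// that pass each condition.
UNWIND range(1, 4) AS a
UNWIND range(1, 4) AS b
RETURN
  sum(CASE WHEN a <> b THEN 1 ELSE 0 END) AS pairsWithNotEqual,
  sum(CASE WHEN a < b THEN 1 ELSE 0 END) AS pairsWithLessThan;
```

With four ids, the first filter keeps 12 pairs and the second keeps 6. For n nodes the counts are n(n-1) and n(n-1)/2, so for 1M person nodes the inequality on ids brings the work down from roughly 10^12 comparisons to roughly 5 x 10^11. That is half the work, but still quadratic.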