K nearest neighbor with cypher and euclidian distance

Hello everyone!

I am currently trying to make recommendations based on collaborative Filtering. I have books users and authors in my database. The user are connected to the books with RATED and ADDED TO LIST. The books and authors with WRITTEN_BY.

// k-nearest neighbor

MATCH (u1:User {user_id: 11927})-[r:RATED]->(b:Book)

WITH u1,r.rating AS rating_u1

MATCH (u1)-[r1:RATED]->(b:Book)<-[r2:RATED]-(u2)

WITH u1, u2, COLLECT({r1: r1, r2: r2}) AS ratings WHERE size(ratings) > 5

MATCH (u2)-[r:RATED]->(b:Book)

WITH u1, u2, r.rating AS rating_u2, ratings

UNWIND ratings AS r

WITH sqrt(sum((r.r1.rating - r.r2.rating)^2)) AS euclidean_distance,

u1, u2 WHERE euclidean_distance <> 0

WITH u1, u2, 1 / (1 + euclidean_distance) AS similarity

ORDER BY similarity DESC LIMIT 10

MATCH (u2)-[r:RATED]->(b:Book)

WHERE NOT EXISTS((u1)-[:RATED]->(b))

RETURN b.title, SUM(similarity * r.rating) AS score

ORDER BY score DESC LIMIT 25

With this cypher queries I want to compare each pair of users who have rated at least 5 common books. Then, I want to compare the 10 most similar users to the user with ID 11927. But there must be something wrong with my code, it is not doing what it should. Have you an idea how I could solve this? (without graph data science playground or AuraDS. Thank you

You some unnecessary querying going on in your query. For instance, after your first match, you can start the next match from b. The size of the collection should be the same as the number of u2 nodes.

I think the KNN metric would be practical for a given node, but not if you want to find the similarity between each pair of nodes, as I think that would be inefficient in cypher.

As an alternative, you could measure the similarity between pairs of nodes using the Jaccard measure. This could be applied in your problem by counting the number of books reviewed in common between to reviewers, compared to all the books the two reviewers reviewed.

I think the following query derives this metric (I did it kinda quickly).

match (r1:Reviewer)-[:REVIEWED]->(b:Book)<-[:REVIEWED]-(r2:Reviewer)
where id(r1) < id(r2)
with r1, r2, count(b) as intersection_count, count{(r1)-[:REVIEWED]->(:Book)} as r1_count, count{(r2)-[:REVIEWED]->(:Book)} as r2_count
return *, toFloat(intersection_count) /  toFloat(r1_count + r2_count - intersection_count) as Jaccard_similarity

Test Data:

create(a0:Reviewer{id:0})
create(a1:Reviewer{id:1})
create(a2:Reviewer{id:2})
create(a3:Reviewer{id:3})
create(b0:Book{id:0})
create(b1:Book{id:1})
create(b2:Book{id:2})
create(b3:Book{id:3})
create(a0)-[:REVIEWED]->(b0)
create(a0)-[:REVIEWED]->(b1)
create(a1)-[:REVIEWED]->(b0)
create(a1)-[:REVIEWED]->(b1)
create(a1)-[:REVIEWED]->(b2)
create(a2)-[:REVIEWED]->(b1)
create(a2)-[:REVIEWED]->(b2)

Result:

You can add your constraint to only compare reviews that have at least 5 books reviewed in common by adding a 'where' condition.

match (r1:Reviewer)-[:REVIEWED]->(b:Book)<-[:REVIEWED]-(r2:Reviewer)
where id(r1) < id(r2)
with r1, r2, count(b) as intersection_count, count{(r1)-[:REVIEWED]->(:Book)} as r1_count, count{(r2)-[:REVIEWED]->(:Book)} as r2_count
where intersection_count > 4
return *, toFloat(intersection_count) /  toFloat(r1_count + r2_count - intersection_count) as Jaccard_similarity

Hope this helps.

1 Like

Thank you very much!!!

1 Like