I am new to Cypher and trying to figure out how one would achieve some kinds of queries that are relatively straightforward in SQL. One that comes to mind is computing Jaccard Similarity. I know there is a special procedure that does that, but say I wanted to do that natively.
In SQL, I could just:
WITH Edges AS ( SELECT m1.movie as movie1, m2.movie as movie2 FROM movie_actors as m1 LEFT JOIN movie_actors as m2 ON m1.actor = m2.actor ), Intersection AS ( SELECT movie1, movie2, COUNT(distinct actor) as num FROM Edges GROUP BY 1 ), Counts AS ( SELECT movie, COUNT(distinct actor) as num FROM Edges GROUP BY 1 ) SELECT movie1, movie2, (I.num) / (C1.num + C2.num - I.num) as similarity FROM Intersection I JOIN Counts C1 ON I.movie1 = C1.movie JOIN Counts C2 ON I.movie2 = C2.movie
How would you do such chaining with the WITHs in Cypher? I would need to store and reuse intermediate results and I wasn't sure how to do that
Solved! Go to Solution.
What about something like this:
MATCH (i) WHERE i:Actor AND i:Director WITH count(i) AS Intersection MATCH (u) WHERE u:Actor OR u:Director WITH Intersection, count(u) AS Union RETURN Intersection, Union, toFloat(Intersection) / Union AS `Jaccard index`
It doesn't look like very complex though.
Intermediate results are stored using the WITH-statement.
This is not quite what I had in mind -- For each pair of movies (m1, m2) I want to compute the Jaccard Similarity of their actors (sorry I should've clarified this earlier). For example, Matrix 1 and Matrix 2 might share 80% of the actors (intersection over union of the actors).
I know how I could compute the intersection:
MATCH (m2:Movie)<-[:ACTED_IN]-(i:Person)-[:ACTED_IN]->(m1:Movie) RETURN m1, m2, count(distinct i) AS Intersection
And this seems to work more or less.
Doing a UNION of the actors between all pairs of movies is where I am having difficulties.
I tried something like this:
MATCH (m2:Movie)<-[:ACTED_IN]-(i:Person)-[:ACTED_IN]->(m1:Movie) WHERE id(m1) = 9 AND id(m2) = 10 WITH m1, m2, count(distinct i) AS Intersection MATCH (m3:Movie)<-[:ACTED_IN]-(i:Person) WITH m3, COUNT(distinct i) as count_actors WHERE m1 = m3 OR m2 = m3 RETURN m1, m2, Intersection / SUM(count_actors)
But I get this error:
In a WITH/RETURN with DISTINCT or an aggregation, it is not possible to access variables declared before the WITH/RETURN: m1 (line 6, column 7 (offset: 235)) "WHERE m1 = m3 OR m2 = m3"
Essentially in SQL, you can reference any of the preceding WITH clauses by their name, but it seems that in Cypher, you can only reference the one immediately before. Is that true? That would restrict the flexibility of the language quite a bit.
your line 6 of
WITH m3, COUNT(distinct i) as count_actors
effectively indicates only carry forward variables
But line 7 refers to variables
m2. and line 8 refers to variable
Can you change line 6 to
with m1, m2, Intersection, m3, COUNT(disinct i) as count_actors
MATCH (m:Movie)<-[:ACTED_IN]-(i:Person)) where id(m1) = 9 with m1, collect(id(i)) as person1 MATCH (m2:Movie)<-[:ACTED_IN]-(j:Person) where id(m2) = 10 with m1, person1, m2, collect(id(j)) as person2 with m1, m2, apoc.coll.intersection(person1, person2) as diff, apoc.coll.union(person1, person2) as u1 return m1.name as movie1, m2.name as movie2, (toFloat(size(diff))/ toFloat(size(u1))) as jaccard