Complex SQL-like queries in Cypher

vnjogani · December 20, 2021, 11:03am

I am new to Cypher and trying to figure out how one would achieve some kinds of queries that are relatively straightforward in SQL. One that comes to mind is computing Jaccard Similarity. I know there is a special procedure that does that, but say I wanted to do that natively.

In SQL, I could just:

WITH 
Edges AS (
SELECT 
m1.movie as movie1, 
m2.movie as movie2
FROM movie_actors as m1 
LEFT JOIN movie_actors as m2 
ON m1.actor = m2.actor
),
Intersection AS (
SELECT movie1, movie2, COUNT(distinct actor) as num
FROM Edges
GROUP BY 1
),
Counts AS (
SELECT movie, COUNT(distinct actor) as num
FROM Edges
GROUP BY 1
)
SELECT movie1, movie2, (I.num) / (C1.num + C2.num - I.num) as similarity
FROM Intersection I 
JOIN Counts C1 ON I.movie1 = C1.movie 
JOIN Counts C2 ON I.movie2 = C2.movie

How would you do such chaining with the WITHs in Cypher? I would need to store and reuse intermediate results and I wasn't sure how to do that

martin3 · December 21, 2021, 10:15am

What about something like this:

MATCH (i)
WHERE i:Actor AND i:Director
WITH count(i) AS Intersection
MATCH (u)
WHERE u:Actor OR u:Director
WITH Intersection, count(u) AS Union
RETURN Intersection, Union, toFloat(Intersection) / Union AS `Jaccard index`

It doesn't look like very complex though.

Intermediate results are stored using the WITH-statement.

vnjogani · December 22, 2021, 7:41am

This is not quite what I had in mind -- For each pair of movies (m1, m2) I want to compute the Jaccard Similarity of their actors (sorry I should've clarified this earlier). For example, Matrix 1 and Matrix 2 might share 80% of the actors (intersection over union of the actors).

I know how I could compute the intersection:

MATCH (m2:Movie)<-[:ACTED_IN]-(i:Person)-[:ACTED_IN]->(m1:Movie)
RETURN m1, m2, count(distinct i) AS Intersection

And this seems to work more or less.

Doing a UNION of the actors between all pairs of movies is where I am having difficulties.

vnjogani · December 22, 2021, 7:46am

I tried something like this:

MATCH (m2:Movie)<-[:ACTED_IN]-(i:Person)-[:ACTED_IN]->(m1:Movie)
WHERE id(m1) = 9 AND id(m2) = 10
WITH m1, m2, count(distinct i) AS Intersection
MATCH (m3:Movie)<-[:ACTED_IN]-(i:Person)
WITH m3, COUNT(distinct i) as count_actors
WHERE m1 = m3 OR m2 = m3
RETURN m1, m2, Intersection / SUM(count_actors)

But I get this error:

In a WITH/RETURN with DISTINCT or an aggregation, it is not possible to access variables declared before the WITH/RETURN: m1 (line 6, column 7 (offset: 235))
"WHERE m1 = m3 OR m2 = m3"

Essentially in SQL, you can reference any of the preceding WITH clauses by their name, but it seems that in Cypher, you can only reference the one immediately before. Is that true? That would restrict the flexibility of the language quite a bit.

dana_canzano · December 22, 2021, 3:32pm

@vnjogani

your line 6 of

WITH m3, COUNT(distinct i) as count_actors

effectively indicates only carry forward variables m3 and count_actors.

But line 7 refers to variables m1 and m2. and line 8 refers to variable Intersection

Can you change line 6 to

with m1, m2, Intersection, m3, COUNT(disinct i) as count_actors

martin3 · December 22, 2021, 11:28pm

martin3:

MATCH (i)
WHERE i:Actor AND i:Director
WITH count(i) AS Intersection
MATCH (u)
WHERE u:Actor OR u:Director
WITH Intersection, count(u) AS Union
RETURN Intersection, Union, toFloat(Intersection) / Union AS `Jaccard index`

I did not have time to test the following pattern, could something like this work to calculate the union?

MATCH (m2:Movie)<-[:ACTED_IN]-(i:Person)-[:ACTED_IN]->(m1:Movie)
WITH m1, m2
MATCH (u:Person)-[:ACTED_IN]->(m:Movie)
WHERE ID(m) = ID(m1) OR ID(m) = ID(m2)

ameyasoft · December 24, 2021, 12:36am

Try this:

MATCH (m:Movie)<-[:ACTED_IN]-(i:Person))
where id(m1) = 9
with m1, collect(id(i)) as person1
MATCH (m2:Movie)<-[:ACTED_IN]-(j:Person)
where id(m2) = 10
with m1, person1, m2, collect(id(j)) as person2
with m1, m2, apoc.coll.intersection(person1, person2) as diff, 
apoc.coll.union(person1, person2) as u1
return m1.name as movie1, m2.name as movie2, (toFloat(size(diff))/ toFloat(size(u1))) as jaccard

vnjogani · December 24, 2021, 6:53am

This worked! Thank you :)

vnjogani · December 24, 2021, 6:55am

This also worked, and it makes sense -- albeit it's a bit counterintuitive coming from SQL. Readability-wise, the below solution looks a bit better to me but do you know which would be more efficient?

Topic		Replies	Views
Combine several queries Newbie Questions cypher , union	2	329	September 4, 2020
Basic intersection of sets using Cypher Cypher	1	344	June 10, 2020
Problems with complex hierarchic query in Cyper Cypher	10	1089	October 11, 2018
Similar Nodes Cypher	4	754	January 30, 2020
How to use Jaccard similarity algorithm in neo4j to find the similar nodes Procedures & APOC cypher	17	4342	January 17, 2019

August Summer Fun!

Complex SQL-like queries in Cypher

Related topics