What does WITH do? Graph Academy Question

I'm currently doing the course "Graph Data Modeling Fundamentals"(Graph Data Modeling Fundamentals | Free Neo4j Courses from GraphAcademy) and I'm very confused about what exactly happens to the data after a WITH is used. I've read the documentation from the Neo4j Cypher Manual, but it isn't really clicking for me. I've been working in Graph Academy with the "Movie" Database and this cypher code is given in the course to create language nodes from what was previously the .languages property.

MATCH (m:Movie)

UNWIND m.languages AS language

WITH language, collect(m) AS movies

MERGE (l:Language {name:language})

WITH l, movies

UNWIND movies AS m

WITH l,m

MERGE (m)-[:IN_LANGUAGE]->(l);

MATCH (m:Movie)

SET m.languages = null

As you can see, WITH is used multiple times in the above query, but what does it really do? What is the significance of the first variable (before the comma) and the second one (after the comma) and how do they become related after the clause is executed?

Assume the following data for the discussion. It consists of two movies nodes, each with a list of three languages.

create(n:Movie{languages:['english', 'french', 'arabic']})
create(n:Movie{languages:['english', 'german', 'spanish']})

  1. MATCH (m:Movie)

The match clause searches for all nodes with label "Movie". This will produce a result row for each Movie node. Sorry, I used 'n' in my queries instead of 'm' that is used in the question's query.

  1. UNWIND m.languages AS language

Property m.language is a list of languages. The unwind converts this list of languages into rows, one element of the language list per row. This is done for each Movie node in the result set. The other elements on each row are repeated for each language element. The result is a follows. Notice how the two result rows from above got converted into three rows each. The node 'n' corresponding to the language is repeated for each new row. I outputted the node's 'id', so you can see which rows are from the same node.

  1. WITH language, collect(m) AS movies

The 'WITH' clause is used to pass data between stages of a multi-stage query. In this case, we also have an aggregate function (collect) in the with clause, so the data is getting grouped by the properties outside the collect method's argument. In this case, the data is getting grouped by 'language', so the result will be a row for each different value of language and the collect will gather all the movie nodes that have that value of language. As you can see, the 'movies' column contains a list of movies. The list are the movie nodes that contained that language. The 'English' language has two movies nodes in its list, since each movie had 'English' as a language.

  1. MERGE (l:Language {name:language})

The 'language' and 'movies' values are now available to the 'match' statement since they were passed in the 'with' clause. This merge is creating a new 'Language' node, with the name property each to the value of language. In our example, this will create five language nodes, one for each row of the above result.

  1. WITH l, movies

The 'with' clause is passing the newly created Language node and the corresponding list of movies for this language. A 'with' clause is required between certain sets of clauses, such as a 'merge' followed by a 'match'. See how each row is one new 'language' node and has that language's corresponding list of movies that contain that language.

  1. UNWIND movies AS m

The unwind is taking the list of movie nodes on each row and unwinding them into rows. Each movie node from the list will also have its corresponding language node as part of the row. Only the English language node has multiple movies, so unwind its movies resulted in the first two rows. The other language nodes had only one element their movies lists, so only one row resulted from each language node.

  1. WITH l,m

The 'with' clause is passing the language node and movie node through to the next stage of the query. This will be the 'l' and 'm' columns from the above result set. This is now available for further processing.

  1. MERGE (m)-[:IN_LANGUAGE]->(l);

The 'merge' is creating a relationship between the 'movie' node and 'language' node for each row passed by the 'with'. Since the line is terminated with a ';', this cypher is complete.

Thank you for your awesome answer, I understand the function of the WITH clause now. I am now wondering though, in your last step,

  1. MERGE (m)-[:IN_LANGUAGE]->(l);

based on the graphic above it in step 6, it seems to me that 6 "m" nodes will be created and will point to whatever "l" nodes are appropriate even though we actually have only two movie nodes. Or the "m" nodes still have the unique ID's 15 and 16 from step 1 and therefore although there are 6 rows of 'm', there is really only two nodes and each node has 3 relationships?

Yes, these are still the same movie nodes matched earlier. They have been passed through the chain of 'with' clauses (either as a member of a list or individually following an unwind).

The output following the with l, m shows six rows, but with the same two movies.

The output following the MERGE (m)-[:IN_LANGUAGE]->(l)clauses is as follows: As you currently concluded, there are just two movies nodes, with three relationships each to language nodes. I included the id of the relationships so you can see there are six total relationships.

I don't know what happened. My original posting had numbered the steps, but now they are all labeled as '1'.

1 Like

So in the

WITH language, collect(m) AS movies

step, couldn't you ignore the collect function since the Merge clause that occurs afterwards already searches for duplicates nodes? So even though there would be two "english" under "language" it wouldn't matter once you use the MERGE to create the nodes?

You are correct again. The following query gave me the same results. This is because the 'unwind' keeps the association of the language's movie, so after the unwind you have rows that contain each language and its corresponding movie. The other key is the use of 'MERGE', as you already identified. Now, there may be a performance difference between the two queries if you have lots of movies and/or languages, because you are now relying on the merges to remove duplications.

unwind m.languages as language
MERGE (l:Language {name:language})
MERGE (m)-[r:IN_LANGUAGE]->(l)
return id(m), m, id(l), l, id(r) as relationship_id
order by id(m)

The ids are different, since I recreated the data: