The Power of Subqueries in Neo4j 4.x

![](upload://jNBIp3yfU1C0p6pkzeNrZsC3Ocl.jpeg)Matryoshkas by Iza Gawrych on Unsplash

Nest your Cypher statements with subqueries!

Introduction

Neo4j 4.1 came out in June with the final piece required to make subqueries work well in Neo4j — So I wanted to use the opportunity for a deep dive into the topic.

For those of you who prefer to watch than read, I did a Twitch session on this topic last week. We have regular schedule of live streaming, do check out the calendar. Follow us there or YouTube to be notified when we go live!

I have to start with a disclaimer: In all the years writing SQL with a variety of databases, I have always disliked subqueries. I felt that they made queries harder to read, understand, and to reason about.

History

That’s why I always vetoed subqueries in Neo4j’s Cypher, and rather, pushed for a query pipeline that passes data from one query part to the next. Much like a Unix shell pipe that connects small tools, each one that does one job well with each other, like Lego blocks to form a more complex process.

In Cypher you achieve this with the WITH statement, allowing you to chain query parts together. WITH has the same semantics as the final RETURN, you can select columns, rename, aggregate, filter, paginate and sort your data. And all that WITHIN each query multiple times.

That is also the reason we never needed a HAVING keyword in Cypher, as it is just glorified post-filter after your result selection. You can have as many of those intermediate result-processing-passing steps in your Cypher query as you want.

Here is a quick example. Get me the cast of the movies with the highest ratings in our dataset.

// this is the first query part
MATCH (m:Movie)<-[r:RATED]-()
// this is the inline processing + passing of data
WITH m, avg(r.rating) AS rating
ORDER BY rating DESC LIMIT 5
// this is the next query part after
MATCH (m)<-[:ACTED_IN]-(a:Person)
RETURN m.title, collect(a.name) AS cast, rating

That’s also the approach you can use for handling many of the “traditional” subqueries. Built-in list processing, and an exists( (n)-[:REL]->(m) ) expression for patterns manages other use-cases.

Subqueries

However, subqueries support two use-cases that were not so easy to handle before.

One was a long-time requested feature called post-union-processing. A UNION (ALL) construct combines multiple independent queries into one result.

MATCH (:Person) RETURN 'person' AS type, count(*) AS count
UNION ALL
MATCH (:Movie) RETURN 'movie' AS type, count(*) AS count

But UNION itself is not a full query, i.e. you can’t do anything else except combine those queries, and have their (identically named) columns returned to the caller in any order. So in the past, you had to do any post-processing client-side. (or with apoc.cypher.run()).

Neo4j 4.0 introduced subqueries that addressed this issue. With the new CALL {} syntax, you can wrap a statement part as a subquery, and its result and cardinality will be available in the subsequent parts.

MATCH (q:Question)
CALL { RETURN 42 AS answer }
RETURN q.text AS question, answer

This calls the subquery which returns 42 as an answer, which is then available in the outer statement, for each question found.

To apply this to the UNION in question, you would wrap that statement as a subquery, and then can filter, sort, paginate (or do whatever) with the results.

//UNION wrapped in subquery
CALL {
MATCH (:Person) RETURN 'person' AS type, count(*) AS count
UNION ALL
MATCH (:Movie) RETURN 'movie' AS type, count(*) AS count
}
// post union processing
WITH * WHERE count > 100
RETURN type, count ORDER BY count DESC LIMIT 5

NOTE
Please note that the subquery functionality in Neo4j 4.0 was not able to access variables defined outside of their scope (non-correlated) . That has only changed in 4.1, see below.

My colleague Mark wrote a great developer guide on subqueries and Luanne Misquitta from our partner GraphAware dove into this in a full blog post on post-union-processing.

Existential Subqueries

Existential subqueries are a special construct, a partial subquery to test the existence of certain graph patterns. Before, you could express patterns with the existsfunction/predicate, which might get deprecated.

Now with existential subqueries the syntax is one of these three:

  • WHERE exists { (node)-[:REL]->…​(:Label) }, which is a shorthand for
  • WHERE exists { MATCH (node)-[:REL]->…​(:Label) }, which can be extended with WHERE
  • WHERE exists { MATCH (node)-[:REL]->…​(:Label) WHERE expression …​ }

It cannot contain RETURN, WITH, aggregations or other operations.

NOTE
Currently the existential subquery cannot be used as an expression, I think that’s a bug and will be fixed.

Navigating along sub-trees

When fetching data that contained many disparate sub-trees, the WITH and aggregation approach became a bit cumbersome, as your query was not just a pipe for processing, but a tree of nested sub-parts.

For example, if you want to get the co-actors of an actor in their shared movies, but also at the same time, the directors and their movies in a single query, you’re navigating two very distinct subtrees of your data. This was possible with multiple (OPTIONAL) MATCH expressions and incremental aggregation of map data structures, but not trivial.

(actor)-->(movie)<--(co-actors)
(actor)-->(movie)<--(director)-->(other-movies)
![](upload://oq2UrJiNLhcV4LRRXD0wZgOK6J1.png)Actress Helen Hunt with the two subtrees of co-actors and directors

In Neo4j 3.1 Andres Taylor, the father of Cypher, sneaked in two really cool features that were inspired by GraphQL and made this kind of querying so much easier.

First there are pattern comprehensions. Those are like list comprehensions, but allow an expression to use a graph pattern (with new identifiers) to be filtered, and then have an expression applied to each element [ (pattern…​) WHERE filter | expression].

These two return equivalent results:

MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
OPTIONAL MATCH (m)<-[:ACTED_IN]-(coActor:Person)
WHERE coActor.name contains 'T'
RETURN a.name, collect(coActor.name) AS coActors

and

MATCH (a:Person)-[:ACTED_IN]->(m:Movie)
RETURN a.name,
  [(m)<-[:ACTED_IN]-(coActor:Person)
WHERE coActor.name contains 'T' | coActor.name] AS coActors

Which is very cool as it doesn’t change the cardinality of your query like aMATCH would, counteracting with an aggregation like collect or count.

And then map projections. These can take a map-like element (map, node, relationship), and extract attributes in a concise syntax into a map again:
elem {.foo, .bar, .* , answer: 42}.

So you can sub-select the parts you are interested in, and then with the regular key:expression syntax you can start nesting the two together.

So our second “tree” can be expressed as:

[ (m)<-[:DIRECTED]-(d:Person)
| d { .name, movies:
[ (d)-[:DIRECTED]->(m2) | m2 {.title} ]
] AS directors

These features are especially useful if you want to regularly query nested structures with defined sub-tree selections, like in any object-graph-mapping tool, such as our GraphQL integration or Spring Data Neo4j.

So I wrote up this concept as a dedicated blog post a while ago:

Loading Graph Data for An Object Graph Mapper or GraphQL

The drawback is that those features are not optimized well by the query planner, and also don’t offer support for sorting and pagination. You can paginate by just applying a slice [0..10] after the fact, but that is only applied after the full comprehension has been computed. Also, sorting has to be simulated with a user defined function in apoc apoc.coll.sortMaps().

Full Subqueries

This is where full (or correlated) subqueries come in, which were added in Neo4j 4.1. These subqueries can now also access identifiers/variables from the outer scope, but those need to be declared explicitly at the beginning of the subquery (i.e. using WITH a,b,c).

I personally think that syntax is not optimal — it would have been nicer to align subquery and procedure call syntax, and treat them like parameters. Such a subquery can contain a full Cypher statement with all clauses and operations.

Here is our full example:

MATCH (a:Person)-[:ACTED_IN]->(m)
CALL {
WITH a,m
MATCH (m)<-[:ACTED_IN]-(co:Person)
WHERE a <> co AND co.name contains 'T'
WITH distinct co LIMIT 10
RETURN collect(co.name) AS coactors
}
CALL {
WITH m
MATCH (m)<-[:DIRECTED]-(d:Person)-[:DIRECTED]->(m2)
WITH d, collect(m2.title) AS movies
RETURN collect(d {.name, movies:movies}) AS directors
}
RETURN a.name, m.title, coactors, directors
![](upload://pRB8ioNpdPSTRbHBNor0fD6qdGc.png)Here is one row where both subqueries return a result, unfortunately a not great movie.

WARNING

The cardinality of a subquery can affect the outer query. If it doesn’t return any rows, the outer query will not return any rows. The opposite is also true, if your subquery returns multiple rows, then that will multiply the cardinality of your outer query. That’s why you should either use OPTIONAL MATCH in your subquery and/or a pure (single) aggregation with collect, so you always get one row from a subquery (except if you want more).

You cannot shadow existing identifiers. If you want to return something that was passed in, you have to rename it. Identifiers returned from a subquery are named as is. If you have an expression, the name of the identifier will be the same, e.g. n.value + 5, so make sure to always alias them properly, even if it’s not enforced (which is, in my humble opinion, an oversight, in WITH we enforced it back then).

Subqueries can be nested, and are planned better (like regular query parts) by the query analyzer. Within subqueries, you can now also use sorting and pagination, so our workarounds for GraphQL and Spring Data Neo4j will not be necessary any longer, starting with this version.

MATCH (m:Movie)
RETURN m { .title, actors: apoc.coll.sortMaps(
[(m)<-[:ACTED_IN]-(a:Person) | a {.name}], ["name"]
)[0..10]} AS movie

can become:

MATCH (m:Movie)
CALL {
WITH m
MATCH (m)<-[:ACTED_IN]-(a:Person)
WITH a ORDER BY a.name DESC LIMIT 10
RETURN collect(a {.name}) AS actors
}
RETURN m { .title, actors: actors }

Personally, I still like the pattern comprehension syntax much more. Perhaps at some point it can be extended with ORDER BY and LIMIT semantics as syntactic sugar for subqueries.

Updates

You can also do updates in a subquery. Remember that the subquery is executed each time for an outer row (with the same cardinality), so watch out for that (e.g. use MERGE instead of CREATE).

If you want to use subqueries for conditional updates, you should always return an aggregation, to make sure to not stop the outer query.

MATCH (m:Movie)
CALL {
WITH m
// conditional execution
WITH m WHERE exists(m.genres)
UNWIND m.genres AS genre
MERGE (g:Genre {name: genre})
MERGE (m)-[:IN_GENRE]->(g)
RETURN count(*) AS c
}
RETURN m, c

Subqueries in Neo4j Fabric

Finally, subqueries are also used in Neo4j’s sharding and federation approach — “Neo4j Fabric” — for delineating query parts that are meant to execute on different databases.

Here you can use subqueries together with the USE keyword to direct a query part to a certain database.

For example (from the Fabric documentation):

CALL {
USE movies.moviesUSA
MATCH (movie:Movie)
RETURN max(movie.released) AS usLatest
}
CALL {
USE movies.moviesEU
WITH usLatest
MATCH (movie:Movie)
WHERE movie.released = usLatest
RETURN movie
}
RETURN movie

This uses a fabric database named movie with two mounted shards/databases called moviesUSA and moviesEU.

References

Happy querying!

![|1x1](upload://6w7HOLoKuTDtEXRteNiYA53kW94.gif)

The Power of Subqueries in Neo4j 4.x was originally published in Neo4j Developer Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

2 Likes