Better to use COLLECT subqueries or put everything at the root?

Hi,

Personal background: I'm mostly a front-end React dev, but I started using Neo4j on a personal project. I have a basic understanding of SQL/Cypher syntax but largely ignorant about nuances of perf implications, especially with Cypher.

I have an app that uses movie data and I want to show the cast for a movie, as well as any other roles the actors from the current movie portrayed in other titles that the current user has watched. So in this example I've seen 4 movies, The Matrix, Memento, John Wick 1 & 2. I have two general approaches for the query:

First approach: "Root" query (without COLLECT subqueries)

MATCH (user:User)
MATCH (title {id:"603", type: "movie"})<-[:IN_TITLE]-(credit:Credit)<-[:PORTRAYED]-(person:Person)
MATCH (character:Character)-[:WAS_PORTRAYED]->(credit)
OPTIONAL MATCH (user)-[:WATCHED]->(relatedTitle)<-[:IN_TITLE]-(relatedCredit:Credit)<-[:PORTRAYED]-(person)
WHERE relatedTitle <> title
OPTIONAL MATCH (relatedCharacter:Character)-[:WAS_PORTRAYED]->(relatedCredit)
RETURN title, credit, person, character, relatedTitle, relatedCredit, relatedCharacter

With the "root" query, I get a record for each related role a person had, and each record includes the title (even though the title is the same on all records)

Response example:

image

Each row has "The Matrix" repeated, which is redundant. Also, since Keanu Reeves has more than one related role, "Keanu Reeves" is repeated once as well. However, writing queries this way seems a lot more "natural", i.e. it feels more like the examples in the docs and it feels more like how Cypher is intended to be used.

Second approach: "Nested" query (with COLLECT subqueries)

MATCH (user:User)
MATCH (title {id:"603", type: "movie"})
WITH *, COLLECT {
    MATCH (title)<-[:IN_TITLE]-(credit:Credit)<-[:PORTRAYED]-(person:Person)
    MATCH (character:Character)-[:WAS_PORTRAYED]->(credit)
    WITH *, COLLECT {
        MATCH (user)-[:WATCHED]->(relatedTitle)<-[:IN_TITLE]-(relatedCredit:Credit)<-[:PORTRAYED]-(person)
        WHERE relatedTitle <> title
        MATCH (relatedCharacter:Character)-[:WAS_PORTRAYED]->(relatedCredit)
        RETURN {title: relatedTitle, credit:  relatedCredit, character: relatedCharacter}
        ORDER BY relatedTitle.popularity DESC
    } as relatedRoles
    RETURN {credit: credit,character: character, person: person, relatedRoles: relatedRoles}
    ORDER BY credit.order ASC
} as cast
RETURN title, cast

The "nested" approach doesn't have this duplication though. It returns 1 record, which has a property for title and cast, which each have properties for related roles (or empty array if none are found). There is no redundant data, but the query looks weird(?) and I didn't see any examples like this in the docs.

Response example:

{
  "title": "The Matrix",
  "cast": [
    {
      "person": "Keanu Reeves",
      "character": "Thomas A. Anderson / Neo",
      "relatedRoles": [
        {
          "title": "John Wick",
          "character": "John Wick"
        },
        {
          "title": "John Wick: Chapter 2",
          "character": "John Wick"
        }
      ]
    },
    {
      "person": "Laurence Fishburne",
      "character": "Morpheus",
      "relatedRoles": [
        {
          "title": "John Wick: Chapter 2",
          "character": "Bowery King"
        }
      ]
    }
  ]
}

This is a lot easier to consume for my application as the shape is very similar to what I ultimately want to send to the UI, as opposed to having to iterate through all the records in the root approach and build the response object up that way. They both appear to be roughly the same performance. The subquery one seems more complicated to read, but easier to consume.

I know there might not be a "right way" but I'm curious if any of this is a code smell or if I should be thinking about these problems differently. Am I doing something weird?

I think which approach is appropriate depends on your needs. The tabular form may be better if you are importing the data into relational database or performing analytics or something similar. The nested structure would be better if you want you are representing a domain object that you are persisting into a NoSql or document database, or exchanging data with another system, etc.

If you don't mind, I have a couple comments on your queries. In both cases, you have started your query with a match on all User nodes. This is followed by two match statements that are not even correlated with the User results. This will create a Cartesian product between the rows of User nodes (1st match) and the rows from the subsequent two match. You can see this in the "explain" plan diagram for this query.

The user results from the first match are not even used until the optional match. This ordering and structure of your query results in a lot of rows extra rows being created until filters out. I believe you can drop the match (user:User) statement and get the same results if you add the "User" label to the 'user' node on line 4.

In general, to avoid Cartesian products, you don't want to just chain match statements together if they are not correlated. User collect subqueries is one technique to avoid this.

I suggest you do a little reading on how the results from one match affect the results of subsequent results, so you can learn how to avoid Cartesian products.

Here is the explain plan for your first query. You can see the Cartesian product being created in the plan.

Thank you very much for your comments! I did not know about the interaction between consecutive MATCH statements.

In the real queries I'm matching on a specific user ID but when I copy/paste these into the neo4j browser I was removing the ID because there's only one user in development (me) and I didn't know how to put params into the browser. (I have since figured that out)

Anyway, I made the changes you suggested and found a reduction of about 25x in db hits (54k -> 2k) during a PROFILE. Thank you!

1 Like