Negate Variable Length Path Query


(Rcfro2) #1

dataset *very similar to Movie dataset provided by Neo4j:

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person),
(x) - [:ACTED_IN] -> (m)
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x) and p<>x
return distinct x.name
match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person),
(x) - [:ACTED_IN] -> (m)
where not (p) - [:ACTED_IN] -> (m) and p<>x
return distinct x.name

These two, given the path, I would expect to be the same but the first query returns a smaller set of results. Why ?

Specifically saying WHERE NOT (p) - [:ACTED_IN] -> (m) should be equivalent to saying WHERE NOT (p)-[:ACTED_IN]->()<-[:ACTED_IN]-(x)


(Michael Hunger) #2

Not it isn't because the predicate is executed per row and not for all possible movies.

So in each row you have a concrete set of: p,x,m which the predicate is executed against.
While in the first case you have a general expression if they haven't ever jointly acted in any movie, not just m


(Rcfro2) #3

But what's the difference there? The ALL possible movies are only from a set of movies in which they acted together. First one is where NOT any movie p and x acted in which could be some movie, m, right?

In the second case, why isn't that held true for movies m ? It again refers to movies that X acted in or, in other words, movies that X acted in that P acted in as well. Can you please clarify this? this seems oddly confusing

In short can you please provide a more clear explanation? Maybe a tangible example of where X acted in Movies M and then where not P acted in those movies VS where not P and X acted in movies together? those statements sound equivalent.

What's the question we would ask that would make us arrive at one implementation vs another?

Also, the two queries below are replicas of that use case and they give the same exact result:

match (p:Person{name:'Keanu Reeves'}) - [:ACTED_IN] -> (x)
where not (:Person{name:"Hugo Weaving"})-[:ACTED_IN]->(x)<-[:ACTED_IN]-(p)
return collect(distinct x.title)

vs

match (p:Person{name:'Keanu Reeves'}) - [:ACTED_IN] -> (x)
where not (:Person{name:"Hugo Weaving"})-[:ACTED_IN]->(x)
return collect(distinct x.title)

why is that ?


(Michael Hunger) #4

Because it's not a set if m is bound to a single movie. Then only that bound value will be checked against, and the predicate will be executed per line, not globally.

If you want to check against the whole movie set of a contact you have to use a list quantor like none, all, any
with a inline pattern (pattern comprehension) or size with a pattern comprehension.

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person)
where p<>p AND 
  none(m IN [(x) - [:ACTED_IN] ->(m) | m] where (p) - [:ACTED_IN] -> (m)] 
return distinct x.name

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person)
where p<>p AND 
  size( [(x) - [:ACTED_IN] ->(m) where (p) - [:ACTED_IN] -> (m) | true]  )  =  0
return distinct x.name

(Rcfro2) #5

Why is m bound to a single movie, when the following is a list of all movies ? **not list, i mean rows. sorry.

match (p:Person{name:'Tom Hanks'})-[:ACTED_IN]->(m)
return m.title

what's the reasoning behind this? and to what extent and in what way is the bounding to some arbitrary movie determined?

And is it to say the general form of
(p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
implies that it negates all cases of when X and P act in a movie together similar to the check against the whole set query you provided?


(Andrew Bowman) #6

The query you provided does not return a list of movies. It returns a stream of records (or rows) of titles of movies that matched the pattern.

Unless you explicitly collect() something, or you use pattern comprehensions, or you otherwise invoke a function or procedure that results in a list, then you're not working with lists.

Let's go back to the queries in your original question.

First, a reference query whose results we may refer to later: Let's get all coactors of Tom Hanks:

match (p:Person{name:'Tom Hanks'})-[:ACTED_IN*2]-(coactor)
return distinct coactor.name

Again, this isn't a list...it's a stream of rows. Each row is an individual result, the coactor that was matched to in the pattern.

If you look at the results, you'll find that Charlize Theron is a coactor of Tom Hanks (if you perform some additional queries you would see that they co-acted in "That Thing You Do").

Now let's only take the first part of those queries, the MATCH but not the WHERE, and return the contact and the movie:

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person), 
(x) - [:ACTED_IN] -> (m)
return x.name, m.title

Again, there are no lists here. The return is a stream of rows, where each row refers to a separate person and movie node that matched the pattern.

Note some of the matches that may be near the top:

╒════════════════════════╤═════════════════════════════════╕
│"x.name"                │"m.title"                        │
╞════════════════════════╪═════════════════════════════════╡
│"Charlize Theron"       │"That Thing You Do"              │
├────────────────────────┼─────────────────────────────────┤
│"Charlize Theron"       │"The Devil's Advocate"           │
├────────────────────────┼─────────────────────────────────┤

Each line corresponds with a path found in the graph that fits the pattern, where each node variable refers to an individual node, not a collection.

Looking at JUST these two rows, we can see that indeed Charlize Theron is someone Tom Hanks has had contact with (2 :HAS_CONTACT hops). Charlize has acted in movies. Since this graph only has two movies she has acted in, both were matched.

So in the first row, x matched to Charlize Theron, and m, a movie she acted in, was matched to That Thing You Do. In the next row, we still have Charlize Theron for x, and m in this case is the other movie she acted in, The Devil's Advocate.

If we applied the relevant part of the WHERE clause in the first query here:
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
this would eliminate Charlize Theron, because this pattern exists in the graph, Tom Hanks has acted in something that Charlize Theron has acted in (as mentioned above...they are coactors through That Thing You Do). Both rows with Charlize Theron would be filtered out. The key thing to see is that m has nothing to do with that predicate (excluding anyone that has acted in something Tom Hanks has acted in).

However, if we only used the WHERE clause of the second query:
where not (p) - [:ACTED_IN] -> (m)
then only one row would be filtered out: the row with Charlize Theron and That Thing You Do, since Tom Hanks acted in That Thing You Do (but he didn't act in The Devil's Advocate). This predicate deals with m specifically. For the movie node matched by m, we're excluding the row if Tom Hanks acted in that movie.

Charlize Theron would remain, because Charlize Theron acted in a movie that Tom Hanks did not act in.


(Rcfro2) #7

Right - I didn't mean lists in the proper sense i.e. sense it's used to mean an array or set of items, just that it returns some number of items.

Back to the original question -
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

filters out The Devil's Advocate. Why? The pattern of P (Tom Hanks) acting in something that X (Charlize Theron) has acted in doesnt apply to The Devil's Advocate specifically. It is arbitrairly true that Tom Hanks and Charlize Theron have acted in a movie together, but not specifically The Devil's Advocate...so why is it removed?


(Andrew Bowman) #8

Remember, you're working with streams of rows. The movies are not isolated and alone. Each row at that point of the query consists of variables: p, x and m.

So when we say that rows are being filtered, we're not talking about filtering out just movies, or just people. We're talking about filtering the streams of rows. Each row is a set of one p, one x, and one m.

This predicate: where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x) is specific to the p and x variables, and has nothing to do with m. Tom Hanks is every p on every row (which isn't that interesting). For the two rows with Charlize Theron, since Charlize and Tom have coacted in something, those rows will be filtered out. That includes the m on each of those rows.

It may help to review this knowledge base article on pattern negation.


(Rcfro2) #9

Ok but doesnt the acted_in relationship refer to the thing that the relationship points to? It's odd that in the case of
match (p) - [:ACTED_IN] -> (l) <- [:ACTED_IN] - (x)
return distinct l.title

refers to all the movies that they both acted in, regardless of the variable L being there - that empty node referring to all the nodes that they both acted in, and the L acting as the variable for that reference.

In short, how does that line read in plain english?
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

I would expect it means - WHERE P AND X HAVE NOT ACTED IN ANY MOVIES TOGETHER. Is that false?

In fact, my main concern is that what you wrote here seems rather arbitrary and/or unspecific:
"For the two rows with Charlize Theron, since Charlize and Tom have coacted in something, those rows will be filtered out. "

Yes they have acted in something, but if each path is evaluated individually, then this negation would not apply to the Devil's Advocate it would seem...


(Andrew Bowman) #10

I think the root of the misunderstanding is the concept of rows. Rows consist of sets of variables (in this case p, x, and m). The Devil's Advocate is only here because it was a movie Charlize Theron acted in, and so is present for a single row with variables: p - Tom Hanks, x - Charlize Theron, m - The Devil's Advocate.

That is a single row in the stream. The filter applies to rows in the stream. This predicate:
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
means where this pattern doesn't exist. But for both of those rows with Charlize Theron as x, that pattern does exist, and while you COULD interpret it as "WHERE P AND X HAVE NOT ACTED IN ANY MOVIES TOGETHER", it's probably more correct to say "where p and x have not acted in something together", as there's no restriction for the label of the node in the middle (if this graph had plays, for example, or broadway shows, then it's possible there's a play or broadway show they acted in together).

But what it comes down to is that for each row (of p, x, and m) in the stream, such a pattern must not exist in the graph. But Charlize and Tom have coacted in something together, so those rows are filtered out. That's an important distinction to make. We're not filtering out the people (as in, separate from the movies). We're filtering out rows, because the people for the variables of those rows (Tom Hanks for p and Charlize Theron for x) do not pass the predicate in the WHERE clause. When those rows are filtered out, that will of course also include the m on each of those filtered out rows.


(Rcfro2) #11

"where p and x have not acted in something together" - I agree. But the question is this: why is this i.e. the "something" not evaluated on a per row basis? So lets take this row:

Charlize Theron - The Devil's Advocate

It's clear that this row does not fit the condition of "something" in which Tom and Charlize have acted together in.

The something, insofar as I understand variables and references to items or things or lists, should be evaluated on a per row basis as other predicates are in which we have no aggregation.

In short, I agree there is something in the number of rows returned in which Charlize and Tom have acted in together, so that holds true. But, on a per row basis, it is not the case they acted in that movie so, then, why does it apply? That's my real confusion. If evaluated per row, the empty node i.e. the "something" should refer to the thing both Tom and Charlize acted in e.g. The Devil's Advocate. But, according to what you have said, it doesn't. Why?


(Andrew Bowman) #12

Ah, okay.

You seem to be assuming that when we do:
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
that the blank node must be m, but that's not true.

If we did use this instead:
where not (p) - [:ACTED_IN] -> (m) <- [:ACTED_IN] - (x)
then we would get the same results as your second query.

Note the difference. In the first one we're just looking for the existence of a pattern through some node (no restrictions). In the second, that middle node must be the m node for the row.

The idea for both of these is: this pattern must not exist in the graph. So, as soon as we find a single instance of such a pattern, then we can stop looking and filter out the row.

As for the idea of the "something" (the "blank" node) not being evaluated on a per row basis, it actually is.

For each row in the stream, expansions are performed between p and x (it's up to the query planner on which one to start out, or if we expand both and node-hash-join in the middle) to see if they are connected by such a pattern, and if it finds that the there is a connection and the pattern exists, then the row is filtered out.

So there is a lot of work being done to fulfill that predicate, as the only way to ensure that such a pattern doesn't exist between the p and x nodes is to expand between those nodes for each row. (that's why I like the approach of collecting coactors first then using a WHERE NOT x in y kind of check, it avoids the need to keep expanding out to see if p and x are coactors)


(Rcfro2) #13

So these two are not equivalent:
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)
where not (p) - [:ACTED_IN] -> (m) <- [:ACTED_IN] - (x)

?

In the first case, removing all rows from the list if ANY row contains X or P (but in this case X, as P is in every row as we previously mentioned).

In the second case, it evaluates on a per row basis, such that it only removes rows where both X and P act in that movie, correct?


(Andrew Bowman) #14

Correct, those two are not equivalent, as there is a preexisting m on each row that would add a restriction to the pattern.

Again, these ARE being evaluated on a per row basis. The difference is in what variables and restrictions apply.

In the first predicate the only variables in play are p and x. If such a pattern exists (with no restriction on the node in the middle), then the row will be filtered out. This executes per row, so it will execute for both rows where p = Tom Hanks and x = Charlize Theron (and yes m is present there too and different on each of those rows), and both of those rows will be filtered out.

In the second predicate, p, m, and x are in involved. The pattern provided must specifically have the m node in the middle. This will execute per row, so it will execute for both rows where p = Tom Hanks and x = Charlize Theron, and since the pattern is only present for the row with m = That Thing You Do, THAT row will be filtered out, and the row with p = Tom Hanks and x = Charlize Theron and m = The Devil's advocate will remain.

Operations always execute per row, that's important to remember. The difference here is just in the pattern we're looking for, whether the node in the middle can be any node, or if it has to be the m node for the row.


(Andrew Bowman) #15

Maybe it will help if I show some alternatives. Here are some predicates we could use, starting with the two we've seen already and adding some other possibilities.

where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

where not (p) - [:ACTED_IN] -> (m) <- [:ACTED_IN] - (x)

where not (p) - [:ACTED_IN] -> (:Movie {title:'Cloud Atlas'}) <- [:ACTED_IN] - (x)

where not (p) - [:ACTED_IN] -> (:BroadwayPlay) <- [:ACTED_IN] - (x)

For all of these, the goal is just to ensure that such a pattern does not exist, given the restrictions we've added to the pattern. In the last two it's easy to see the restrictions here. For the one with m, that m exists for the row, so the restriction is on a specific node. For the "blank" node we're just looking for the pattern without any restriction on the connecting node in the middle.


(Rcfro2) #16

where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

again, why does it remove specifically the Devil's Advocate? I am sorry for being naive or not understanding here but the relationship or path seems to be
(p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

means that P and X if evaluated per Row, and the empty node or P refers to Tom, then the relationship between those 2 variables in the row must maintain in The Devil's Advocate. The only path I see is

Charlize (X) - [:ACTED_IN] -> The Devil's Advocate () <- [:ACTED_IN] - (p)

isnt the empty node refer to the thing Charlize acted in, which in this case the Devil's Advocate? It seems to be removed because the path of Charlize, Tom and The Thing You Do is the only thing where it applies.

Again, apologies for my misunderstanding.


(Andrew Bowman) #17

It does not specifically remove The Devil's Advocate. The Devil's Advocate gets filtered out because its row got filtered out, because the other values for that row (p = Tom Hanks and x = Charlize Theron) didn't pass the predicate.

Remember, the two rows that we're zeroing in on for this are these (projecting out just the name and title):

╒════════════════════════╤═════════════════════════════════╕
│"x.name"                │"m.title"                        │
╞════════════════════════╪═════════════════════════════════╡
│"Charlize Theron"       │"That Thing You Do"              │
├────────────────────────┼─────────────────────────────────┤
│"Charlize Theron"       │"The Devil's Advocate"           │
├────────────────────────┼─────────────────────────────────┤

(there is of course p in scope as well, but that's always Tom Hanks so that's not so interesting).

For these two rows, The Devil's Advocate is only on one of them.

By filtering based on this predicate:

where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x)

we find that Tom Hanks acted in something with Charlize Theron, so that pattern exists, and so each row with Charlize Theron for x will be filtered out, so both of those get filtered out (which happens to include the row with m = The Devil's Advocate). The english equivalent is "where Tom Hanks and x never acted together in something", and since Tom Hanks and Charlize Theron have acted together in something, all rows with x = Charlize Theron get filtered out.

If we used the other predicate:

where not (p) - [:ACTED_IN] -> (m) <- [:ACTED_IN] - (x)

The english equivalent is "where Tom Hanks and x didn't act together in movie m (which x HAS acted in)". This implies that the row (with x) will be kept if x acted in some movie that Tom Hanks didn't act in. Or, put even another way, it implies that for actor x, x will be filtered out completely ONLY if every movie they have acted in has been a movie where they've coacted with Tom Hanks. That gets evaluated per row. For the row where x = Charlize Theron and m = That Thing You Do, since they coacted together in That Thing You Do, that row gets filtered out. For the row where x = Charlize Theron and m = The Devil's Advocate, since Tom Hanks didn't act in that movie that row gets left in, so when you ask for the remaining actors at the end of the query Charlize Theron is one of those that returns in this query, but not the other.

What it comes down to is that by changing the predicate, it means you're completely changing what you're asking the queries to do. "filter out the contacts who have coacted with Tom Hanks in something" is completely different than "filter out the contacts who have coacted with Tom Hanks in every movie the contact has acted in".


(Andrew Bowman) #18

I hope the above has been helpful.

But let's try to finish off by fixing this.

Most of this mess is caused by this part of the match:
(x) - [:ACTED_IN] -> (m)

The above introduces another variable m into our rows, and we end up with a row per p/x/m combination (as you saw, Charlize Theron appeared on multiple rows, once per movie she acted in, and the same applies to all the other actors in the rows at that point, appearing multiple times, once per movie they acted in.

So not only does this inflate the number of rows, but it's also unnecessary. We don't need to match out to m to filter out based on whether or not the contact is a coactor of Tom Hanks.

All we need is this:

match (p:Person{name:'Tom Hanks'}) - [:HAS_CONTACT*2] -> (x:Person)
where not (p) - [:ACTED_IN] -> () <- [:ACTED_IN] - (x) and p<>x
return distinct x.name

Or, if we wanted to be more efficient and cut down on the expansions needed to figure out if they're coactors, we could do the same thing we did in the other thread and collect coactors and filter based on list membership:

match (p:Person{name:'Tom Hanks'})-[:ACTED_IN*2]-(coactor)
with p, collect(distinct coactor) as coactors
match (p) - [:HAS_CONTACT*2] -> (x:Person)
where not x in coactors and p<>x
return distinct x.name

(Rcfro2) #19

Yes, by the Devil's Advocate gets filtered out I meant that row gets filtered out...

OK and that's my issue - "Tom Hanks acted in something with Charlize Theron, so that pattern exists, and so each row with Charlize Theron for x will be filtered out, so both of those get filtered out (which happens to include the row with m = The Devil's Advocate)."

Why ? Actually, the full pattern only holds true for "That Thing You Do". The weird part. This other line you write doesn't make a ton of sense:
"where Tom Hanks and x never acted together in something", and since Tom Hanks and Charlize Theron have acted together in something, all rows with x = Charlize Theron get filtered out."

Why do you end up filtering all rows out in which Charlize Theron is X, and not filtering out every row in which the whole path exists, as it does with the row that contains the Devil's Advocate ?

This sentence you wrote would still hold true if you kept "The Devil's Advocate" --
"where Tom Hanks and x never acted together in something"

Where Tom Hanks AND x never acted together in something. With that AND statement, I would expect that both conditions hold true with an individual row, such as "The Devil's Advocate", but it's clearly not true. Something, being the movie, only has Charlize Theron as an actor/actress, not Tom Hanks.

What am I missing here?

Again, appreciate your help.


(Andrew Bowman) #20

Why ? Actually, the full pattern only holds true for "That Thing You Do".

There. That's all that's needed. Just the fact that any path exists with the restrictions given is all that's needed. The only restrictions for that path are just that p and x have such a path between them, and we aren't interested in what that node in the middle is (we don't care about any of its labels, its properties, or even if it's the same node as a previously matched variable, so it doesn't have to be m). Just that there exists any path adhering to the pattern provided (involving the end nodes of p and x) is enough to filter out the row.

By examining the graph we can determine that the middle connecting node is That Thing You Do, but the actual identity of the node, whether it's that node or The Devil's Advocate or not doesn't matter. Why doesn't it matter? Because we haven't applied any restrictions on the pattern to say that it matters in some way or is restricted in what that node can be. All that matters is that such a pattern exists in the graph, because that WHERE clause means that we don't want rows where that pattern exists: "exclude results where the nodes for x and p are related in this way" is basically what we're asking. It has nothing to do with m, because m isn't present in that pattern.

The pattern in the WHERE clause doesn't have to do with m at all. We could easily use a different predicate that also has nothing to do with p in addition to m:

WHERE NOT (x)-[:ACTED_IN]->(:Movie{title:'Cloud Atlas'})
now the restriction is on whether or not x was an actor in Cloud Atlas, and all rows where the actor acted in Cloud Atlas would be filtered out. Nothing to do with m, and nothing to do with p.

We could instead have a predicate that doesn't even deal with any of the variables at all:
WHERE NOT ()-[:ACTED_IN]->(:Movie{title:'Cloud Atlas'})
The restriction here is that there must not be any actors in Cloud Atlas (no restrictions on what that blank node is, only that there is a node that has an :ACTED_IN relationship outgoing toward Cloud Atlas). Nothing to do with p or m or x. Since Cloud Atlas is in our graph, and has actors, this pattern exists, this pattern will be tested for every row (it doesn't matter that this pattern isn't dependent on any variables at all, filter operations, like all operations, execute per row), and the pattern will still exist for every row, so all rows will be filtered out.

We could simply omit the pattern completely:
WHERE false
That would evaluate per row, and because the WHERE clause fails all nodes would be filtered out.

There is no requirement that the contents of a WHERE clause have to use any variable at all, or any pattern match. There is no requirement that a pattern match, if present, has to involve variables we've matched to previously.

Does that help?