Why do these 2 queries not return the same result

Hello everyone,

I am currently trying to write basic Cypher queries on the Paradise Papers database. I wrote 2 queries that do not return exactly the same result, and I don't see why. Could someone clarify ?

Basically, I want to get all Entities that have as officers both people with a name containing "Jo" and people with a name containing "Bo"

QUERY 1 :

MATCH (officer1:Officer)-[:OFFICER_OF]->(entity:Entity)<-[:OFFICER_OF]-(officer2:Officer)
WHERE officer1.name CONTAINS 'Bo' AND officer2.name CONTAINS 'Jo'
RETURN DISTINCT entity
ORDER BY entity.name;

which returns 1203 records after 2357 ms

QUERY 2 :

MATCH (officer1:Officer)-[:OFFICER_OF]->(entity1:Entity)
WHERE officer1.name CONTAINS 'Bo'
WITH COLLECT(DISTINCT entity1) AS entities1
MATCH (officer2:Officer)-[:OFFICER_OF]->(entity2:Entity)
WHERE officer2.name CONTAINS 'Jo'
WITH entities1, COLLECT(DISTINCT entity2) AS entities2
WITH [entity IN entities1 WHERE entity IN entities2] AS commonEntities
UNWIND commonEntities AS commonEntity
RETURN commonEntity
ORDER BY commonEntity.name;

which returns 1226 records after 349 ms

Why do these 2 queries not return the same result ?
Also, any idea why the second one is faster ? It seemed to me that the first one would be faster

How did you get your data? I downloaded the dump file and created a new DB. The relationship types in my data are all lower case. The figures in the GitHub page do show the schema with all upper case relationship types.

If I had to guess, I would say that the dataset must have some officers' names that contain both 'Bo' and 'Jo.'

The MATCH statement has TRAIL semantics: the same relationship is never traversed twice in the same statement. In the first query, you cannot have officer1 and officer2 being the same entity. Somebody with a name with both substrings (like: Bond Jolin) could appear as officer1 and as officer2, but the Trail semantics prevents it.

In the second query, you split the MATCH statement into two, so you can have Bond Jolin matching twice, as officer1 and as officer2.

For the performance part of your question, you should always share the query plan if you want somebody to help.

The cause of the difference is because the second query is not eliminating the cases when the officer has both 'Bo' and 'Jo' in their name. This will cause the Officer to show up in both of your individual queries, and since they point to the same entity, you get the entity erroneously included in your 'commonEntity' result.

This will not happen in the first query, since neo4j will not traverse the same relationship in a given match statement.

If you are interested in how I figured it out, I used the following query to identify which entities were different between the two groups:

call {
    MATCH (officer1:Officer)-[:officer_of]->(entity1:Entity)
    WHERE officer1.name CONTAINS 'Bo'
    WITH COLLECT(DISTINCT entity1) AS entities1
    MATCH (officer2:Officer)-[:officer_of]->(entity2:Entity)
    WHERE officer2.name CONTAINS 'Jo'
    WITH entities1, COLLECT(DISTINCT entity2) AS entities2
    WITH [entity IN entities1 WHERE entity IN entities2] AS commonEntities
    UNWIND commonEntities AS commonEntity
    RETURN collect(id(commonEntity)) as entities1
}
call {
    MATCH (officer1:Officer)-[:officer_of]->(entity:Entity)<-[:officer_of]-(officer2:Officer)
    WHERE officer1.name CONTAINS 'Bo' AND officer2.name CONTAINS 'Jo'
    RETURN collect(DISTINCT id(entity)) as entities2
}
with
[i in entities1 where not i in entities2] as list1, 
[i in entities2 where not i in entities1] as list2
return list1, list2

As you can see, the smaller query had no entries that were not in the larger query's result. I then looked at one of these entities to understand what officers it was related to. Taking the first one, you see the entity was only related to one officer. This officer has both "Bo" and "Jo" in its name, explaining why it was in both sets of the larger query.

The entity did not show up in the smaller query because the pattern would not return a result where Officer1 equals Officer2, since it would require neo4j to traverse the same 'officer_of' relationship, which is not allowed.

1 Like