Sub-optimal query plan?

Hello,

We're using neo4j 4.4.23 and have the following simplified model (I can't share the real model but the structure is the same):

node labels: Person { name }, Location, Food
relationship types: VISITED, ATE
indexes: CREATE BTREE INDEX idx_person_name FOR (p:Person) ON (p.name)

We use the following two queries of similar structure to audit our data:

Q1 (distinct locations visited by different persons with the same name):

MATCH (p:Person)-[:VISITED]->(l:Location)<-[:VISITED]-(p2:Person)
WHERE p.name=p2.name and ID(p) <> ID(p2)
RETURN COUNT(distinct l)

Q2 (distinct foods eaten by different persons with the same name):

MATCH (p:Person)-[:ATE]->(f:Food)<-[:ATE]-(p2:Person)
WHERE p.name=p2.name and ID(p) <> ID(p2)
RETURN COUNT(distinct f)

Q1 takes some time to complete (15-20 minutes) but Q2 crushes the database. Oddly, they have different query plans despite having the same structure and referencing the same indexed property. Q1 scans persons once and then follows relationships from p1 to p2 whereas Q2 double scans all persons, follows relationships on both sides until f and finally does a join:

Note: I had to edit the label names and relationship types since I can't share the real data model.

Some data statistics :

  • about 15M Person nodes
  • about 70M Location nodes
  • about 20M Food nodes
  • about 30M VISITED relationships
  • about 50M ATE relationships

Since the orders of magnitude are the same (tens of millions), I would have expected the execution plans to be the same, but maybe the differences between the numbers are enough to trigger different execution plans.

But it also looks to me that Q2's execution plan is sub-optimal (scanning parties two times and then doing a join instead of following relationships), is it really? I have little knowledge of neo4j's query execution mechanics.

We tried to recompute database statistics and to manually replan cached execution plans but it did not help. We also tried the dp query planner. Any ideas of something we would be missing or anything we should try? How can we hint the query planner to choose a plan for Q2 similar to Q1's?

Thank you for your help.

Try this:

MATCH (p:Person)-[:VISITED]->(l:Location)
WITH distinct p.name as persons,  count(distinct l.location) as place
RETURN persons, place order by place desc

MATCH (p:Person)-[:ATE]->(f:Food)
WITH distinct p.name as persons,  count(distinct f.food) as food
RETURN persons, food order by food desc

Hello, thank you for your response.

These queries seem not to return the total number of distinct locations visited (or foods eaten) by different persons with the same name as Q1 and Q2 do. They seem to return counts of distinct locations visited (or foods eaten) by groups of people with the same name instead. And some locations/foods may be shared by different groups of people with the same name, so even adding a WITH sum(place) AS total does not give the same result as Q1.

Hi @markarasev ,

This is an interesting topic. Just in order to get an idea on your data, what's the result of:

MATCH (p:Person)
WHERE EXISTS{(pro:Person) WHERE pro.name = p.name and pro <> p } 
WITH distinct p.name as name
RETURN COUNT(name)

Hello @bennu_neo,

The result of this query is 1197831.
If I understand correctly, it is the number of names shared by at least two Persons, right?