Why does a single label massively alter my query plan?

I have the following query which runs very slowly:
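(The query body below is reconstructed from the first of the three variants profiled further down in the thread:)

MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id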

However, removing a single label from the left-most matched node massively increases the speed, because it prevents a cartesian product:
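(Presumably something like the following, with one of the two vi labels dropped; which label was removed isn't stated:)

MATCH (vi:VI_PS_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id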

I don't understand why the label makes such a (terrible) difference to the query plan.

To avoid the cartesian product, you are better off constraining the labels separately afterwards and combining those constraints, e.g.:

MATCH (vi)
WHERE vi:first_label AND vi:second_label
RETURN vi;

But obviously add the rest of your query around this. There is an article on this somewhere which probably explains it better; I will try and find it.

Found it :-) How do I report on nodes with multiple labels - Knowledge Base. Hope this helps. Let me know if you have any more questions.

Hi, thanks for the quick response. I tried the suggested change, which appears to have no effect on the query plan.

The following 3 variants produce exactly the same query plan:

PROFILE
MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id

PROFILE
MATCH (vi)-[:VI_PH]->(ph)
WHERE vi:VI_PS_xxxx AND vi:VI_PH_xxxx AND ph:PH_VI_xxxx
RETURN count(DISTINCT ph) AS PH_Id

PROFILE
MATCH (vi) WHERE vi:VI_PS_xxxx AND vi:VI_PH_xxxx
WITH vi
MATCH (vi)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id

Neo4j 3.5.3 Enterprise

Also, although I'm not familiar with the data model, you might be able to further improve this performance. In both profiles you will see an Expand(All) block, which basically has to iterate over all the relationships for each starting node. For example, if you have 10 nodes that match the first MATCH and each has 100 relationships of type :VI_PH, then it is 10 x 100 operations to count.

However, there is metadata on each node holding a precomputed value for the number of relationships by type and direction. For example, if a node has 5 outgoing :VI_PH relationships and 12 incoming :VI_PH relationships, this is recorded in the metadata and so does not require you to iterate over each relationship and count them one by one. To access the metadata you would use something like RETURN size( (:VI_PS_?????)-[:VI_PH]->() );. When using this shortcut the PROFILE will include a GetDegree function, for example:

Projection
15,636 db hits
n, size( (n)-[:IS_ASSIGNED_TO]->() )
{size( (n)-[:IS_ASSIGNED_TO]->() ) : GetDegree(Variable(n),Some(RelTypeName(IS_ASSIGNED_TO)),OUTGOING)}

See How do I improve the performance of counting number of relationships on a node - Knowledge Base for more details
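Applied to your labels, a rough sketch would be something like the block below. Note that size() with only the relationship type and direction is what lets the degree metadata be used, so this counts all outgoing :VI_PH relationships from each vi rather than distinct ph nodes; it is only a drop-in replacement if those numbers are known to be the same in your model (the relCount alias is just illustrative):

MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)
// sums the outgoing :VI_PH degree of each vi via the degree store, no relationship iteration
RETURN sum( size( (vi)-[:VI_PH]->() ) ) AS relCount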

The query you're executing is using COMPILED runtime. I'm curious if prefixing it with CYPHER runtime=SLOTTED or CYPHER runtime=INTERPRETED would get you a better plan. Can you let us know?
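For example, something like:

CYPHER runtime=SLOTTED
PROFILE
MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id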

The plans for SLOTTED and INTERPRETED are identical except that the Expand(Into) portion has 112,173,677 db hits instead of 8,062,221 db hits for COMPILED

Thanks, so it's not a compiled runtime issue.

You are using an older version of 3.5.x. Are you able to test this out on a copied and upgraded db? 3.5.14 is the latest patch release.

Ok, just tried this with 3.5.14 with exactly the same results

Can you also check if the results are the same on 3.5.14 when prefixed with CYPHER runtime=SLOTTED or CYPHER runtime=INTERPRETED?

Just tried it - same results with both.

For Cypher to think a cartesian product and Expand(Into) is the most efficient plan, it must think the cardinality of the combined labels is very low. It could think this if the individual label cardinalities are relatively low and it has no idea of the correlation between them. For example, if the cardinality of (n:A) is 10% of all nodes, and the cardinality of (n:B) is also 10%, we would model the cardinality of (n:A:B) as 1% (i.e. 0.1 * 0.1 = 0.01). But if many :A are also :B, then this calculation is very inaccurate; with full correlation between :A and :B we should use the minimum of the two, which in my example would be 10% instead of 1%.

Could you let us know what the relative cardinalities are? Is there a correlation between nodes of the two types?
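For example, rough counts like these would do:

MATCH (n) RETURN count(n);
MATCH (n:VI_PS_xxxx) RETURN count(n);
MATCH (n:VI_PH_xxxx) RETURN count(n);
MATCH (n:VI_PS_xxxx:VI_PH_xxxx) RETURN count(n);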

I have two more suggestions for making your query faster. One is to try your trick with only one label, and then before the RETURN, add an extra WITH vi, ph, 1 AS ignore WHERE vi:VI_PH_xxxx RETURN DISTINCT(ph).
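Spelled out (assuming :VI_PS_xxxx is the label kept in the pattern; either could stay), that would be roughly:

MATCH (vi:VI_PS_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
WITH vi, ph, 1 AS ignore   // the extra WITH keeps the second label check as a separate filter
WHERE vi:VI_PH_xxxx
RETURN count(DISTINCT ph) AS PH_Id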

The other suggestion is based on the fact that you seem to only want to know the number of unique ph for which there exists any relationship of that pattern, but are not interested in the count of relationships themselves. For this there is a specific syntax:

MATCH (ph:PH_VI_xxxx)
WHERE (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph)
RETURN count(DISTINCT ph) AS PH_Id

Here are the cardinalities:

All nodes              13,601,888
:VI_PH_xxxx                   380
:VI_PS_xxxx                   480
:VI_PH_xxxx:VI_PS_xxxx        380

Your last suggestion (putting the relationship pattern in the WHERE clause) seems to massively improve the query plan. I will see if we can generalise this approach in our query builder code.

I'm glad the suggestion helped. I noticed also that your labels are strongly correlated: every :VI_PH_xxxx node is also a :VI_PS_xxxx node. With your counts, the independence assumption would estimate roughly 480 * 380 / 13,601,888, or about 0.013 nodes, carrying both labels, when the true number is 380, which is why the planner thought the cartesian product would be cheap. The planner certainly does not make good use of that correlation. We have plans to teach it that trick (or rather, to maintain the necessary graph statistics to make that observation), but I'm not sure when that will be done.

In the meantime, I'm very happy the last suggestion helped you get a faster query. The phrase 'massively improve' sounds like music to my ears!