Why does a single label massively alter my query plan?

I have the following query which runs very slowly:
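(The query body below is reconstructed from the first of the three variants profiled further down in the thread:)

MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id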

However, removing a single label from the left-most matched node massively increases the speed, because it prevents a cartesian product:
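(Presumably something like the following, with one of the two vi labels dropped; which label was removed isn't stated:)

MATCH (vi:VI_PS_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id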

I don't understand why the label makes such a (terrible) difference to the query plan.

To avoid the cartesian product, you are better off constraining the labels separately afterwards and combining those constraints, e.g.:

MATCH (vi)
WHERE vi:first_label AND vi:second_label
RETURN vi;

But obviously add the rest of your query around this. There is an article on this somewhere which probably explains it better; I will try and find it.

Found it :-) How do I report on nodes with multiple labels - Knowledge Base. Hope this helps. Let me know if you have any more questions.

Hi, thanks for the quick response. I tried the suggested change, which appears to have no effect on the query plan.

The following 3 variants produce exactly the same query plan:

PROFILE
MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id

PROFILE
MATCH (vi)-[:VI_PH]->(ph)
WHERE vi:VI_PS_xxxx AND vi:VI_PH_xxxx AND ph:PH_VI_xxxx
RETURN count(DISTINCT ph) AS PH_Id

PROFILE
MATCH (vi) WHERE vi:VI_PS_xxxx AND vi:VI_PH_xxxx
WITH vi
MATCH (vi)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id

Neo4j 3.5.3 Enterprise

Also, although I'm not familiar with the data model, you might be able to further improve this performance. In both profiles you will see an Expand(All) block, which basically has to iterate over all the relationships for each starting node. For example, if you have 10 nodes that match the first MATCH and each has 100 relationships of type :VI_PH, then it is 10 x 100 operations to count.

However, there is metadata on each node holding a precomputed value for the number of relationships by type and direction. For example, if a node has 5 outgoing :VI_PH relationships and 12 incoming :VI_PH relationships, this is recorded in the metadata and so does not require you to iterate over each relationship and count them one by one. To access the metadata you would use something like RETURN size( (:VI_PS_?????)-[:VI_PH]->() );. When using this shortcut the PROFILE will include a GetDegree function, for example:

Projection
15,636 db hits
n, size( (n)-[:IS_ASSIGNED_TO]->() )
{size( (n)-[:IS_ASSIGNED_TO]->() ) : GetDegree(Variable(n),Some(RelTypeName(IS_ASSIGNED_TO)),OUTGOING)}

See How do I improve the performance of counting number of relationships on a node - Knowledge Base for more details
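Applied to your labels, a rough sketch would be something like the block below. Note that size() with only the relationship type and direction is what lets the degree metadata be used, so this counts all outgoing :VI_PH relationships from each vi rather than distinct ph nodes; it is only a drop-in replacement if those numbers are known to be the same in your model (the relCount alias is just illustrative):

MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)
// sums the outgoing :VI_PH degree of each vi via the degree store, no relationship iteration
RETURN sum( size( (vi)-[:VI_PH]->() ) ) AS relCount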

The query you're executing is using COMPILED runtime. I'm curious if prefixing it with CYPHER runtime=SLOTTED or CYPHER runtime=INTERPRETED would get you a better plan. Can you let us know?
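For example, something like:

CYPHER runtime=SLOTTED
PROFILE
MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id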

The plans for SLOTTED and INTERPRETED are identical except that the Expand(Into) portion has 112,173,677 db hits instead of 8,062,221 db hits for COMPILED

Thanks, so it's not a compiled runtime issue.

You are using an older version of 3.5.x. Are you able to test this out on a copied and upgraded db? 3.5.14 is the latest patch release.

Ok, just tried this with 3.5.14 with exactly the same results

Can you also check if the results are the same on 3.5.14 when prefixed with CYPHER runtime=SLOTTED or CYPHER runtime=INTERPRETED?

Just tried it - same results with both.

For Cypher to think a cartesian product and Expand(Into) is the most efficient plan, it must think the cardinality of the combined labels is very low. It could think this if the individual label cardinalities are relatively low and it has no idea of the correlation between them. For example, if the cardinality of (n:A) is 10% of all nodes, and the cardinality of (n:B) is also 10%, we would model the cardinality of (n:A:B) as 1% (i.e. 0.1 * 0.1 = 0.01). But if many :A are also :B, then this calculation is very inaccurate; with full correlation between :A and :B we should use the minimum of the two, which in my example would be 10% instead of 1%.

Could you let us know what the relative cardinalities are? Is there a correlation between nodes of the two types?
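For example, rough counts like these would do:

MATCH (n) RETURN count(n);
MATCH (n:VI_PS_xxxx) RETURN count(n);
MATCH (n:VI_PH_xxxx) RETURN count(n);
MATCH (n:VI_PS_xxxx:VI_PH_xxxx) RETURN count(n);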

I have two more suggestions for making your query faster. One is to try your trick with only one label, and then before the RETURN, add an extra WITH vi, ph, 1 AS ignore WHERE vi:VI_PH_xxxx RETURN DISTINCT(ph).
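Spelled out (assuming :VI_PS_xxxx is the label kept in the pattern; either could stay), that would be roughly:

MATCH (vi:VI_PS_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
WITH vi, ph, 1 AS ignore   // the extra WITH keeps the second label check as a separate filter
WHERE vi:VI_PH_xxxx
RETURN count(DISTINCT ph) AS PH_Id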

The other suggestion is based on the fact that you seem to only want to know the number of unique ph for which there exists any relationship of that pattern, but are not interested in the count of relationships themselves. For this there is a specific syntax:

MATCH (ph:PH_VI_xxxx)
WHERE (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph)
RETURN count(DISTINCT ph) AS PH_Id

Here are the cardinalities:

All nodes              13,601,888
:VI_PH_xxxx                   380
:VI_PS_xxxx                   480
:VI_PH_xxxx:VI_PS_xxxx        380

Your last suggestion (putting the relationship pattern in the WHERE clause) seems to massively improve the query plan. I will see if we can generalise this approach in our query builder code.

I'm glad the suggestion helped. I noticed also that your labels are strongly correlated: every :VI_PH_xxxx node is also a :VI_PS_xxxx node. With your counts, the independence assumption would estimate roughly 480 * 380 / 13,601,888, or about 0.013 nodes, carrying both labels, when the true number is 380, which is why the planner thought the cartesian product would be cheap. The planner certainly does not make good use of that correlation. We have plans to teach it that trick (or rather, to maintain the necessary graph statistics to make that observation), but I'm not sure when that will be done.

In the meantime, I'm very happy the last suggestion helped you get a faster query. The phrase 'massively improve' sounds like music to my ears!