To avoid the cartesian product, you are better off constraining the labels separately afterwards and comparing those constraints, e.g. match (vi) where vi:first_label and vi:second_label return vi;
But obviously add the rest of your query around this. There is an article on this somewhere which probably explains it better; I will try to find it.
Hi, thanks for the quick response. I tried the suggested change, which appears to have no effect on the query plan.
The following 3 variants run exactly the same query plan:
PROFILE
MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id
PROFILE
MATCH (vi)-[:VI_PH]->(ph)
WHERE vi:VI_PS_xxxx AND vi:VI_PH_xxxx AND ph:PH_VI_xxxx
RETURN count(DISTINCT ph) AS PH_Id
PROFILE
MATCH (vi) WHERE vi:VI_PS_xxxx AND vi:VI_PH_xxxx
WITH vi
MATCH (vi)-[:VI_PH]->(ph:PH_VI_xxxx)
RETURN count(DISTINCT ph) AS PH_Id
Also, as I'm not familiar with the data model, you might be able to further improve this performance. In both profiles you will see an Expand(All) block, which basically has to iterate over all the relationships for each starting node. For example, if you have 10 nodes that match the first MATCH and each has 100 relationships of type :VI_PH, then it is 10 × 100 operations to count.

However, there is metadata at each node which holds a precomputed count of its relationships by type and direction. For example, if a node has 5 outgoing relationships named :VI_PH and 12 incoming relationships named :VI_PH, then this data is recorded in the metadata and thus does not require you to iterate over each relationship and count them one by one. Accessing the metadata is accomplished via RETURN size( (:VI_PS_?????)-[:VI_PH]->() ). When using this shortcut the PROFILE will include a GetDegree function, for example:
Projection
15,636 db hits
n, size( (n)-[:IS_ASSIGNED_TO]->() )
{size( (n)-[:IS_ASSIGNED_TO]->() ) : GetDegree(Variable(n),Some(RelTypeName(IS_ASSIGNED_TO)),OUTGOING)}
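As a concrete illustration of that shortcut applied to your pattern (a sketch only, since I don't know your full model — note that this sums relationship counts per node, which is NOT equivalent to count(DISTINCT ph) if several relationships point at the same ph node):

```cypher
// Use the precomputed degree metadata instead of expanding each relationship.
// The PROFILE for this should show GetDegree rather than Expand(All).
PROFILE
MATCH (vi:VI_PS_xxxx:VI_PH_xxxx)
RETURN sum(size((vi)-[:VI_PH]->())) AS relCount
```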
The query you're executing is using COMPILED runtime. I'm curious if prefixing it with CYPHER runtime=SLOTTED or CYPHER runtime=INTERPRETED would get you a better plan. Can you let us know?
The plans for SLOTTED and INTERPRETED are identical, except that the Expand(Into) portion has 112,173,677 db hits instead of the 8,062,221 db hits for COMPILED.
For Cypher to think a cartesian product and Expand(Into) is the most efficient plan, it must believe the cardinality of the combined labels is very low. It could think this if the individual label cardinalities are relatively low, since it has no idea of the correlation between them. For example, if the cardinality of (n:A) is 10% of all nodes, and the cardinality of (n:B) is also 10%, we would model the cardinality of (n:A:B) as 1% (i.e. 0.1 × 0.1 = 0.01). But if many :A are also :B, then this calculation is very inaccurate; for full correlation between :A and :B we should use the minimum of the two, which in my example would be 10% instead of 1%.
Could you let us know what the relative cardinalities are? Is there a correlation between nodes of the two types?
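If it helps, the relevant numbers can be read off with three quick counts (assuming the xxxx labels from your query):

```cypher
// Individual label cardinalities, plus the overlap the planner is guessing at.
MATCH (n:VI_PS_xxxx) RETURN count(n) AS psCount;
MATCH (n:VI_PH_xxxx) RETURN count(n) AS phLabelCount;
MATCH (n:VI_PS_xxxx:VI_PH_xxxx) RETURN count(n) AS bothCount;
```

If bothCount is close to min(psCount, phLabelCount) rather than to their product over the node count, the labels are strongly correlated and the planner's independence assumption is misleading it.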
I have two more suggestions for making your query faster. One is to try your trick with only one label, and then before the RETURN, add an extra WITH vi, ph, 1 AS ignore WHERE vi:VI_PH_xxxx RETURN DISTINCT(ph).
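Spelled out against your query, that first suggestion would look roughly like this (my reading of it, keeping count(DISTINCT ph) as in your earlier variants; the 1 AS ignore is there to keep the WHERE from being folded back into the initial MATCH):

```cypher
// Match on only one label first, then filter the second label late.
MATCH (vi:VI_PS_xxxx)-[:VI_PH]->(ph:PH_VI_xxxx)
WITH vi, ph, 1 AS ignore
WHERE vi:VI_PH_xxxx
RETURN count(DISTINCT ph) AS PH_Id
```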
The other suggestion is based on the fact that you seem to only want to know the number of unique ph for which there exists any relationship of that pattern, but are not interested in the count of relationships themselves. For this there is a specific syntax:
MATCH (ph:PH_VI_xxxx)
WHERE (vi:VI_PS_xxxx:VI_PH_xxxx)-[:VI_PH]->(ph)
RETURN count(DISTINCT ph) AS PH_Id
Your last suggestion, putting the relationship in the WHERE clause, seems to massively improve the query plan. I will see if we can generalise this approach in our query builder code.
I'm glad the suggestion helped. I noticed also that your labels are strongly correlated: every PS node is also a PH node. The planner certainly does not make good use of that information. We have plans to teach it that trick (or rather, to maintain the graph statistics necessary to make that observation), but I am not sure when that will be done.
In the meantime, I'm very happy the last suggestion helped you get a faster query. The phrase 'massively improve' sounds like music to my ears!