Querying on multiple labels and property values appear not to use indexes


(A Scerri) #1

Hi

I've been experimenting with Neo4J 3.5.0 and looking at one particular query. My initial work was based on a Gremlin query which was performing badly, and i've now tried several different constructions in Gremlin as well as now trying various forms in Cypher.

The main thing thing is being able to query for different node labels both of which share a common property and additionally checking for values in a list. Given the size of graph, 500M nodes i'd like to have it use indexes (all of which are created) but so far it seems to resort to a FILTER (for the property match) over the matched nodes (from matching the label).

I've tried the simple form:

MATCH (a) WHERE (a:label1 or a:label2 or a:label3) AND a.propName in ['val1','val2','val3'] RETURN count(a)

Then i've tried several different constructions using more elaborate OR and UNION. The closest i've come to is:

MATCH (a:label1) WHERE a.propName in ['val1','val2','val3'] RETURN a
UNION
MATCH (a:label2) WHERE a.propName in ['val1','val2','val3'] RETURN a
UNION
MATCH (a:label3) WHERE a.propName in ['val1','val2','val3'] RETURN a

However this yields a list of nodes rather than the simpler count i was aiming for.

I was looking for any suggestions for getting the simpler type of query working effeciently, or a way to adapt the more verbose query to get a (distinct) count.

Thanks for any help.

Tony


(Andrew Bowman) #2

Provided that you don't have any nodes multi-labeled with these labels, you could just do 3 OPTIONAL MATCHes and add up the counts for each:

OPTIONAL MATCH (n:label1) 
WHERE n.propName in ['val1','val2','val3'] 
WITH count(n) as count
OPTIONAL MATCH (a:label2) 
WHERE n.propName in ['val1','val2','val3'] 
WITH count + count(n) as count
OPTIONAL MATCH (n:label3) 
WHERE n.propName in ['val1','val2','val3'] 
RETURN count + count(n) as count

If the nodes can be multi-labeled, then the problem becomes more difficult, as we would need to collect the results and de-duplicate and finally get the final count.

That said...the fact that the same propName is used for all 3 labels, and the values can exist across them, suggests a relationship among these kinds of nodes. Would it make sense for your model to add an additional label for nodes of these types, that way you could more generically match using the super label (after creating an index for it)? Alternately, would it make sense to create any graph structures between nodes like this, and match to some common node which is related and use that as a starting point?


(A Scerri) #3

HI

Thanks for the informatio, I had tried something very similar but without the OPTIONAL and got zero, i've tried with the OPTIONAL and that gets me the right count. The graph model in question is something we will be likely looking at but for now i was just getting a first pass over some queries and this seemed like one that should be easy enough to get working optimally.

For the Gremlin equivalent it doesnt look like the use of WITHIN to match the multiple property values may be using the index and using UNIONs or ORs it seemed to fall back to the same kind of query plan but i'll have another play with that one tomorrow, at least for Cypher there is a workable solution.

Thanks again

Tony


(Michael Hunger) #4

It should work too if you just do multiple matches with different variable names.
But then you span up a cartesian product and need to do count(distinct n) which is probably not in your interest.

So Andrew's solution should work best.