I am trying to limit the number of related nodes in my result per label. Let's say I have a node type A that is related to nodes with different labels (X, Y, Z, etc.). (a:A)-[:RELATED_TO]->(b), here b can have either one of the (X, Y, Z, etc.) labels. For each A I want to fetch maximum 5 related nodes of each label. Sometimes one A node can be RELATED_TO thousands of (X, Y, Z, etc.) nodes.
I tried the below query which gives the desired result, but is not optimal because it expands all of a's relationships:
MATCH (a:A)-[:RELATED_TO]->(b) WHERE a.someCondition
WITH labels(b)[0] as label, collect(b)[0..5] as nodes
UNWIND nodes as n
RETURN n
Another way I have tried is matching separately per label with limit and using union:
MATCH (a:A)-[:RELATED_TO]->(b:X) WHERE a.someCondition
RETURN b limit 5
UNION
MATCH (a:A)-[:RELATED_TO]->(b:Y) WHERE a.someCondition
RETURN b limit 5
(... and so on for other labels I want)
The problem with the second approach is that I am performing the same match multiple times for each part of the union.
Is there a better way to get the desired results without having to expand all of A's relationships?
Try this:
MATCH (a:A)-[:RELATED_TO]->(b) WHERE a.someCondition
with a, count(labels(b)) as cnt
with a, cnt where cnt = 5
return distinct labels(b) as lbl, cnt
MATCH (a:Customer)-[:PURCHASED]->(b:Order)
where a.companyName starts with "A"
RETURN
CASE WHEN count(labels(b))>0 THEN
"Country "+a.country+" Order "+b.orderID END as result
Thanks guys. I am not interested in getting the nodes with a certain number of labels, rather a certain number of nodes per label.
Eg. for a path match (a:A)-[:RELATED_TO]->(b) return b, there can be thousands of potential b's. Each b has exactly one label. Lets say the above query returns 90k nodes -> 30k with label X, 30k with label Y, 30k with label Z. I would like to get 5 of each, without having to incur the cost of expandAll on those 90k nodes
I was able to come up with this using apoc that seems promising. I would love to get opinions on whether you guys see any issues with it:
MATCH (a:A) where a.someCondition
CALL apoc.cypher.run('
WITH {a} as a MATCH (a)-[:RELATED_TO]->(b:X) RETURN b limit 5
UNION
WITH {a} as a MATCH (a)-[:RELATED_TO]->(b:Y) RETURN b limit 5
UNION
WITH {a} as a MATCH (a)-[:RELATED_TO]->(b:Z) RETURN b limit 5
', {a:a})
YIELD value RETURN value.b
It seems to return 5 nodes per label without doing full traversal. However, I think it will do full traversal if there are less than 5 nodes for a certain label.
This seems perfect. Thanks! Wondering if there is a way to avoid full traversal when less than 5 nodes are present of a specific label. I guess the only way would be to store counts of related nodes on A for each label type, and do a conditional. Might be a bit of an overkill to avoid an occasional full traversal
Note that with Neo4j 4.1, you can use subqueries for this, within which you can use LIMIT that is scoped only to the subquery (and not applying it to all rows of the outer query).
For example:
UNWIND ['lbl1', 'lbl2'] as label
CALL {
WITH label
MATCH (a:A)-[:RELATED_TO]->(b) WHERE label in labels(b)
RETURN b
LIMIT 10
}
RETURN b
Thanks. Haven't tested it on Neo4j 4.1, but just curios about the expected behavior of this subquery. If there are 50,000 nodes of 'lbl1' and 5 nodes of 'lbl2', will it traverse all 50,005 nodes to get 10 of each? If yes, any suggestions of possible optimizations?
No, I wouldn't think so. The WHERE label in labels(b) part shouldn't be considered by the planner for where to start looking. It should start with a label scan of :A nodes, expanding :RELATED_TO relationships to try to find a connected node of lbl1 or lbl2. Once 10 matches are found it will stop looking.
Oh yeah, got it. Because of unwind the subquery will run once for each label.
My question was regarding the label scan where there are less than 10 nodes for the label (using limit 10). Let's say it's running the subquery for 'lbl2' with limit set to 10. Let's assume that :A has 50,000 neighbors none of which has 'lbl2'. Will the query end up traversing all of the 50,000 neighbors to know that there are no 'lbl2' nodes (that's how it works in the 3.x versions I think), or does Neo4j maintain some sort of count of neighbors per label internally, and will avoid a full traversal?
The behavior is the same. Nodes don't know anything about their neighbors, excepting the relationship types and directions. It would have to expand to and filter all its neighbors in its attempt to find the limited number of results.
If you used more specific relationship types, then you could make this more efficient. If the relationships were :RELATED_TO_LBL1 and :RELATED_TO_LBL2, then it would only select those relationships to traverse (when a node is dense its structure for holding relationships changes, making selection of relationships efficient and avoiding the need to filter all relationships). Since there would be no :RELATED_TO_LBL2 relationships, it wouldn't need to expand or filter anything, just immediately return no results for that subquery invocation.