How to get n nodes per label without expanding all

Imran · August 16, 2020, 12:32am

I am trying to limit the number of related nodes in my result per label. Let's say I have a node type A that is related to nodes with different labels (X, Y, Z, etc.). (a:A)-[:RELATED_TO]->(b), here b can have either one of the (X, Y, Z, etc.) labels. For each A I want to fetch maximum 5 related nodes of each label. Sometimes one A node can be RELATED_TO thousands of (X, Y, Z, etc.) nodes.

I tried the below query which gives the desired result, but is not optimal because it expands all of a's relationships:

MATCH (a:A)-[:RELATED_TO]->(b) WHERE a.someCondition
WITH labels(b)[0] as label, collect(b)[0..5] as nodes
UNWIND nodes as n
RETURN n

Another way I have tried is matching separately per label with limit and using union:

MATCH (a:A)-[:RELATED_TO]->(b:X) WHERE a.someCondition
RETURN b limit 5
UNION
MATCH (a:A)-[:RELATED_TO]->(b:Y) WHERE a.someCondition
RETURN b limit 5
(... and so on for other labels I want)

The problem with the second approach is that I am performing the same match multiple times for each part of the union.
Is there a better way to get the desired results without having to expand all of A's relationships?

ameyasoft · August 16, 2020, 7:13am

Try this:

MATCH (a:A)-[:RELATED_TO]->(b) WHERE a.someCondition
with a, count(labels(b)) as cnt
with a, cnt where cnt = 5
return distinct labels(b) as lbl, cnt

asperlinga · August 16, 2020, 7:24am

or think about CASE statement

(DB Northwind)

MATCH (a:Customer)-[:PURCHASED]->(b:Order)
where a.companyName starts with "A"
RETURN
CASE WHEN count(labels(b))>0 THEN
"Country "+a.country+" Order "+b.orderID END as result

Imran · August 16, 2020, 9:09pm

Thanks guys. I am not interested in getting the nodes with a certain number of labels, rather a certain number of nodes per label.
Eg. for a path match (a:A)-[:RELATED_TO]->(b) return b, there can be thousands of potential b's. Each b has exactly one label. Lets say the above query returns 90k nodes -> 30k with label X, 30k with label Y, 30k with label Z. I would like to get 5 of each, without having to incur the cost of expandAll on those 90k nodes

anthapu · August 16, 2020, 9:48pm

Since you have to do collect there is no way to do without doing full traversal in Cypher.

Best way is to use a store procedure or if you know label distribution of b you can try

In that case you can try this

UNWIND ['lbl1', 'lbl2'] as label
MATCH (a:A)-[:RELATED_TO]->(b) WHERE label in labels(b)
WITH a, b
LIMIT 10
RETURN b

Imran · August 16, 2020, 10:18pm

I think this will return 10 results of just 'lbl1' and won't return 5 of each

Imran · August 16, 2020, 10:29pm

I was able to come up with this using apoc that seems promising. I would love to get opinions on whether you guys see any issues with it:

MATCH (a:A) where a.someCondition
CALL apoc.cypher.run('
WITH {a} as a MATCH (a)-[:RELATED_TO]->(b:X) RETURN b limit 5
UNION
WITH {a} as a MATCH (a)-[:RELATED_TO]->(b:Y) RETURN b limit 5
UNION
WITH {a} as a MATCH (a)-[:RELATED_TO]->(b:Z) RETURN b limit 5
', {a:a})
YIELD value RETURN value.b

It seems to return 5 nodes per label without doing full traversal. However, I think it will do full traversal if there are less than 5 nodes for a certain label.

anthapu · August 16, 2020, 10:34pm

I guess you are limited to a union query or store procedure here.

anthapu · August 16, 2020, 11:32pm

With apoc you can try this

UNWIND ['lbl1', 'lbl2'] as label
CALL apoc.cypher.run('MATCH (a:A)-[:RELATED_TO]->(b:'+label+') RETURN b limit 5', {}) yield value
return value

It returns at the most 5 of each label values

Imran · August 17, 2020, 1:45am

This seems perfect. Thanks! Wondering if there is a way to avoid full traversal when less than 5 nodes are present of a specific label. I guess the only way would be to store counts of related nodes on A for each label type, and do a conditional. Might be a bit of an overkill to avoid an occasional full traversal

andrew_bowman · August 25, 2020, 6:02am

Note that with Neo4j 4.1, you can use subqueries for this, within which you can use LIMIT that is scoped only to the subquery (and not applying it to all rows of the outer query).

For example:

UNWIND ['lbl1', 'lbl2'] as label
CALL { 
  WITH label
  MATCH (a:A)-[:RELATED_TO]->(b) WHERE label in labels(b)
  RETURN b
  LIMIT 10
}
RETURN b

Imran · September 16, 2020, 9:54pm

Thanks. Haven't tested it on Neo4j 4.1, but just curios about the expected behavior of this subquery. If there are 50,000 nodes of 'lbl1' and 5 nodes of 'lbl2', will it traverse all 50,005 nodes to get 10 of each? If yes, any suggestions of possible optimizations?

andrew_bowman · September 16, 2020, 10:32pm

No, I wouldn't think so. The WHERE label in labels(b) part shouldn't be considered by the planner for where to start looking. It should start with a label scan of :A nodes, expanding :RELATED_TO relationships to try to find a connected node of lbl1 or lbl2. Once 10 matches are found it will stop looking.

Imran · September 16, 2020, 11:20pm

Oh okay. In that case, it's possible that it might return 10 nodes of 'lbl1' and 0 nodes of 'lbl2'. My goal is to get at most 5 nodes for each label.

andrew_bowman · September 17, 2020, 12:08am

Ah, I'm a bit mistaken. Because of the UNWIND, the MATCH will find LIMIT 10 per label. If you want it 5 each, make that LIMIT 5.

Imran · September 17, 2020, 12:56am

Oh yeah, got it. Because of unwind the subquery will run once for each label.
My question was regarding the label scan where there are less than 10 nodes for the label (using limit 10). Let's say it's running the subquery for 'lbl2' with limit set to 10. Let's assume that :A has 50,000 neighbors none of which has 'lbl2'. Will the query end up traversing all of the 50,000 neighbors to know that there are no 'lbl2' nodes (that's how it works in the 3.x versions I think), or does Neo4j maintain some sort of count of neighbors per label internally, and will avoid a full traversal?

andrew_bowman · September 17, 2020, 1:15am

The behavior is the same. Nodes don't know anything about their neighbors, excepting the relationship types and directions. It would have to expand to and filter all its neighbors in its attempt to find the limited number of results.

If you used more specific relationship types, then you could make this more efficient. If the relationships were :RELATED_TO_LBL1 and :RELATED_TO_LBL2, then it would only select those relationships to traverse (when a node is dense its structure for holding relationships changes, making selection of relationships efficient and avoiding the need to filter all relationships). Since there would be no :RELATED_TO_LBL2 relationships, it wouldn't need to expand or filter anything, just immediately return no results for that subquery invocation.

Topic		Replies	Views
How to limit on db.index.fulltext.queryNodes per label? Procedures & APOC apoc , cypher , operations	1	437	January 13, 2021
Trying to limit my match to 10 results for a node of a specific type and its related nodes. Limit not working as expected Cypher cypher	1	275	November 19, 2021
Limit for multiple selection of nodes and relationships Cypher cypher	1	310	December 27, 2021
Writing out performant queries for n-relationship queries on a single node where n > 2 Cypher	20	3575	October 14, 2019
Cypher Query to only return a limited amount of connected nodes Cypher cypher	2	58	August 27, 2025

How to get n nodes per label without expanding all

Related topics