Optimizing a query with a subgraph/subquery, only look at specific nodes

eric7 · June 9, 2021, 5:50pm

Hello,

I am working on a relatively complex query in my social graph db (~millions of nodes).
Part of the query determines all the nodes that are relevant to my query (typically <100 nodes).
For example,

match (u1:User {user_id: '1234'})-[:REL]-(u2:User)
with collect(u2) as all_related_users
...

all_related_users are the only user nodes I care about this point, so later on I would like to perform something like this

with all_related_users
match (u3:User)-[:REL]->(u4:User)
where u3 in all_related_users
and u4 in all_related_users
...

However, this doesn't do exactly what I want it to do as when I look at the profiler, the MATCH is run as an Expand(All) incurring a huge amount of db hits and rows (50k) until the Filter in the following is hit. This is obviously slow because it is doing a WHERE IN for 50k rows.
Ideally what I want to do is after calculating all_related_users I could tell Neo4j, "these are the only nodes I care about this point so only run queries on this extremely small subgraph". I've tried doing subqueries with CALL but that doesn't seem to make any difference. Is there anyway to either fix my queries to only run queries on a very small set of nodes, or just create a temporary subgraph inside the query. I'm coming from SQL so maybe my understanding is off.

THANKS SO MUCH!

andrew_bowman · June 9, 2021, 10:34pm

You're close, you can use UNWIND to project elements of a list into rows, so this should be a one-line change:

match (u1:User {user_id: '1234'})-[:REL]-(u2:User)
with collect(u2) as all_related_users
...
UNWIND all_related_users as u3
MATCH (u3)-[:REL]->(u4)
WHERE u4 in all_related_users
...

eric7 · June 10, 2021, 5:58pm

I've tried that but it seems like it still evaluates the match before running the filter.

Here's my code snippet

unwind all_mutual_contacts as u2
match (u2:User)<-[:CONTACT]-(cic:User) // get all incoming contacts
where cic in all_mutual_contacts

Looking at the query planner I see
Expand(All) ~42k hits
Filter ~2.4k hits

If my understanding is right this means where where cic in all_mutal_contacts gets run 42k times where the size of all_mutual_contacts is 245. Furthermore, if we already know all cic, wouldn't this be an Expand(Into), thanks for all the help!

andrew_bowman · June 10, 2021, 7:29pm

There's no way to filter before the expand. You have to find the connected nodes before filtering that the connected nodes are in the list.

Expand(Into) only works when we explicitly have both nodes as variables, so you would need to UNWIND the list a second time, generating a cross product, and that would result in far more expansions. You can try it if you like.

Also, pay attention to the rows in the query plan. The more rows there are, the more work needs to be done for the subsequent operation.

eric7 · June 10, 2021, 9:36pm

Thanks Andrew, the cross product and Expand(Into) does indeed create more expansions, I'm guessing because now the input is much higher. I will keep digging but thank you for your help!

I guess I will look into parallelizing my code in this case then. Thanks!

Topic		Replies	Views
Cypher Query to only return a limited amount of connected nodes Cypher cypher	2	235	August 27, 2025
Is there any way of querying on small part of a big graph in graph db Cypher cypher , knowledge-base	3	328	August 3, 2020
Questions about my query model Cypher querying , optimization , cypher , subquery	13	305	March 6, 2025
Query all complete net of a specific node Cypher querying , cypher	8	437	November 3, 2021
Query to Connect Result Nodes Cypher cypher	1	708	December 18, 2020

Take the Course Then Join The Aura Agent Hackathon

Optimizing a query with a subgraph/subquery, only look at specific nodes

Related topics

Take the Course Then Join
The Aura Agent Hackathon