Optimizing a query with a subgraph/subquery, only look at specific nodes

eric7
Node

Hello,

I am working on a relatively complex query in my social graph db (~millions of nodes).
Part of the query determines all the nodes that are relevant to my query (typically <100 nodes).
For example,

match (u1:User {user_id: '1234'})-[:REL]-(u2:User)
with collect(u2) as all_related_users
...

all_related_users are the only user nodes I care about this point, so later on I would like to perform something like this

with all_related_users
match (u3:User)-[:REL]->(u4:User)
where u3 in all_related_users
and u4 in all_related_users
...

However, this doesn't do exactly what I want it to do as when I look at the profiler, the MATCH is run as an Expand(All) incurring a huge amount of db hits and rows (50k) until the Filter in the following is hit. This is obviously slow because it is doing a WHERE IN for 50k rows.
Ideally what I want to do is after calculating all_related_users I could tell Neo4j, "these are the only nodes I care about this point so only run queries on this extremely small subgraph". I've tried doing subqueries with CALL but that doesn't seem to make any difference. Is there anyway to either fix my queries to only run queries on a very small set of nodes, or just create a temporary subgraph inside the query. I'm coming from SQL so maybe my understanding is off.

THANKS SO MUCH!

4 REPLIES 4

You're close, you can use UNWIND to project elements of a list into rows, so this should be a one-line change:

match (u1:User {user_id: '1234'})-[:REL]-(u2:User)
with collect(u2) as all_related_users
...
UNWIND all_related_users as u3
MATCH (u3)-[:REL]->(u4)
WHERE u4 in all_related_users
...

I've tried that but it seems like it still evaluates the match before running the filter.

Here's my code snippet

unwind all_mutual_contacts as u2
match (u2:User)<-[:CONTACT]-(cic:User) // get all incoming contacts
where cic in all_mutual_contacts

Looking at the query planner I see
Expand(All) ~42k hits
Filter ~2.4k hits

If my understanding is right this means where where cic in all_mutal_contacts gets run 42k times where the size of all_mutual_contacts is 245. Furthermore, if we already know all cic, wouldn't this be an Expand(Into), thanks for all the help!

There's no way to filter before the expand. You have to find the connected nodes before filtering that the connected nodes are in the list.

Expand(Into) only works when we explicitly have both nodes as variables, so you would need to UNWIND the list a second time, generating a cross product, and that would result in far more expansions. You can try it if you like.

Also, pay attention to the rows in the query plan. The more rows there are, the more work needs to be done for the subsequent operation.

eric7
Node

Thanks Andrew, the cross product and Expand(Into) does indeed create more expansions, I'm guessing because now the input is much higher. I will keep digging but thank you for your help!

I guess I will look into parallelizing my code in this case then. Thanks!