Find all User nodes where there is NO relationship between the "user" Node and "group_member" Nodes (ie, they are not in any of the same groups), and also filter by certain properties of the "group_member" Nodes. (I'm having a hard time figuring this one out)
Find all User nodes where there is a specific relationship between the "user" Node and "group_member" Nodes, and also filter by certain properties of the "group_member" Nodes. (I have this subquery working, so I don't need help here).
UNION 1 & 2 and then execute a fairly complicated sort and limit.
Of course if there is a way to query 1 & 2 in the same "MATCH/WHERE" that would be preferred but I'm not seeing a way to do this.
Any pointers in the right direction would be awesome!
would something like this work? Strategy: find user1 node, and the list of all users to check against, then find all users who you can't reach user1 through any group node?
MATCH (user1:User {name:"bob"}),(user2:User)
WHERE NOT (user1)-[:MEMBERSHIP]->(g:Group)<-[:MEMBERSHIP]-(user2)
AND g.name='filter me in' // group property restriction
return user2.name
Do you have a small test set you could share? (create statements for test set would help)
MATCH (user1:User {user_id:166}),(user2:User)
WHERE (NOT (user1)-[:MEMBERSHIP]->(:Group)<-[:MEMBERSHIP]-(user2)
AND distance(user1.location, user2.location) < 20000)
RETURN user2 AS user
MATCH (user1:User {user_id:166})-[:MEMBERSHIP]->(:Group)<-[user2_rel:MEMBERSHIP]-(user2:User)
WHERE user2_rel.status='left' AND distance(user1.location, user2.location) < 20000
RETURN user2 AS user
Basically, the first returns all users that are not in any groups with user1 that are within 20 km of user1. The second returns all users that have left a group that they were in with user1 and are within 20 km.
I'd like to combine both sets, sort the union by certain properties of user2, and them limit to a specific number. I know that using UNION is slow and also you have very limited options on the UNION'ed set. So it seems like the best thing would be to combine both into one statement if it can be done.
It is best to not spend time optimizing until it is working and even then, only if required. You should be able to put a UNION between those two, to concatenate the results.
simple example below, here I'm renaming column names so I can union the results
RETURN 'apoc.version' as lib_name, apoc.version() as version
UNION
RETURN 'gds.version' as lib_name, gds.version() as version
Note: If you are neo4j 4.x you can also do this
CALL {
RETURN 'apoc.version' as lib_name, apoc.version() as version
UNION
RETURN 'gds.version' as lib_name, gds.version() as version
}
return lib_name, version
order by lib_name desc
So, I now am able to construct a full query that works. The part of the query that deals with existing relationships is performing well. However the part that is pulling no relationships is performing terribly.
Here is that query:
MATCH (current_user:GraphUser {user_id:166}),(potential_user:GraphUser)
WITH current_user AS current_user, potential_user AS potential_user,
distance(current_user.location,potential_user.location) AS distance
WHERE (NOT (current_user)-[:GRAPH_MEMBERSHIP]->()<-[:GRAPH_MEMBERSHIP]-(potential_user))
AND distance < 20000
RETURN potential_user, distance
ORDER BY distance ASC
LIMIT 10
There are about 100K GraphUser nodes, so not even that many.
Here is the profile output:
I don't understand why the filtering needs to do millions of db hits for a set of under 100k rows, but i think it has something to do with the Cartesian product. But at over a second, this is performing much worse than the statement in SQL.
Thanks for the help so far, maybe I'm missing something obvious.
One tip I learned from the video, is you can move up the LIMIT command so that the query will end earlier (which is different from SQL.). I haven't quite got the hang of it myself, so I'm not sure exactly how to transform your query.
Try doing the filtering in two steps, I'd first try filtering for distance < 20000, then WITH those results apply the WHERE NOT
If/how this helps optimize depends on the data content, but what we're trying to do is quickly narrow the cartesian product list down to something much smaller as fast as possible, so you might also try applying the WHERE NOT first, and then WITH those results calculate the distance of the remaining results applying the distance<20000 filter second. My gut tells me to try the simple distance check first, but you never know for sure until you try it, I really enjoy the surprises..