Optimizing Subgraph Queries for Low Latency API calls

apoc
performance
cypher
cloud

(Benjamin Squire) #1

I am working on a project which will interact with heavy loads from front end websites for user identity syncing and I need to achieve extremely low latency in the order of 50-70ms at a 95th percentile range. The nature of the graph is comprised of first party ids as well as other information, such as Email or subscription ids. The rough model is (User {user_id:'1234'})-->(Ids). There is only one relationship type in the graph and there are multiple labels for Ids, each of which has its on Id key indexed, i.e Email nodes have email_id indexed, Subscriber nodes have subscriber_id. Users only have outbound relationships and Ids such as Email_id or Subscription_id only have inbound relationships.

The api works as follows, given an input User_Id "1234" return all Users related to "1234" which are connected by some (Ids). Some nodes are marked Bad because they link too many Users together, these may be deleted in the future, or the relationship label can be changed to allow for the second query to work.

My question is which is faster method for returning related users, apoc.path.subgraphAll() or (u:User)--(u2).

Here are a few queries that I am currently using/considering:

Match (u:User {user_id:'1234'}) with u  CALL apoc.path.subgraphAll(u, {maxLevel:10,labelFilter:"-Bad_Nodes|-OtherBadNodes|>User",filterStartNode:False}) YIELD nodes return nodes;
Match (u:User {user_id:'1234'})-[*1..10]-(u2:User) return u2.user_id ;

Would it be faster to link connected users together ahead of time?, This would change my graph model but making hyper edges that point either way or purposefully creating two relationships (User1)-->(User2) and (User2)-->(User1) or just ignore direction in the query (User1)--(User2), would this be faster?

Side question: What would be the best/fastest way to interact with the Neo4j Graph given some serverless architecture like Chalice or Lambda, Python/Java/other? These should be small fast transactions.

Using Community 3.4.9
EC2 in AWS with 244 GB RAM and 32 Procs


(Andrew Bowman) #2

I'd favor subgraphNodes() over subgraphAll(), as subgraphAll() will attempt to find all relationships between the found nodes, and you seem to be only after distinct node ids.

As far as filtering out / pruning paths with bad nodes during expansion, and only returning User nodes, the config parameters look good to me, though keep in mind that the starting node will also be returned (so you might want to SKIP 1 to avoid it).

In a moderately-to-heavy connected graph the Cypher expansion alone is not likely to perform well, as it will be attempting to find all possible paths, which could traverse the same nodes multiple times during expansion. Adjusting the query to find distinct nodes will help a little, but it's likely APOC's subgraphNodes() will perform better.

MATCH (u:User {user_id:'1234'})-[*1..10]-(u2:User) 
WITH DISTINCT u2
RETURN u2.user_id ;

(Benjamin Squire) #3

Thanks Andrew,
Will parameterization work to help cache query planning with Apoc procedures? What about using Optional Match criteria with a parameterized user_id, might I see a better performance?


(Andrew Bowman) #4

Where you can parameterize, you should, so yes.

I'm not quite sure what you mean by "Optional Match criteria with a parameterized user_id". Do you mean that you don't know if the id provided will actually match to a node, and you want to handle that case? If you want to ensure that rows don't get wiped out when you do the expansion (such as if you have post-processing to do after the subgraphNodes() call that needs to be done whether or not there are connected users to the starting node), you can add optional:true to the config parameters of the procedure.


(Benjamin Squire) #5

Something like this as opposed to the Apoc Procedure subgraphNodes, note that 6 nodes away by definition of the graph is equivalent to 10 hops in SubgraphNodes:

:param {user_id:'1234'}
Match (u: User {user_id: $user_id})
Optional Match (u)-->(id:Email)<--(u2)
Optional Match (u)-->(id:Subscription)<--(u2)
Optional Match (u2)-->(id2:Email)<--(u3)
Optional Match (u2)-->(id2:Subscription)<--(u3)
...
Optional Match (u5)-->(id5:Email)<--(u6)
Optional Match (u5)-->(id5:Subscription)<--(u6)
with COLLECT(u)+COLLECT(u2)+COLLECT(u3)+...+COLLECT(u6) as bigcol
UNWIND bigcol AS u
return u

Would this style be faster due to Query Cache and Planning compared with Apoc.Path.SubgraphNodes +Parameter?


(Andrew Bowman) #6

I don't believe it would be faster, but it does depend upon the interconnectedness of your graph.

The behavior in subgraphNodes() that tends to make it faster is that it will only visit each node encountered once, which can drastically cut down on traversals as it won't ever repeat traversal of previously traversed nodes and relationships.

Cypher on the other hand is designed to find all possible paths, so if multiple paths end up traversing through the same nodes, then those nodes can be revisited multiple times, and the same paths traversed multiple times, basically multiplicative and redundant expansions and filterings.

As for the query caching you get with parameterization, the only place here to parameterize is on the user_id, so both queries can take advantage of that.


(Benjamin Squire) #7

Excellent! Thank you for your help. Note: Max DeMarzi linked me to this which I am considering too if SubGraphNodes doesn't meet my latency requirements. https://maxdemarzi.com/2018/10/01/finding-your-neighbors-using-neo4j/


(Michael Hunger) #8

How many nodes do you have in your graph?
I think if you have 1..10 level, the something like Max' procedure makes most sense at your SLA.


(Benjamin Squire) #9

In total there at 5.4 Billions nodes: 1.6 Billion Users and a few Billion of each of the Id nodes and currently I am . Most connected subgraphs in the nodes are in the range of 6-40~50 nodes. For the few nodes which do show hyper connectivity such as a single ID nodes with Millions of users I am flagging and relabeling/deleting those connections.

I will try and parse out what Max is doing, I am a little rusty on Java but can make my way through. I am doing a thorough statistical performance testing on raw cypher vs apoc.path.subgraphNode vs Maxes method to show runtimes on across the entire graph, will post results on a blog post in the community when I complete.


(Michael Hunger) #10

You should also use enterprise which has the better runtime.
If you really need that low latency on your queries you have to make sure the relevant data is in the page-cache.

Not using any rel-types in your query is not helping either :slight_smile:

You should try to either use this:

Match (u:User {user_id:'1234'})-[*1..10]-(u2:User) with distinct u2 return u2.user_id ;

Or if possible spell out the steps:

 Match (u:User {user_id:'1234'})-->()-->(u2) WITH distinct u2
 match (u2)-->(u3) WITH distinct u3
 ...
 WITH distinct u10
 RETURN u10.user_id

But the procedure with better control over expansion and already visited nodes will give you the best performance.