Query Tuning Help - Lots of DB hits?

I have a relative simply query that I'm trying to improve performance on. The query serves up a catalog of projects and we want to display some metrics of those projects when they browse the catalog. This catalog receives a lot of traffic (1,000+ calls per minute) and this most called and longest running query. The query receives a list of 50 project ids, as the user scrolls there's an async call to fetch the next 50.

The goals for the query is to return per project:

  • Overall LIKE count
  • Overall FAVORITE count
  • Overall SHARE count
  • For the user at the keyboard if they've liked (boolean) the project
  • For the user at the keyboard if they've favorited (boolean) the project

Data Model

Cypher Query

MATCH (proj:Project)
WHERE proj.project_id IN $projectIds
RETURN
  proj.project_id AS project_id,
  SIZE( (:Profile)-[:LIKE]->(proj) ) AS likeCount,
  SIZE( (:Profile)-[:FAVORITE]->(proj) ) AS bmCount,
  SIZE( (:Profile)-[:SHARE]->(proj) ) AS shareCount,
  SIZE( (:Profile {profile_id: $profileId})-[:LIKE]->(proj) ) > 0 AS currentProfLiked,
  SIZE( (:Profile {profile_id: $profileId})-[:FAVORITE]->(proj) ) > 0 AS currentProfBooked
;

*Note the use of parameters so the DB should be able to cache the query plan

Execution Plan

My assumptions about the execution is that many of these projects are very popular so they have a lot of incoming relationships, so even though I'm using query projection to sum up their count, that's still a lot of data to calculate which why I'm seeing the DB hits.

Another assumption I have, is because in each query we also calculate the individual user's interaction with the project list, the database can't do any result caching because the parameter values are always changing. So even if the catalog is the same for everyone, the fact that their profile is in there, the results can't be cached for the next user. I'm wondering if I should split this into two separate queries, one for the project perspective, then a second query for the profile perspective?

I'd like to solve this in Neo4j but if we can't, we've talked about caching the results up in Redis to alleviate the load on the database. I want to avoid this because I love Neo4j so I like to use it to fullest extent, I'm also not a fan of caches because then you're managing stale data, and I think it's a bad habit to use a cache to hide a bad query when I should instead learn how to write better queries.

At a minimum try rewriting the initial query to

MATCH (proj:Project)
WHERE proj.project_id IN $projectIds
RETURN
  proj.project_id AS project_id,
  SIZE( ()-[:LIKE]->(proj) ) AS likeCount,
  SIZE( ()-[:FAVORITE]->(proj) ) AS bmCount,
  SIZE( ()-[:SHARE]->(proj) ) AS shareCount,
  SIZE( (:Profile {profile_id: $profileId})-[:LIKE]->(proj) ) > 0 AS currentProfLiked,
  SIZE( (:Profile {profile_id: $profileId})-[:FAVORITE]->(proj) ) > 0 AS currentProfBooked
;

to which I have changed line 5,6,7. Provided the model is as you describe such that a :Profile only has :LIKE, :FAVORITE or :SHARE relationship to a :Project node the above change would produce the same results and should be faster. How much faster ???

Internally every node has metadata which describes the number of incoming/outgoing relationships and by direction. However the metadata does not record incoming/outgoing relationships and by direction and to destination node Label. So if you want to ask how many :LIKE relationships does node with label :Project and proj_id=123 and as incoming then this is a 1 dbhit operation. However if you want to ask how many :LIKE relationships does node with label :Project and proj_id=123 and as incoming to a node with Label :Profile then we need to iterate through all the :LIKE relationships of the :Project node and check its relationship type and direction. If you preface the query with PROFILE the you will see the last Projection block will have detail similar to

shareCount : GetDegree(
Variable(proj),Some(
RelTypeName(SHARE)),INCOMING
), currentProfBooked : SIZE(
`anon[314]`) > $` AUTOINT4`
, currentProfLiked : SIZE(`
anon[236]`) > $` AUTOINT2`,
project_id : proj.project_
id, bmCount : GetDegree(
Variable(proj),Some(
RelTypeName(FAVORITE)),
INCOMING), likeCount :
GetDegree(Variable(proj),
Some(RelTypeName(LIKE)),
INCOMING)

the reference to GetDegree in this block would indicat we are simply looking at the internal metadata for the node and not actually iterating over each relationship.

Also regarding caches the usage of $parameters should mean the query plan itself is cached, so each run of the query regardless of $profile_id should not require a replan. Results are not cached. Rather data that has been accessed/traversed is cached. How large is your graph ( this should be generally available if via the Neo4j Browser you run :sysinfo and with some interpretation. And relative to the size of your graph how have you configured dbms.memory.pagecache.size in conf/neo4j.con

1 Like

Your suggestion helped a lot!

The original query had 1552892 total db hits in 8362 ms. By removing the remove the node label in the SIZE() aggregation improves the performance to 854 total db hits in 442 ms. A 18.91 times faster! Sub-second response time.

1 Like

Great improvement, thank you @dana.canzano!

you might also be able to fetch the (:Profile {profile_id: $profileId}) once at the beginning (even before the projects) and then use that variable and instead using size() perhaps use exists( (profile)-[:LIKE]->(proj) )