Query Tuning Help - Lots of DB hits?

mike_r_black · March 17, 2020, 3:41pm

I have a relative simply query that I'm trying to improve performance on. The query serves up a catalog of projects and we want to display some metrics of those projects when they browse the catalog. This catalog receives a lot of traffic (1,000+ calls per minute) and this most called and longest running query. The query receives a list of 50 project ids, as the user scrolls there's an async call to fetch the next 50.

The goals for the query is to return per project:

Overall LIKE count
Overall FAVORITE count
Overall SHARE count
For the user at the keyboard if they've liked (boolean) the project
For the user at the keyboard if they've favorited (boolean) the project

Data Model

Cypher Query

MATCH (proj:Project)
WHERE proj.project_id IN $projectIds
RETURN
  proj.project_id AS project_id,
  SIZE( (:Profile)-[:LIKE]->(proj) ) AS likeCount,
  SIZE( (:Profile)-[:FAVORITE]->(proj) ) AS bmCount,
  SIZE( (:Profile)-[:SHARE]->(proj) ) AS shareCount,
  SIZE( (:Profile {profile_id: $profileId})-[:LIKE]->(proj) ) > 0 AS currentProfLiked,
  SIZE( (:Profile {profile_id: $profileId})-[:FAVORITE]->(proj) ) > 0 AS currentProfBooked
;

*Note the use of parameters so the DB should be able to cache the query plan

Execution Plan

My assumptions about the execution is that many of these projects are very popular so they have a lot of incoming relationships, so even though I'm using query projection to sum up their count, that's still a lot of data to calculate which why I'm seeing the DB hits.

Another assumption I have, is because in each query we also calculate the individual user's interaction with the project list, the database can't do any result caching because the parameter values are always changing. So even if the catalog is the same for everyone, the fact that their profile is in there, the results can't be cached for the next user. I'm wondering if I should split this into two separate queries, one for the project perspective, then a second query for the profile perspective?

I'd like to solve this in Neo4j but if we can't, we've talked about caching the results up in Redis to alleviate the load on the database. I want to avoid this because I love Neo4j so I like to use it to fullest extent, I'm also not a fan of caches because then you're managing stale data, and I think it's a bad habit to use a cache to hide a bad query when I should instead learn how to write better queries.

dana_canzano · March 23, 2020, 12:33am

At a minimum try rewriting the initial query to

MATCH (proj:Project)
WHERE proj.project_id IN $projectIds
RETURN
  proj.project_id AS project_id,
  SIZE( ()-[:LIKE]->(proj) ) AS likeCount,
  SIZE( ()-[:FAVORITE]->(proj) ) AS bmCount,
  SIZE( ()-[:SHARE]->(proj) ) AS shareCount,
  SIZE( (:Profile {profile_id: $profileId})-[:LIKE]->(proj) ) > 0 AS currentProfLiked,
  SIZE( (:Profile {profile_id: $profileId})-[:FAVORITE]->(proj) ) > 0 AS currentProfBooked
;

to which I have changed line 5,6,7. Provided the model is as you describe such that a :Profile only has :LIKE, :FAVORITE or :SHARE relationship to a :Project node the above change would produce the same results and should be faster. How much faster ???

Internally every node has metadata which describes the number of incoming/outgoing relationships and by direction. However the metadata does not record incoming/outgoing relationships and by direction and to destination node Label. So if you want to ask how many :LIKE relationships does node with label :Project and proj_id=123 and as incoming then this is a 1 dbhit operation. However if you want to ask how many :LIKE relationships does node with label :Project and proj_id=123 and as incoming to a node with Label :Profile then we need to iterate through all the :LIKE relationships of the :Project node and check its relationship type and direction. If you preface the query with PROFILE the you will see the last Projection block will have detail similar to

shareCount : GetDegree(
Variable(proj),Some(
RelTypeName(SHARE)),INCOMING
), currentProfBooked : SIZE(
`anon[314]`) > $` AUTOINT4`
, currentProfLiked : SIZE(`
anon[236]`) > $` AUTOINT2`,
project_id : proj.project_
id, bmCount : GetDegree(
Variable(proj),Some(
RelTypeName(FAVORITE)),
INCOMING), likeCount :
GetDegree(Variable(proj),
Some(RelTypeName(LIKE)),
INCOMING)

the reference to GetDegree in this block would indicat we are simply looking at the internal metadata for the node and not actually iterating over each relationship.

Also regarding caches the usage of $parameters should mean the query plan itself is cached, so each run of the query regardless of $profile_id should not require a replan. Results are not cached. Rather data that has been accessed/traversed is cached. How large is your graph ( this should be generally available if via the Neo4j Browser you run :sysinfo and with some interpretation. And relative to the size of your graph how have you configured dbms.memory.pagecache.size in conf/neo4j.con

mike_r_black · March 24, 2020, 3:40am

Your suggestion helped a lot!

The original query had 1552892 total db hits in 8362 ms. By removing the remove the node label in the SIZE() aggregation improves the performance to 854 total db hits in 442 ms. A 18.91 times faster! Sub-second response time.

michael.hunger · March 24, 2020, 11:53am

Great improvement, thank you @dana_canzano!

you might also be able to fetch the (:Profile {profile_id: $profileId}) once at the beginning (even before the projects) and then use that variable and instead using size() perhaps use exists( (profile)-[:LIKE]->(proj) )

Topic		Replies	Views
Why so many DB hits? Cypher performance , cypher	2	296	March 23, 2023
Trying to squeeze as much performance as I can out of this complex query Cypher	2	507	December 31, 2019
Query Performance/Tuning on 10-15M nodes Cypher performance , cypher	1	238	July 12, 2021
When using PROFILE, should we optimize db hits or time in ms of the first time run? Cypher	2	155	October 9, 2023
Performance query Cypher	0	756	October 29, 2018

July Summer Fun!

Query Tuning Help - Lots of DB hits?

Related topics