I have a relative simply query that I'm trying to improve performance on. The query serves up a catalog of projects and we want to display some metrics of those projects when they browse the catalog. This catalog receives a lot of traffic (1,000+ calls per minute) and this most called and longest running query. The query receives a list of 50 project ids, as the user scrolls there's an async call to fetch the next 50.
The goals for the query is to return per project:
- Overall
LIKE
count - Overall
FAVORITE
count - Overall
SHARE
count - For the user at the keyboard if they've liked (boolean) the project
- For the user at the keyboard if they've favorited (boolean) the project
Data Model
Cypher Query
MATCH (proj:Project)
WHERE proj.project_id IN $projectIds
RETURN
proj.project_id AS project_id,
SIZE( (:Profile)-[:LIKE]->(proj) ) AS likeCount,
SIZE( (:Profile)-[:FAVORITE]->(proj) ) AS bmCount,
SIZE( (:Profile)-[:SHARE]->(proj) ) AS shareCount,
SIZE( (:Profile {profile_id: $profileId})-[:LIKE]->(proj) ) > 0 AS currentProfLiked,
SIZE( (:Profile {profile_id: $profileId})-[:FAVORITE]->(proj) ) > 0 AS currentProfBooked
;
*Note the use of parameters so the DB should be able to cache the query plan
Execution Plan
My assumptions about the execution is that many of these projects are very popular so they have a lot of incoming relationships, so even though I'm using query projection to sum up their count, that's still a lot of data to calculate which why I'm seeing the DB hits.
Another assumption I have, is because in each query we also calculate the individual user's interaction with the project list, the database can't do any result caching because the parameter values are always changing. So even if the catalog is the same for everyone, the fact that their profile is in there, the results can't be cached for the next user. I'm wondering if I should split this into two separate queries, one for the project perspective, then a second query for the profile perspective?
I'd like to solve this in Neo4j but if we can't, we've talked about caching the results up in Redis to alleviate the load on the database. I want to avoid this because I love Neo4j so I like to use it to fullest extent, I'm also not a fan of caches because then you're managing stale data, and I think it's a bad habit to use a cache to hide a bad query when I should instead learn how to write better queries.