Do aggregations not make use of page cache?
I have a database with 4.6G of data, and 3.3G of indexes, the page cache is 10G, and is currently using 7.9G.
My queries are showing millions of db hits, even after all the db is loaded in memory?
Do aggregations not make use of page cache?
I have a database with 4.6G of data, and 3.3G of indexes, the page cache is 10G, and is currently using 7.9G.
My queries are showing millions of db hits, even after all the db is loaded in memory?
As far as I know, the number of dbhits has nothing to do with whether the graph is loaded into memory or not. The way you write your cypher and the number of rows produced by different components of your cypher determine the number of dbhits. A dbhit does not necessary mean an access to the disk. You can write a simple query that retrieves one single node using index. You will see that even after running the cypher once, the number of dbhits does not change although the data is in memory now. You can read more about dbhits here: Execution plans - Cypher Manual
Thanks a lot for the info and link @shan.
I'll go over the link now. Do you also happen to know how we can optimize a query like this to return as quickly as possible?
MATCH (t:Track)-[:HAS_GENRE]->(g:Genre)
WHERE g.name IN ['rock', 'metal']
RETURN t.name, collect(DISTINCT g.name) AS genres, count(DISTINCT g) AS score ORDER BY score DESC LIMIT 20
I have about 7Million tracks and 2K genres.
Currently the aggregations in their expand and filter operations are hitting the db several million times, and making the query very slow.
No problem.
I am afraid nothing comes to mind in order to be improve your cypher performance. You are hitting those 7M tracks so it'll be slow.
If you're just needing to return aggregations, I've found better query plans and performance by using the Pattern Comprehension technique. This saves the database from actually having to read any of data that you might have told it fetch in the MATCH
clause.
Here's an example of how a rewrite of you query might look, though you'll might need to adjust it.
MATCH (t:Track)
RETURN t.name,
SIZE( (t)-[:HAS_GENRE]->(g:Genre {name: 'rock'}) ) > 0 AS has_rock,
SIZE( (t)-[:HAS_GENRE]->(g:Genre {name: 'metal'}) ) > 0 AS has_metal,
SIZE( (t)-[:HAS_GENRE]->(g:Genre) ) AS genre_count
ORDER BY genre_count DESC
LIMIT 20
Thanks a lot @mike_r_black!
This is very interesting, I'll try it right now, and get back to you.
Happy thanksgiving!
Hello @mike_r_black,
So I tried this query:
MATCH (t:Track)
RETURN t.name,
SIZE( (t)-[:HAS_GENRE]->(:Genre {name: 'rock'}) ) > 0 AS has_rock,
SIZE( (t)-[:HAS_GENRE]->(:Genre {name: 'metal'}) ) > 0 AS has_metal,
SIZE( (t)-[:HAS_GENRE]->(:Genre) ) AS genre_count
ORDER BY genre_count DESC
LIMIT 20
Profile:
This turned out to be even more costly than the previous one unfortunately.
Maybe, I am doing a bad job explaining the problem I have. What I want basically is if I query for ['rock', 'metal'], I want to return the tracks that have both of them first, and then tracks who have only one of them.
I want the query to return in less that a few hundred milliseconds.
I have already looked at match intersection, and learned about the count method from there, which obviously is too slow.
Is there any method that might help solve this problem?
This query would return tracks that have both genres
MATCH (g1:Genre {name: 'rock'})<-[:HAS_GENRE]-(t:Track)-[:HAS_GENRE]->(g2:Genre {name: 'metal'})
RETURN t
This would be the query for track that has one genre but not the other
MATCH (t:Track)-[:HAS_GENRE]->(g2:Genre {name: 'metal'})
WHERE NOT (:Genre {name: 'rock'})<-[:HAS_GENRE]-(t)
RETURN t
Is this what you're trying to achieve? I assume you have an index on Genre.name
to help speed up the seeking of the genre nodes?
Thanks again for the answer @mike_r_black.
I assume you have an index on
Genre.name
to help speed up the seeking of the genre nodes?
Yes.
Is this what you're trying to achieve?
Not exactly. A few followup questions:
If the query to find track with both genres fails, I want to find tracks that have either of the genres. These queries might not cover that use case right?
What if there are more than 2 genres I want to query on? That would definitely require something like IN
right?
It's best to access properties as late as possible.
So a simplistic optimization would be:
MATCH (t:Track)-[:HAS_GENRE]->(g:Genre)
WHERE g.name IN ['rock', 'metal']
WITH t, collect(g) AS genres, count(g) AS score ORDER BY score DESC LIMIT 20
RETURN t.name, [g in genres | g.name] as genres, score
But I presume you actually want to have all genres of each track, not just your 2.
MATCH (t:Track)-[:HAS_GENRE]->(g:Genre) WHERE g.name IN ['rock', 'metal']
WITH t, size( (t)-[:HAS_GENRE]->() ) as score
ORDER by score DESC LIMIT 20
MATCH (t)-[:HAS_GENRE]->(g)
RETURN t.name, collect(g.name) AS genres, score