Does count aggregation not use page cache?

shash · November 28, 2019, 2:42pm

Do aggregations not make use of page cache?

I have a database with 4.6G of data, and 3.3G of indexes, the page cache is 10G, and is currently using 7.9G.

Why are my queries showing millions of db hits even after the whole database is loaded into memory?

shan · November 28, 2019, 6:57pm

As far as I know, the number of db hits has nothing to do with whether the graph is loaded into memory. The way you write your Cypher and the number of rows produced by each part of the query determine the number of db hits. A db hit does not necessarily mean a disk access. You can write a simple query that retrieves a single node using an index, and you will see that the number of db hits does not change on the second run, even though the data is now in memory. You can read more about db hits here: Execution plans - Cypher Manual
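
The point above can be checked directly. This is a minimal sketch, assuming the `Genre` label and `name` property that appear later in this thread: run it twice and compare the db hits reported for each operator. They should be the same on both runs, even though the second run is served from the page cache.

```cypher
// PROFILE runs the query and reports rows and db hits per operator.
// Assumes a :Genre label with an indexed `name` property.
PROFILE
MATCH (g:Genre {name: 'rock'})
RETURN g
```

A db hit here is a unit of storage-engine work (for example, reading a node or a property), so the count reflects the shape of the plan rather than where the data was read from.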

shash · November 28, 2019, 7:11pm

Thanks a lot for the info and link @shan.

I'll go over the link now. Do you also happen to know how we can optimize a query like this to return as quickly as possible?

MATCH (t:Track)-[:HAS_GENRE]->(g:Genre) 
WHERE g.name IN ['rock', 'metal']
RETURN t.name, collect(DISTINCT g.name) AS genres, count(DISTINCT g) AS score ORDER BY score DESC LIMIT 20

I have about 7 million tracks and 2K genres.

Currently the aggregations in their expand and filter operations are hitting the db several million times, and making the query very slow.

shan · November 28, 2019, 8:05pm

No problem.

I am afraid nothing comes to mind that would improve your Cypher performance. You are hitting those 7M tracks, so it'll be slow.

mike_r_black · November 29, 2019, 3:39am

If you just need to return aggregations, I've found better query plans and performance by using the Pattern Comprehension technique. This saves the database from actually having to read any of the data that you might have told it to fetch in the MATCH clause.

Here's an example of how a rewrite of your query might look, though you might need to adjust it.

MATCH (t:Track)
RETURN t.name, 
    SIZE( (t)-[:HAS_GENRE]->(g:Genre {name: 'rock'}) ) > 0 AS has_rock,
    SIZE( (t)-[:HAS_GENRE]->(g:Genre {name: 'metal'}) ) > 0 AS has_metal,
    SIZE( (t)-[:HAS_GENRE]->(g:Genre) ) AS genre_count
ORDER BY genre_count DESC 
LIMIT 20

shash · November 29, 2019, 3:49am

Thanks a lot @mike_r_black!

This is very interesting, I'll try it right now, and get back to you.
Happy thanksgiving!

shash · November 29, 2019, 4:26am

Hello @mike_r_black,

So I tried this query:

MATCH (t:Track)
RETURN t.name, 
    SIZE( (t)-[:HAS_GENRE]->(:Genre {name: 'rock'}) ) > 0 AS has_rock,
    SIZE( (t)-[:HAS_GENRE]->(:Genre {name: 'metal'}) ) > 0 AS has_metal,
    SIZE( (t)-[:HAS_GENRE]->(:Genre) ) AS genre_count
ORDER BY genre_count DESC 
LIMIT 20

Unfortunately, this turned out to be even more costly than the previous one.

Maybe I am doing a bad job of explaining the problem I have. Basically, if I query for ['rock', 'metal'], I want to return the tracks that have both genres first, and then the tracks that have only one of them.

I want the query to return in less than a few hundred milliseconds.

I have already looked at match intersection, and learned about the count method from there, which obviously is too slow.

Is there any method that might help solve this problem?

mike_r_black · November 29, 2019, 5:33am

This query would return tracks that have both genres:

MATCH (g1:Genre {name: 'rock'})<-[:HAS_GENRE]-(t:Track)-[:HAS_GENRE]->(g2:Genre {name: 'metal'})
RETURN t

This would be the query for tracks that have one genre but not the other:

MATCH (t:Track)-[:HAS_GENRE]->(g2:Genre {name: 'metal'})
WHERE NOT (:Genre {name: 'rock'})<-[:HAS_GENRE]-(t)
RETURN t

Is this what you're trying to achieve? I assume you have an index on Genre.name to help speed up the seeking of the genre nodes?
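
For reference, an index like the one asked about here could be created as follows. This is a sketch only: the index syntax depends on the Neo4j version, and `genre_name` is a hypothetical index name.

```cypher
// Neo4j 3.x form:
CREATE INDEX ON :Genre(name)

// Neo4j 4.x and later use a named form instead:
// CREATE INDEX genre_name FOR (g:Genre) ON (g.name)
```

With the index in place, a lookup such as `(:Genre {name: 'rock'})` can start from an index seek rather than scanning every `Genre` node.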

shash · November 29, 2019, 5:47am

Thanks again for the answer @mike_r_black.

"I assume you have an index on Genre.name to help speed up the seeking of the genre nodes?"

Yes.

"Is this what you're trying to achieve?"

Not exactly. A few follow-up questions:

If the query to find tracks with both genres returns nothing, I want to find tracks that have either of the genres. These queries might not cover that use case, right?
What if there are more than 2 genres I want to query on? That would definitely require something like IN, right?
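
Both follow-up questions can be covered by a single query that ranks tracks by how many of the requested genres they have. The following is a minimal sketch, not a tuned solution: `$genres` is a hypothetical query parameter, and it assumes at most one `HAS_GENRE` relationship between a given track and genre (otherwise use `count(DISTINCT g)`).

```cypher
// $genres is a hypothetical parameter, e.g. ['rock', 'metal', 'punk'].
// Start from the few matching Genre nodes instead of from every Track.
MATCH (g:Genre)
WHERE g.name IN $genres
MATCH (g)<-[:HAS_GENRE]-(t:Track)
// One row per (track, matching genre); count them per track.
WITH t, count(g) AS score
ORDER BY score DESC
LIMIT 20
RETURN t.name AS name, score
```

Tracks that have every requested genre get the highest score and sort first, followed by tracks with fewer matches, so no separate fallback query is needed. Whether this meets the latency target depends on how many tracks carry the requested genres.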

michael.hunger · November 29, 2019, 9:43pm

It's best to access properties as late as possible.
So a simplistic optimization would be:

MATCH (t:Track)-[:HAS_GENRE]->(g:Genre) 
WHERE g.name IN ['rock', 'metal']
WITH t, collect(g) AS genres, count(g) AS score ORDER BY score DESC LIMIT 20
RETURN t.name, [g in genres | g.name] as genres, score

But I presume you actually want to have all genres of each track, not just your 2.

MATCH (t:Track)-[:HAS_GENRE]->(g:Genre)
WHERE g.name IN ['rock', 'metal']
WITH t, size( (t)-[:HAS_GENRE]->() ) AS score
ORDER BY score DESC LIMIT 20
MATCH (t)-[:HAS_GENRE]->(g) 
RETURN t.name, collect(g.name) AS genres, score
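
Note that `score` in the last query counts all of a track's genres, not only the requested ones. If the goal is to put tracks that have both requested genres ahead of tracks that have only one, a variant along these lines may be closer. It is a sketch under the same schema assumptions as the queries above, and `tg` is just an illustrative variable name.

```cypher
// Stage 1: rank tracks by how many of the requested genres they have.
MATCH (g:Genre)
WHERE g.name IN ['rock', 'metal']
MATCH (g)<-[:HAS_GENRE]-(t:Track)
WITH t, count(g) AS score
ORDER BY score DESC
LIMIT 20
// Stage 2: only for the 20 survivors, fetch all genres and read properties.
MATCH (t)-[:HAS_GENRE]->(tg:Genre)
WITH t, score, collect(tg.name) AS genres
RETURN t.name AS name, genres, score
ORDER BY score DESC
```

Grouping on `t` rather than on `t.name` keeps two different tracks that share a name from being merged, and it defers the property reads until after the LIMIT.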
