tldr: Big neo4j dataset, trying to load into page cache. Only a part of the db being loaded. Queries with count very slow. Sysinfo and memrec showing different db data volumes. How to load the entire graph into memory, and reduce db hit counts?
I have a 25G 3.5.12 Enterprise Neo4j Database, running on a 52G server. Currently I have set the page cache to be 28G, and heap size to be 12G in the conf file. I want to load the entire dataset into cache for fast querying.
The problem:
1. Even after warming up with apoc.warmup.run(true, true, true), my queries are still hitting the database millions of times. I have also tried the basic MATCH (n)-[r]-(m) RETURN count(r), count(n.name) etc., but the same problem persists.
In general what you'll see with the page cache is that as you do queries, data will be loaded from disk and put into the page cache as long as you have space for it in the page cache. Warmup is a good start but it doesn't guarantee to fill the page cache per se.
What I'd be curious to see is after your warmup, what Halin says about your page cache utilization %.
"In general what you'll see with the page cache is that as you do queries, data will be loaded from disk and put into the page cache as long as you have space for it in the page cache. Warmup is a good start but it doesn't guarantee to fill the page cache per se."
After working on the problem through the day I realized that my graph.db store size is indeed 4.4G, and index is 3.3G, justifying 7.7G from memrec. The rest is transaction logs. This means the entire dataset is in the page cache already?
The problem then seems to be with the count aggregation:
MATCH (t:Track)-[:HAS_GENRE]->(g:Genre)
WHERE g.name IN ['rock', 'metal']
RETURN t.name, collect(DISTINCT g.name) AS genres, count(DISTINCT g) AS score ORDER BY score DESC LIMIT 20
If I remove count and collect from here, the query returns instantly, but these aggregations are hitting db millions of times.
Any ideas on what might be going wrong here? How can I optimize this?
Notice that, even though the genre names seem to be cached, but still every single time the db is being hit millions of times, even when so much page cache is still available to use?