I must admit I'm quite baffled by this. I've gone through quite a bit of work adding some extra nodes to help make - what would otherwise be a really bad range query - profile quite nicely. But it runs incredibly slowly the first time, and subsequently quite fast. I presume this is due to disk access and loading the page cache, but still, the number of nodes being examined is quite small.
The example query is at the end of this post, along with the plan output from the profile.
To give some context, this is a large graph (~5B nodes and ~20B relationships) of genomics data. About 2 TB of data total in the store.
:OverlapRegion is just a fixed 100 kb region of the genome (indexed by chromosome) that features can be linked from. Since Neo4j doesn't have true compound indexes, this allows us to quickly zero in on a region and then fan out to only the features within that region.
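For context, the linking step looks roughly like this. This is a hedged sketch only: the 100 kb bucket arithmetic and the `chromosome`/`position` property names are assumptions based on the description above, and the real load process surely differs.

```cypher
// Illustrative sketch: bucket a variant's position into its 100 kb
// region and link it. Features spanning a bucket boundary would need
// to be linked to every bucket they touch (not shown).
MATCH (v:Variant {chromosome: '11'})
WITH v, toInteger(v.position / 100000) * 100000 AS bucket
MERGE (o:OverlapRegion {chromosome: '11', start: bucket, end: bucket + 100000})
MERGE (o)-[:OVERLAPS_VARIANT]->(v);
```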
In the query below, an index on chromosome correctly narrows things down to just the 1,351 :OverlapRegion nodes to examine, finding the 6 that are either within - or envelop - the range I care about. That step should be extremely fast, and it is.
Next, it finds all the variant relationships from those regions (also limited to the range in question). There aren't that many variants linked to them: it went through 11,211 relationships to find the 9,837 that were within the bounds in question.
I would expect this to be blazing fast on a 16-CPU machine with 64 GB of RAM. Admittedly, this is community edition, so only 4 CPUs are being used, but still. 1,351 + 11,211 = 12,562 nodes being filtered shouldn't take 10 seconds on a 486DX, even if I forgot to press the turbo button.
Again, every subsequent run of the query takes ~20 ms, so I have to believe this is a caching/disk-access issue. There are over 50 million :Variant nodes in the graph, so it's possible Neo4j is seeking all over the place just to read the 11,211 it needs. But I'm unsure how to prove that (or - more importantly - how to rectify it if so). Similarly, once it's cached, all I have to do is run the same query on a different chromosome and it's ungodly slow again. Even going back to the first chromosome, it's as if Neo4j unloaded what it previously knew and is slow once more.
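One way to at least observe the faulting (an assumption on my part that this applies to your version - it works on Neo4j 3.x, where kernel statistics are exposed over JMX) is to snapshot the page cache counters before and after the slow run:

```cypher
// Page cache hit/fault counters via JMX (Neo4j 3.x; the bean name
// may differ between versions). A large jump in faults between two
// snapshots of this query would confirm the cold-cache theory.
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Page cache")
YIELD name, attributes
RETURN name, attributes;
```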
And our configuration is:
dbms.memory.heap.initial_size=23g
dbms.memory.heap.max_size=23g
dbms.memory.pagecache.size=27400m
With 27 GB of page cache, I should be able to hold every :Variant node - and more - in memory without a second thought, with plenty of room to spare for whatever else comes along. Assuming this is a cache issue, is there a way to tell Neo4j to pre-warm the cache and to load - and keep loaded - particular node labels in memory?
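In case it helps anyone answering: the one warm-up mechanism I'm aware of (and this is an assumption that it fits our setup) is APOC's warmup procedure, which sweeps the store files to pull their pages into the page cache:

```cypher
// Requires the APOC plugin. The three booleans ask it to also load
// property stores, dynamic (string/array) property stores, and index
// files; exact arguments vary slightly across APOC releases.
CALL apoc.warmup.run(true, true, true);
```

My understanding (worth verifying) is that this warms everything it can fit rather than pinning specific labels, and that Enterprise additionally offers a page cache warmup feature that profiles which pages are hot and reloads them after a restart.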
Any insights here are very much appreciated. Also, if anyone could comment on whether this is an issue that would be magically solved by the Enterprise edition, I'd love to know (and which feature(s) would do it), as we're considering making the leap in the coming months.
match (o:OverlapRegion)
where o.chromosome = '11'
  and ((o.start >= 11000000 and o.start < 11500000)
    or (o.end >= 11000000 and o.end < 11500000)
    or (o.start < 11000000 and o.end >= 11500000))
match (o)-[:OVERLAPS_VARIANT]->(v:Variant)
where 11000000 <= v.position < 11500000
return count(v)
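As an aside, assuming every region satisfies o.start <= o.end, the three-way overlap test above is logically equivalent to the standard two-clause interval-overlap predicate, which is a bit easier to read and for the planner to reason about:

```cypher
// Equivalent overlap test (assumes o.start <= o.end): two intervals
// overlap iff each starts before the other ends.
match (o:OverlapRegion)
where o.chromosome = '11'
  and o.start < 11500000
  and o.end >= 11000000
match (o)-[:OVERLAPS_VARIANT]->(v:Variant)
where 11000000 <= v.position < 11500000
return count(v)
```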