Hello!
I am trying to run very big database (several billions of nodes) on neo4j. I have created indexes and memrec shows:
- Lucene indexes: 0k
- Data volume and native indexes: 504g
I've got only 90GB of memory in my machine. So the question is: is it critical for indexes to fully fit into memory? Is it fine if I set about 50G for page cache?
In recent versions of Neo4j I'd probably start of with 8G for java heap (depending on your query's memory footprint), leave up some GB for the OS and dedicate 75G for pagecache.
Thank you for the fast reply!
Queries run very slow, and suspect there is not enough memory. How can I determine that there is enough pagecache on my machine?
PS I use community edition.
For slow queries, we suggest using EXPLAIN and PROFILE to look at what's going on in the query plan and ensuring you have the right indexes or constraints in place to support quick lookup of starting nodes.
You can also open a new question here on the Cypher section (including the expanded PROFILE plan) for any help streamlining them.
@andrew_bowman I have used PROFILE and it shows that it uses indexes, so I think the problem is not in the query. I suspect either too small page cache (I have set 54G) or not enough memory in the heap (I have set 18G). The queries are resource consuming, but I think I have powerful enough server (24 cores and 90G mem).
Can you provide the full query? We might see a way to troubleshoot based on that.
here is the query:
match
(r:Resource)-[:ns1__P31]-(:Resource{uri:"http://www.wikidata.org/entity/Q5"}),
(s)-[:ns8__P2048]-(r)-[:ns1__P21]-(g),
(s)-[:ns7__P2048]-(v),
(v)-[:ns0__quantityNormalized]-(q),
(q)-[:ns0__quantityUnit]-(:Resource{uri:"http://www.wikidata.org/entity/Q11573"}),
(r)-[:ns1__P106]-(o:Resource)
where q.ns0__quantityAmount < 3
with g.rdfs__label as gender, o.rdfs__label as occupation, avg(q.ns0__quantityAmount) as height, count(*) as cnt
where cnt > 100
return gender, occupation, height, cnt
order by height desc
It runs for about 600 secs for the first time and for about 30 for seconds further.
Using explicit labels for the variables should not really affect the speed, because all the nodes are of the same label 'Resource'
a simpler query
match
(o:Resource)-[:ns10__P2044]->(h:Resource)
where h.ns0__quantityAmount > 8000
match (r:Resource)-[:ns1__P31]->(:Resource{uri:"http://www.wikidata.org/entity/Q8502"}),
(r:Resource)-[:ns8__P2044]->(o:Resource)
return r.rdfs__label, h.ns0__quantityAmount as height, r.ns1__P625
order by height desc
runs for 78 secs for the first time.
Note that this is a test server and there is no load on it.
For that last query, can you expand all elements of the query plan (double-down arrow in the lower right corner of the result frame) and then add the expanded plan? That can better help associate steps of the query plan with certain parts of your query.
So the filter at the end, applying h:Resource
(and likely the `ns0__quantityAmount < 8000) resulted in the biggest drop in rows, from 313.4k to 16.
Out of curiosity, what happens if you use this query instead? Can you PROFILE this and give us the plan and timing?
MATCH (o:Resource)-[:ns10__P2044]->(h:Resource)
where h.ns0__quantityAmount > 8000
RETURN count(*)
I'm just curious what rows look like if we start here.
Also, do you have an index on :Resource(ns0__quantityAmount)? And if not, would it make sense to have one here, or do you not lookup by this property often enough to justify it?
As you can see from the profiling I have an index for quantity amount. You can also see that there are only 69 entities from 3 billion which satisfy the criteria.
First run is about 60 seconds, following are about 2 secs.
Note. I use neo4j 3.5.5
Maybe an important remark.
all the nodes have label Resource
, but they have different sets of properties, for example quantityAmount exists only in 1 302 670 nodes out of 3 billion. So I expect from indexes to work really fast, as they should not contain information about nodes which do not have corresponding properties.
Correct, if a :Resource node does not have an indexed property, it won't be present in that index, so that will work well for selectivity.
Okay, this looks promising, let's see if we can adjust the query to allow the planner to start here.
We're going to add a join hint on the resource with the uri, which should ensure we can expand to this node, rather than using that node as the starting point and expanding from it.
Try this out, verify that it starts with a NodeIndexSeekByRange, and add the profile plan and timing. I think this should work better.
MATCH (o:Resource)-[:ns10__P2044]->(h:Resource)
WHERE h.ns0__quantityAmount > 8000
MATCH (o)<-[:ns8__P2044]-(r:Resource)-[:ns1__P31]->(end:Resource{uri:"http://www.wikidata.org/entity/Q8502"})
USING JOIN ON end
RETURN r.rdfs__label, h.ns0__quantityAmount as height, r.ns1__P625
ORDER BY height desc
Keep in mind also that nodes can be multi-labeled. You mentioned that certain :Resource nodes have different sets of properties, that suggests that these nodes can serve different roles in certain contexts. You might want to consider if it makes sense to add appropriate labels to these nodes so you can address them in these different contexts, which should also serve to narrow down the selectivity and potentially speed up your index lookups and label filtering.
I am not sure if it worked. The query plan seems the same as earlier. I've got runing time of 5 sec for this query, but I think it is because it was already in cache.
I also got this warning from the engine:
The database was unable to plan a hinted join.
The hinted join was not planned. This could happen because no generated plan contained the join key, please try using a different join key or restructure your query. (hinted join key identifier is: end)
I do not really understand what is meant by no generated plan contained the join key
.
Adding more labels should definitely help, but in many cases it will not make the search narrower (as in this query).
So I try to understand why is it slow for not so large indexes. For example the query with count we made before even with not so optimal execution plan should be performed in hundreds if not dozens of milliseconds but not 60 sec (in my subjective feeling). That is why I started to suspect some problems with insufficient memory.
Odd that it's having trouble with that join hint. Let's give this one a try instead, and if it doesn't work see if it will accept it with both hints:
MATCH (o:Resource)-[:ns10__P2044]->(h:Resource)
USING index h:Resource(ns0__quantityAmount)
WHERE h.ns0__quantityAmount > 8000
MATCH (o)<-[:ns8__P2044]-(r:Resource)-[:ns1__P31]->(end:Resource{uri:"http://www.wikidata.org/entity/Q8502"})
//USING JOIN ON end
RETURN r.rdfs__label, h.ns0__quantityAmount as height, r.ns1__P625
ORDER BY height desc
That worked. First time running time is about 2 sec.
Both hints have the same time and execution plan.
So overall, do you think it is ok with memory and I should just adjust queries?
Do you think we can improve over 2 sec (not changing the db structure/adding multilabels), maybe using some special configuration property?
Fixing up the queries as you just did is the main thing. Remember that heap memory is like query workspace memory, so by streamlining the query and minimizing the db work, that means less activity in the heap, thus lower heap pressure.
As for the pagecache, that's going to be mostly for the native index cache and graph traversal. Now given with the sheer number of nodes, and the indexes on nodes of those labels, it's going to be challenging to be able to cover all of that with the pagecache, but in general I think the more selective you can make your indexes the better, and if you can figure out narrower labels to add in addition to :Resource, and add appropriate indexes, it should work out better for your lookups. And by narrowing down to a smaller set of rows as early as possible in the query, that should mean less graph data needs to be queried from the pagecache, which hopefully should mean less need for paging through the cache.
As always, if you can get more memory to your server, the better, but also remember that if he majority of your queries are hitting common areas of your graph, that should be good pagecache usage and fewer cache misses.
As for other query improvement ideas, you could try this at the beginning and see how it runs:
MATCH (h:Resource)
WHERE (h)<-[:ns10__P2044]-()
WITH h, true as ignored
...
After that section you can add add your WHERE clause back in, but we won't need the index hint as we're not performing an index lookup. What we're trying to figure out here is if looking up the number of :Resource nodes with an incoming relationship of this type is more efficient than the index lookup. It may not be, as you will have to touch all :Resource nodes to filter to your results, but a degree lookup per node is very cheap as well, so give it a try and see how selective this is comparatively.
Ok. Great! Thank you so much for so awesome advice and support! You do a great job!