Memory and Performance implications of a multiple database deployment of neo4j 4

francofs
Node Link

Hi, I have been digging around, trying to understand the implications of using multiple databases as part of a segregated data-tenancy business model.

To date we have approached this using labels under a single database. This has the advantage that we can still connect the different tenancies to enrich the graph without compromising the data stewardship protection that labels provide.

With the arrival of neo4j 4, a new possibility presented itself: hard separation of data in the form of separate databases, which can even be a compelling offering in a SaaS solution. We are considering it even though we would no longer be able to create direct relationships between tenancies, which would make our work more difficult if we want to provide features that depend on those relationships. Moreover, I am assuming that running graph algos across multiple databases won't be directly possible, which could cause problems later when we want to extract value from a holistic view of the solution.
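To make the two tenancy models concrete, here is a minimal Python sketch of how a tenant-scoped query might be routed under each one. The helper names (`label_scoped`, `database_scoped`, `_check_tenant`) and the name validation rule are hypothetical and only for illustration; the only Neo4j-specific assumption is that the default database is called `neo4j`.

```python
import re

# Hypothetical, deliberately conservative rule for tenant names.
# This is NOT Neo4j's own naming rule for labels or databases.
_TENANT_NAME = re.compile(r"^[A-Za-z][A-Za-z0-9]*$")


def _check_tenant(tenant):
    """Reject names that are unsafe to interpolate into a query."""
    if not _TENANT_NAME.match(tenant):
        raise ValueError(f"unsupported tenant name: {tenant!r}")
    return tenant


def label_scoped(tenant, default_db="neo4j"):
    """Label model: one shared database, the tenant is a node label."""
    label = _check_tenant(tenant)
    query = f"MATCH (n:`{label}`) RETURN count(n) AS nodes"
    return {"database": default_db, "query": query}


def database_scoped(tenant):
    """Multi-database model: the tenant selects the database itself."""
    database = _check_tenant(tenant).lower()
    return {"database": database, "query": "MATCH (n) RETURN count(n) AS nodes"}
```

With the official Python driver, the `database` value would be handed to `driver.session(database=...)` and the `query` to `session.run(...)`. In the label model every query has to carry the tenant label, while in the database model isolation comes from the session and a single query cannot span two tenants.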

Now, with all of those considerations in mind, there are a few undocumented implications of going down that route (AFAIK, everything is here: Chapter 5. Manage databases):

  1. How does this affect the number of open file descriptors in Linux? This is important to know because we have only one neo4j instance, and it makes sense for it to stay that way for now and the near future. I expect we will have to maintain many databases under one instance, growing at a fairly fast pace.

  2. How does this impact memory consumption? Does having multiple databases add significant overhead? If so, under what conditions does this overhead increase? Do the memory considerations explained in 13.1. Memory configuration apply 100% per database, or does the instance itself represent the bulk of memory consumption?

  3. What's the impact on performance? Considering that there will be I/O on a different file structure, possibly asynchronously, there are also implications for CPU usage during I/O operations. Are there any benchmarks already available?
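For question 1, in the absence of documented numbers, one option is to measure it directly. The following is a minimal sketch, assuming a Linux host with `/proc` mounted; the function names are hypothetical, and it says nothing about how many files neo4j actually opens per database.

```python
import os


def count_open_fds(pid="self"):
    """Count the entries in /proc/<pid>/fd (Linux only)."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_soft_limit(pid="self"):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits.

    The value is returned as a string because it may be 'unlimited'.
    """
    with open(f"/proc/{pid}/limits") as fh:
        for line in fh:
            if line.startswith("Max open files"):
                # Columns: Max open files <soft> <hard> <units>
                return line.split()[3]
    return None
```

Calling `count_open_fds(<neo4j pid>)` before and after creating a database, and again under load, would show the per-database delta against `fd_soft_limit(<neo4j pid>)`. Reading another process's `/proc` entries requires running as the same user or as root.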

All of these have a huge impact and I would love to make the most informed decision possible. So far I have been unable to find any online resources that explore these questions.

Finally, what's the picture of labels vs. databases, resource-wise? I have a feeling that in practice, on a single neo4j instance on the same hardware, labels will always be more performant than multi-database. This is very important as we want to keep our infra costs low at the moment.
I already understand that using multiple databases makes it easier to scale and is inherently more secure, so that is not a point I would like to discuss here.

Thanks for any contributions to the discussion.
Best Regards,

Fábio

2 REPLIES

YMA-MDL
Node

Hi Fabio, did you get an answer, maybe through other media/discussions?

Hi @YMA-MDL, @jim.webber replied to me in his 4.0 GA announcement: Introducing Neo4j Graph Database 4.0 [GA Release]

But it is still hard to grasp, as projecting shared vs. unshared resources is difficult in an initial project phase, where these are mostly unknowns and metrics are mostly guesswork.

What I would love to see are a few example scenarios where one approach has the advantage over the other. It would then be easier to correlate them with my use cases and to balance the trade-offs more effectively (including functional ones that follow from disconnected graphs).

Given that hardware comes in many flavors and sizes, being able to visualize at which point the "unshared" resources could become a bottleneck and start tipping the performance needle towards a single database would be most insightful.

So to conclude, while this was answered, I still feel there should be a better guide on this area in the Neo4j docs.