Hi, I have been digging around, trying to understand the implications of using multiple databases as part of a segregated data tenancy business model.
To date we have approached this with labels in a single database. One advantage is that we can also connect the different tenancies to enrich the graph, without compromising the data stewardship protection the labels provide.
With the arrival of Neo4j 4, a new possibility presented itself: hard separation of data in the form of separate databases, which can even be a compelling offering in a SaaS solution. We are considering it even though we would no longer be able to create direct relationships between tenancies, which can make our work more difficult if we want to provide features that depend on those relationships. Moreover, I am assuming that running graph algorithms across multiple databases won't be directly possible, which can lead to challenges when we want to extract value from a holistic view of the solution.
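To make the two models concrete, here is a minimal sketch of how the same tenant-scoped count would be addressed in each. The function names, the tenant label `TenantA`, the database name `tenant-a`, and the use of the default `neo4j` database for the label model are all my own illustrative assumptions, not anything from the docs. It only builds the (database, query) pair; with the official Python driver, the database name would be passed to `driver.session(database=...)`.

```python
import re

# Labels cannot be passed as Cypher parameters, so they are validated
# and interpolated. Both patterns below are deliberately strict and hypothetical.
_LABEL = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")
_DB_NAME = re.compile(r"^[a-z][a-z0-9.-]{2,62}$")


def label_scoped(tenant_label, node_label):
    """Single database: the tenant is an extra label on every node."""
    if not (_LABEL.match(tenant_label) and _LABEL.match(node_label)):
        raise ValueError("invalid label")
    # Assumption: all tenants live in the default database named "neo4j".
    return "neo4j", f"MATCH (n:{tenant_label}:{node_label}) RETURN count(n) AS c"


def database_scoped(tenant_db, node_label):
    """One database per tenant: the tenant is chosen when the session is opened."""
    if not (_DB_NAME.match(tenant_db) and _LABEL.match(node_label)):
        raise ValueError("invalid database name or label")
    return tenant_db, f"MATCH (n:{node_label}) RETURN count(n) AS c"


if __name__ == "__main__":
    print(label_scoped("TenantA", "Person"))
    print(database_scoped("tenant-a", "Person"))
```

In the label model the tenant filter is part of every query, so forgetting it leaks data. In the database model the query itself is tenant-agnostic, but a single query cannot span two tenants.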
Now, with all of those considerations in mind, there are a few undocumented implications of going down that route (AFAIK, everything that is documented is here: Chapter 5. Manage databases):
How does this impact the number of open file descriptors on Linux? This is important to know because we have only one Neo4j instance, and it makes sense for now and the near future that it stays this way. I predict we will have to maintain many databases under one instance, with the number growing at a fairly fast pace.
How does this impact memory consumption? Does having multiple databases add significant overhead? If so, under what conditions does this overhead increase? Do the memory considerations explained in 13.1. Memory configuration apply entirely per database, or does the instance itself account for the bulk of memory consumption?
What's the impact on performance? There will be I/O on a separate file structure for each database, possibly asynchronously, which also has implications for CPU usage during I/O operations. Are there any benchmarks already available?
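On the file descriptor question, this is roughly how I would measure it myself while adding databases one at a time. It is a minimal, Linux-only sketch that reads `/proc`; the function names are mine, and using the script's own PID is just a stand-in for the Neo4j server PID. Reading another user's process usually needs the same user or root.

```python
import os


def count_open_fds(pid):
    """Count the file descriptors currently open in a process (Linux only)."""
    # Each entry in /proc/<pid>/fd is one open descriptor.
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_soft_limit(pid):
    """Return the soft 'Max open files' limit of a process as a string.

    The value is kept as a string because it can be 'unlimited'.
    Returns None if the line is not found.
    """
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            if line.startswith("Max open files"):
                # Columns after the name: soft limit, hard limit, units
                return line.split()[3]
    return None


if __name__ == "__main__":
    # Assumption: replace os.getpid() with the PID of the Neo4j server process.
    pid = os.getpid()
    print(f"pid={pid} open_fds={count_open_fds(pid)} soft_limit={fd_soft_limit(pid)}")
```

Sampling this before and after creating each database would at least give a per-database figure to compare against the limit.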
All of these have a huge impact and I would love to make the most informed decision possible. So far I have been unable to find any online resources that explore them.
Finally, what's the picture of labels vs. databases, resource-wise? I have a feeling that in practice, on a single Neo4j instance on the same hardware, labels will always perform better than multiple databases. This is very important as we want to keep our infra costs low at the moment.
I already understand that multiple databases are easier to scale and inherently more secure, so that is not a point I would like to discuss here.
Thanks for any contributions to the discussion.