Distributed data storage in Neo4j

Given that sharding is not straightforward for graph data structures, how does Neo4j distribute data across its servers for horizontal scaling?

It doesn't. Neo4j Causal Cluster uses full graph replication: every server holds a complete copy of the graph.

Note that upcoming releases might add support for partitioning defined at the application level.

One concept that I believe I heard mentioned at GraphConnect was the idea of reading from different replicas that will have different subgraphs in their page cache. The idea was to make it a sort of "soft sharding", where you could reach beyond your own "shard" of the data when needed but your main working set could remain in memory.

I don't have enough data to warrant this sort of optimization yet, but I could imagine it working pretty well if the subgraphs being queried don't overlap much. For example, when virtually all of a customer's data is scoped to their own account, you could route all queries for a given customer id to the same replica every time, and you're more likely to get a cache hit.
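
To make the routing idea concrete, here is a minimal sketch of how an application might pin each customer id to one replica. It is not a Neo4j feature: the replica addresses and the `replica_for` helper are made up for illustration, and it assumes your application opens its own connection to whichever address comes back.

```python
import hashlib

# Hypothetical read-replica addresses; substitute your own.
REPLICAS = [
    "bolt://replica-1:7687",
    "bolt://replica-2:7687",
    "bolt://replica-3:7687",
]


def replica_for(customer_id: str, replicas=REPLICAS) -> str:
    """Return the replica that should serve reads for this customer.

    The same customer id always maps to the same replica, so that
    customer's subgraph tends to stay warm in that replica's page cache.
    """
    # Use a stable hash. Python's built-in hash() is salted per process
    # for strings, so it would route differently after every restart.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(replicas)
    return replicas[index]


if __name__ == "__main__":
    for cid in ["cust-1", "cust-2", "cust-1"]:
        print(cid, "->", replica_for(cid))
```

Every replica still holds the full graph, so a query that reaches outside the customer's own data still works; it is just more likely to miss the cache.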

Changing how you route those queries would also take significantly less effort than re-sharding. Since you're not changing where the data physically lives, there's no real shard migration to do; you're just tweaking where you read it from. It'd run slower for a little while as the caches adjust, but I'd bet it would catch up reasonably quickly.
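
As a rough illustration of how cheap a routing change can be, the sketch below uses rendezvous (highest-random-weight) hashing, which is my own choice of technique here and not something Neo4j provides. The replica names and the `pick_replica` helper are hypothetical. With this scheme, adding a replica only re-routes the customers that land on the new replica; everyone else keeps reading from the same warm cache.

```python
import hashlib


def _score(customer_id: str, replica: str) -> int:
    """Stable pseudo-random weight for a (replica, customer) pair."""
    data = f"{replica}|{customer_id}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def pick_replica(customer_id: str, replicas: list) -> str:
    """Rendezvous hashing: route to the replica with the highest score."""
    return max(replicas, key=lambda r: _score(customer_id, r))


# Hypothetical replica names, before and after adding a fourth replica.
old = ["replica-1", "replica-2", "replica-3"]
new = old + ["replica-4"]

customers = [f"cust-{i}" for i in range(1000)]
moved = [c for c in customers if pick_replica(c, old) != pick_replica(c, new)]

if __name__ == "__main__":
    # Only the customers now assigned to replica-4 change routes;
    # no data is copied anywhere, only the read target changes.
    print(f"{len(moved)} of {len(customers)} customers re-routed")
```

Only the re-routed customers see cold-cache reads while the new replica warms up, which matches the "slower for a little while" expectation above.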