How is data stored across servers?

Hi, I wanted to know whether data is stored in a single place and shared by both core servers and read replicas, or whether every server contains the whole data set, or whether the servers only store some metadata and the actual data lives somewhere else?

Hi,

Every Server in the Causal Cluster will have the exact same (and complete) data. Every database is a complete replica.
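To make that concrete, here is a rough sketch (not a definitive setup) of what the neo4j.conf of such a cluster could look like in 4.x; the hostnames and ports are made up for illustration:

```
# neo4j.conf on each of the three Core servers (example hostnames)
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.initial_discovery_members=core1:5000,core2:5000,core3:5000

# neo4j.conf on a Read Replica
dbms.mode=READ_REPLICA
causal_clustering.initial_discovery_members=core1:5000,core2:5000,core3:5000
```

Every instance, core or replica, keeps its own complete copy of the store on local disk; the read replicas simply catch up from the cores.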

Cheers,
Ron


So, for example, say we have 10 TB of data, 3 core servers, and 10 read replicas. We would have to store the complete data on each server, so we would need 10 TB of storage per server and 130 TB of storage in total. Isn't that a very big drawback?

That's no longer true with Neo4j Fabric, Ron. Read "Sharding Graph Data with Neo4j Fabric" in the Developer Guides for more info.

While that is technically correct, it would be a naive way to look at the size of data.

Neo4j tends to store data in a normalized fashion, with relationships effectively kept pre-joined. Even in the RDBMS world, when you are looking at performance (time taken to get some data), you tend to de-normalize the data to avoid joins.

Whereas Elasticsearch and other document DBs store data completely de-normalized, since they don't care about how data is connected; each document can be stored anywhere and the data can be distributed. If you store the same data in Neo4j, it can be way smaller than the amount of storage required to hold the de-normalized copies.

See this presentation on how Adobe went from 50 TB of data in Cassandra down to 40 GB in Neo4j.

https://neo4j.com/graphconnect-2018/session/overhauling-legacy-systems-adobe

Each mechanism gives a different way to query the data. If you want to analyze how the data is connected, Neo4j can be your solution. If you just want to store data and query and retrieve documents, then document DBs will serve that purpose better.
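For example, a connectivity question like "everyone reachable from a given person within three hops" is a single pattern match in Cypher; the Person label, KNOWS relationship, and name property below are just made-up illustrations:

```
// Hypothetical social graph: everyone reachable from Alice within 3 hops
MATCH (a:Person {name: 'Alice'})-[:KNOWS*1..3]->(other:Person)
RETURN DISTINCT other.name;
```

Answering the same question over documents would mean repeated round-trips or heavy joins.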

Hope this gives a good overview of how Neo4j can be used. It would be a good idea to discuss how your data is structured and what its size would be in Neo4j, rather than looking at the source data sizes.


What @anthapu says is exactly right. You cannot translate disk space in a 1-to-1 fashion. Using the correct database and database model for the problem can change the disk space a lot.
Next to that, it could mean that you need far fewer servers to handle the required load.

If that is not enough, the awesome new version of Neo4j gives you the ability to shard your data.
Sharding a graph database is hard: because a graph is connected, it is difficult to split the data automatically into components that belong together.

Neo4j version 4.0 gives you (as the domain expert) the tools to determine the correct way to separate the data, if that is needed. To get more details about this, follow the link @rvanbruggen provided above.

Have fun!

Cheers,
Ron


@anthapu @rweverwijk @rvanbruggen Thanks for your help. I read up on Fabric, and a new question arises: how does Fabric work with multiple servers? Let's take an example again. Say we have 10 TB of data, 3 core servers, and 10 read replicas, and we divide the data into 10 shards of 1 TB each. Will this sharded data be stored in some common place, or will each core server and read replica have its own sharded data?

Harshit - could I ask you where you are based? I will get the local team to reach out to you then - as this is clearly about a piece of Enterprise Edition functionality that would require some kind of a license.

Look forward to hearing from you!

Cheers

Rik

Hello Rik,
I am not currently working as an employee at any organisation.
However, I have heard from experienced acquaintances how graph databases are the way forward for so many organisations, so I am currently investing my time and resources in learning them, and I hope to join one and contribute.

Thanks for the help.

Hi @anthapu @rweverwijk @rvanbruggen,
Can someone help me with my question above? I tried looking for a good answer but had no luck.

I think this is pretty well described in https://neo4j.com/docs/operations-manual/current/fabric/introduction/#_multi_cluster_deployment. Each shard will have its own set of core/edge servers - a fully redundant Neo4j cluster. These cluster members can be distributed across data centers for true availability.
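As a rough sketch (not a definitive setup), the Fabric configuration on the instance that hosts the Fabric virtual database could point at two such shard clusters like this; the hostnames and database names are made up:

```
# neo4j.conf on the instance hosting the Fabric virtual database
fabric.database.name=fabric

# Shard 1 lives in its own fully redundant cluster
fabric.graph.0.name=shard1
fabric.graph.0.uri=neo4j://shard1-cluster.example.com:7687
fabric.graph.0.database=shard1db

# Shard 2 lives in another, independent cluster
fabric.graph.1.name=shard2
fabric.graph.1.uri=neo4j://shard2-cluster.example.com:7687
fabric.graph.1.database=shard2db
```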

Having said that: what use case are you evaluating this for? This kind of setup seems extremely advanced, and rarely truly necessary.

Hi @rvanbruggen, I am a little bit confused about this. It would be very helpful if you could answer using the rough example I mentioned above.

@rweverwijk @rvanbruggen I have the same doubt regarding how data is stored across servers, and also how sharded data is stored. It would be very helpful if you could explain using a rough example.

Hi Harshit,
Say you have 10 TB of data, you want to split it into 1 TB shards, and you want a 3-node core cluster with a read replica (RR) for each shard. Then the setup would look like this.

Shard 1: 3-node core cluster + 1 RR
Shard 2: 3-node core cluster + 1 RR
.....

So, in total you will have 40 servers (10 shards × 4 servers each).

Think of each shard as an independent DB with 1 TB of data. Each DB needs to be in its own cluster; you cannot have a database that spans more than one cluster.
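Even though each shard is its own cluster, Fabric lets you query them all through one virtual database. A minimal sketch, assuming the Fabric database is called fabric, the shards are registered as graphs 0 through 9, and a hypothetical Person label:

```
// Run against the 'fabric' virtual database:
// the same count is executed on every shard and the results are combined
UNWIND range(0, 9) AS gid
CALL {
  USE fabric.graph(gid)
  MATCH (p:Person)
  RETURN count(p) AS cnt
}
RETURN sum(cnt) AS totalPersons;
```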

Thanks
Ravi

Hi anthapu,
Here you have mentioned Shard 1: 3-node core cluster + 1 RR, and the size of the shard is 1 TB. So will these 4 servers (3 cores and 1 RR) each store the whole 1 TB shard, or will that data be distributed across them?

All the nodes (3 cores and 1 RR) will have the exact same data for that shard. There is no data distribution within the cluster.

In short, think of each shard as a separate database, independent in its own right, which is a partition of one big logical database.