How is data stored across servers?

Hi, I wanted to know whether data is stored in a single place and shared by both core servers and read replicas, or whether every server contains the whole data set, or whether the servers only store some metadata and the actual data lives somewhere else?

Hi,

Every Server in the Causal Cluster will have the exact same (and complete) data. Every database is a complete replica.
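To make that concrete, here is a rough sketch (not a definitive setup) of what the neo4j.conf of such a cluster could look like in 4.x; the hostnames and ports are made up for illustration:

```
# neo4j.conf on each of the three Core servers (example hostnames)
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.initial_discovery_members=core1:5000,core2:5000,core3:5000

# neo4j.conf on a Read Replica
dbms.mode=READ_REPLICA
causal_clustering.initial_discovery_members=core1:5000,core2:5000,core3:5000
```

Every instance, core or replica, keeps its own complete copy of the store on local disk; the read replicas simply catch up from the cores.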

Cheers,
Ron


So, for example, say we have 10 TB of data, 3 core servers, and 10 read replicas. We would have to store the complete data on each server, so we would need 10 TB of storage per server and 130 TB of storage in total. Isn't that a very big drawback?

That's no longer true with Neo4j Fabric, Ron. Read "Sharding Graph Data with Neo4j Fabric" in the Developer Guides for more info.

While that is technically correct, it would be a naive way to look at the size of data.

Neo4j tends to store data in a normalized fashion, with relationships effectively kept pre-joined. Even in the RDBMS world, when you are looking at performance (time taken to get some data), you tend to de-normalize the data to avoid joins.

Whereas Elasticsearch and other document DBs store data completely de-normalized, since they don't care about how data is connected; each document can be stored anywhere and the data can be distributed. If you store the same data in Neo4j, it can be way smaller than the amount of storage required to hold the de-normalized copies.

See this presentation on how Adobe went from 50 TB of data in Cassandra down to 40 GB in Neo4j.

https://neo4j.com/graphconnect-2018/session/overhauling-legacy-systems-adobe

Each mechanism gives a different way to query the data. If you want to analyze how the data is connected, Neo4j can be your solution. If you just want to store data and query and retrieve documents, then document DBs will serve that purpose better.
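For example, a connectivity question like "everyone reachable from a given person within three hops" is a single pattern match in Cypher; the Person label, KNOWS relationship, and name property below are just made-up illustrations:

```
// Hypothetical social graph: everyone reachable from Alice within 3 hops
MATCH (a:Person {name: 'Alice'})-[:KNOWS*1..3]->(other:Person)
RETURN DISTINCT other.name;
```

Answering the same question over documents would mean repeated round-trips or heavy joins.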

Hope this gives a good overview of how Neo4j can be used. It would be a good idea to discuss how your data is structured and what its size would be in Neo4j, rather than looking at the source data sizes.


What @anthapu says is exactly right. You cannot translate disk space in a 1-to-1 fashion. Using the correct database and database model for the problem can change the disk space a lot.
Next to that, it could mean that you need far fewer servers to handle the required load.

If that is not enough, the awesome new version of Neo4j gives you the ability to shard your data.
Sharding a graph database is hard: because a graph is connected, it is difficult to split the data automatically into components that belong together.

Neo4j version 4.0 gives you (as the domain expert) the tools to determine the correct way to separate the data, if that is needed. To get more details about this, follow the link @rvanbruggen provided above.

Have fun!

Cheers,
Ron


@anthapu @rweverwijk @rvanbruggen Thanks for your help. I read up on Fabric, and a new question arises: how does Fabric work with multiple servers? Let's take an example again. Say we have 10 TB of data, 3 core servers, and 10 read replicas, and we divide the data into 10 shards of 1 TB each. Will this sharded data be stored in some common place, or will each core server and read replica have its own sharded data?

Harshit - could I ask you where you are based? I will get the local team to reach out to you then - as this is clearly about a piece of Enterprise Edition functionality that would require some kind of a license.

Look forward to hearing from you!

Cheers

Rik

Hello Rik,
I am not currently working as an employee at any organisation.
However, I have heard from experienced acquaintances how graph databases are the way forward for so many organisations, so I am currently investing my time and resources in learning them, and I hope to join one and contribute.

Thanks for the help.

Hi @anthapu @rweverwijk @rvanbruggen,
Can someone help me with my question above? I tried looking for a good answer but had no luck.

I think this is pretty well described in https://neo4j.com/docs/operations-manual/current/fabric/introduction/#_multi_cluster_deployment. Each shard will have its own set of core/edge servers - a fully redundant Neo4j cluster. These cluster members can be distributed across data centers for true availability.
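As a rough sketch (not a definitive setup), the Fabric configuration on the instance that hosts the Fabric virtual database could point at two such shard clusters like this; the hostnames and database names are made up:

```
# neo4j.conf on the instance hosting the Fabric virtual database
fabric.database.name=fabric

# Shard 1 lives in its own fully redundant cluster
fabric.graph.0.name=shard1
fabric.graph.0.uri=neo4j://shard1-cluster.example.com:7687
fabric.graph.0.database=shard1db

# Shard 2 lives in another, independent cluster
fabric.graph.1.name=shard2
fabric.graph.1.uri=neo4j://shard2-cluster.example.com:7687
fabric.graph.1.database=shard2db
```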

Having said that: what use case are you evaluating this for? This kind of setup seems extremely advanced, and rarely truly necessary.

Hi @rvanbruggen, I am a little bit confused about this. It would be very helpful if you could answer using the rough example I mentioned above.

@rweverwijk @rvanbruggen I have the same doubt regarding how data is stored across servers, and also how sharded data is stored. It would be very helpful if you could explain using a rough example.

Hi Harshit,
Say you have 10 TB of data, you want to split it into 1 TB shards, and you want a 3-node core cluster with a read replica (RR) for each shard. Then the setup would look like this.

Shard 1: 3-node core cluster + 1 RR
Shard 2: 3-node core cluster + 1 RR
.....

So, in total you will have 40 servers (10 shards × 4 servers each).

Think of each shard as an independent DB with 1 TB of data. Each DB needs to be in its own cluster; you cannot have a database that spans more than one cluster.
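Even though each shard is its own cluster, Fabric lets you query them all through one virtual database. A minimal sketch, assuming the Fabric database is called fabric, the shards are registered as graphs 0 through 9, and a hypothetical Person label:

```
// Run against the 'fabric' virtual database:
// the same count is executed on every shard and the results are combined
UNWIND range(0, 9) AS gid
CALL {
  USE fabric.graph(gid)
  MATCH (p:Person)
  RETURN count(p) AS cnt
}
RETURN sum(cnt) AS totalPersons;
```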

Thanks
Ravi

Hi anthapu,
Here you have mentioned Shard 1: 3-node core cluster + 1 RR, and the size of the shard is 1 TB. So will these 4 servers (3 cores and 1 RR) each store the whole 1 TB shard, or will that data be distributed across them?

All the nodes (3 cores and 1 RR) will have the exact same data for that shard. There is no data distribution within the cluster.

In short, think of each shard as a separate database, independent in its own right, which is a partition of one big logical database.