Causal clustering: runtime vs quorum

We were setting up a causal cluster with 4 nodes,
but the confusing part is "minimum core cluster size at runtime" vs the quorum needed for writes on a causal cluster.

Docs for that setting are here:

https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_causal_clustering.minimum_core_cluster_size_at_runtime

What this setting controls is how big the cluster has to be to stay writable; if it shrinks below that, the cluster reverts to "read only". Say you have a 3-node cluster and one node dies. You have 2 left, which is still a quorum, and minimum core cluster size at runtime is 2. Those two remaining nodes (one leader, one follower) can still process reads and writes. You're fine. Say ANOTHER failure happens. Now you're below the minimum. The remaining node will still be readable but won't process writes.
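
To make that concrete, here's a tiny Python sketch of the majority rule described above (illustrative helper names, not Neo4j code):

```python
def majority(n):
    """Smallest number of members that counts as a majority of n."""
    return n // 2 + 1

def is_writable(cluster_size, online):
    """The cluster keeps accepting writes while a majority of members are online."""
    return online >= majority(cluster_size)

print(is_writable(3, 2))  # True  -- one failure, 2 of 3 is still a majority
print(is_writable(3, 1))  # False -- second failure, reads only
```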

In general, I wouldn't recommend 4 nodes. These docs (https://neo4j.com/docs/operations-manual/current/clustering/introduction/#causal-clustering-core-servers) give an important formula: M = 2F + 1. Given M core nodes, you can tolerate F failures. If you want to tolerate 1 failure, you only need 3 nodes (example given above). If you want to tolerate 2 failures, you need 5 nodes. So going to 4 nodes doesn't buy you much extra -- in general you should stick to odd numbers of nodes and go to 5 if you need higher levels of HA. For example, in a multi-datacenter setup, you might put 3 in one data center and 2 in the other.
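
The formula is just arithmetic; here's a quick sketch of it (hypothetical helper names, only illustrating M = 2F + 1):

```python
def cores_needed(f):
    # M = 2F + 1: core members needed to tolerate F simultaneous failures
    return 2 * f + 1

def failures_tolerated(m):
    # F = (M - 1) // 2: failures a cluster of M cores can survive
    return (m - 1) // 2

print(cores_needed(1))        # 3
print(cores_needed(2))        # 5
print(failures_tolerated(3))  # 1
print(failures_tolerated(4))  # 1, the 4th node adds no extra fault tolerance
print(failures_tolerated(5))  # 2
```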

"minimum core cluster size at runtime" also has to do with cluster membership, which is related but separate from the number of nodes currently online in the cluster. The nodes that are members of the cluster (offline or not) are calculated into the number needed for quorum, and only these cluster members may take part in consensus operations (including voting in or out cluster members).

When you change the min cluster size at runtime you're telling the cluster that the membership list cannot drop below a certain number. So if you set this to 5, and cluster members started going offline for whatever reason, the membership list would never be able to drop below 5: the nodes may be unreachable, but no vote would take place to vote out those members and reduce the cluster size. And if you lost quorum, only those offline cluster members could be restored to regain it; you wouldn't be able to add new nodes to the cluster.
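
Here's a rough sketch of that distinction, under the simplifying assumption that quorum is a majority of the membership list rather than of the nodes currently online:

```python
def majority(n):
    return n // 2 + 1

def has_quorum(membership_size, online):
    # Quorum is a majority of cluster *members*, whether they are online or not.
    return online >= majority(membership_size)

# 5 members with the minimum set to 5: offline nodes can't be voted out,
# so they keep counting toward the denominator.
print(has_quorum(5, 3))  # True  -- 3 of 5 is a majority
print(has_quorum(5, 2))  # False -- quorum lost; only restoring the original
                         #          offline members can bring it back
```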

There was a related question on this earlier; I think you'll find my answer useful.

The tl;dr is that in nearly all cases the default is fine, you shouldn't need to mess around with this property except in exceptional situations (I don't think I've seen any, working in Neo4j customer support).

Thanks so much for the response.

Now it makes perfect sense to me.

@andrew - I have a query on "minimum core cluster size at runtime". If I'm not wrong, the minimum value we can set is 2; in that case the number of members in the cluster cannot drop below 2, which then decides the voting with the available members, but if we don't have a minimum of 3 nodes in the cluster, voting won't take place.

And does it mean 2f+1 and "minimum core cluster size at runtime" are directly proportional to each other? :P With 2-node failure tolerance the cluster should have 5 nodes and "minimum core cluster size at runtime" set to 3 (which maintains the available members for voting). The same goes for 4 nodes and 1-failure tolerance, as mentioned by @david.allen.

While the minimum number you can set for min core size at runtime is 2, it really doesn't buy you anything over the default of 3. The number required for quorum is 2 members in both cases (since 1 out of 2 is not a majority, it must be 2 out of 2).

If you set min core cluster size at runtime to 2, then if you start with 3 nodes, and node 3 becomes unreachable, the members can vote out that member (we still have quorum, and we're allowed to make that vote because we're still above the min core cluster size at runtime). The cluster is now a 2-node cluster where the majority quorum is 2 nodes. We can still handle writes, we can still vote in any new node. If we lose 1 more node we lose quorum.

The scenario is similar in capability if we use the default min core cluster size at runtime of 3 (though without the ability to vote out a member and scale down the cluster size). If we have a 3-node cluster, and node 3 becomes unreachable, no vote-out takes place (since that would bring us below the min core cluster size at runtime of 3), and majority quorum is 2 nodes (so we still have quorum and can handle writes and vote in any new node). If we lose 1 more node we lose quorum.
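
Here's a rough side-by-side of the two scenarios as a toy model (hypothetical functions, not actual Neo4j logic; it assumes a vote-out only happens when quorum holds and membership wouldn't drop below the configured minimum):

```python
def majority(n):
    return n // 2 + 1

def after_one_failure(membership, min_at_runtime):
    """One member of a 3-node cluster goes offline; vote it out only if quorum
    holds and doing so wouldn't drop membership below the configured minimum."""
    online = membership - 1
    can_vote_out = online >= majority(membership) and (membership - 1) >= min_at_runtime
    new_membership = membership - 1 if can_vote_out else membership
    return new_membership, majority(new_membership)

print(after_one_failure(3, 2))  # (2, 2): member voted out, quorum is still 2
print(after_one_failure(3, 3))  # (3, 2): no vote-out, quorum is still 2
# Either way, one more failure loses quorum; the difference is only in recovery.
```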

The only real difference is the means by which we can recover when we lose quorum.

If we have a cluster size of 2 (minimum 2) and we lose one of those two nodes, we've lost quorum and we need to restore that single offline node to regain it. That's the only option.

Vs when we have a cluster size of 3 (minimum 3, but 1 offline and unable to be voted out because that would drop us below the minimum) and we lose one of those two remaining nodes, we've lost quorum (we have 1 of 3 nodes online) and we need to restore either of those 2 offline nodes to regain it. We have a bit more flexibility on restoration, whether to restore offline node 2 or offline node 3.

As for failures: as long as you keep quorum (for the current cluster membership size) and stay above the min core cluster size at runtime, the cluster will scale down and the number required for quorum will change, which may let it tolerate additional failures without losing quorum.

For example, for a cluster of size 5, we can tolerate 2 simultaneous failures (2f+1 = 5, so f = 2) without losing quorum. If we lost 3 at the same time, we would lose quorum (losing write capability and the ability to add or remove cluster members). But if we lose 2 members, since we still have quorum, that quorum can vote out those members from the cluster, downscaling the cluster to a new membership size of 3. Quorum for a cluster of 3 is 2 nodes, so at that point we can now afford 1 failure without losing quorum (2f + 1 = 3, so f = 1).
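
A toy simulation of that scale-down reasoning (again just illustrating the arithmetic, not actual cluster behaviour):

```python
def majority(n):
    return n // 2 + 1

def fail_and_scale_down(membership, failures):
    """Fail `failures` members at once; if quorum survives (and the configured
    minimum allows it), the survivors vote the failed members out."""
    online = membership - failures
    if online < majority(membership):
        return membership, False   # quorum lost, no vote can take place
    return online, True            # membership shrinks to the survivors

print(fail_and_scale_down(5, 2))  # (3, True): quorum of 3/5 held, now a 3-node cluster
print(fail_and_scale_down(3, 1))  # (2, True): the smaller cluster absorbs 1 more failure
print(fail_and_scale_down(5, 3))  # (5, False): 3 at once loses quorum immediately
```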

Basically when we have a higher (and odd) number of cluster members, the cluster is more robust, because not only can the cluster tolerate more simultaneous failures, it can also scale down as long as a majority quorum of the cluster members is maintained, which provides opportunity to tolerate additional failures.

Looking at a larger case: a cluster of size 7 can tolerate 3 simultaneous failures (losing 4 at once would make it lose quorum), but if we have 3 or fewer failures at a time, or the failures occur in a staggered manner, then as long as quorum is maintained throughout, the cluster will scale down and will be able to tolerate more total failures.
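
Running the 7-node case through the same kind of toy model (staggered failures, assuming the min core cluster size at runtime never blocks the vote-out):

```python
def majority(n):
    return n // 2 + 1

def fail_and_scale_down(membership, failures):
    online = membership - failures
    if online < majority(membership):
        return membership, False   # quorum lost
    return online, True            # survivors vote the failed members out

size = 7
for failures in (3, 1, 1):                  # staggered failures
    size, ok = fail_and_scale_down(size, failures)
    print(size, ok)                         # 4 True, then 3 True, then 2 True
# Five staggered failures absorbed in total, vs only 3 if they all hit at once.
```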

One key thing to keep in mind is that for a node to participate in a raft operation (commit or vote in/out, requiring quorum), all previous raft events must have been committed. So when we have quorum and vote in or out members, a success implicitly guarantees that all previous raft operations (commits specifically) have been committed by the nodes that participated in that event.