Causal clustering: runtime vs quorum

We were setting up a causal cluster with 4 nodes,
but the confusing part is "minimum core cluster size at runtime" vs the quorum needed for writes on a causal cluster.

Docs for that setting are here:

https://neo4j.com/docs/operations-manual/current/reference/configuration-settings/#config_causal_clustering.minimum_core_cluster_size_at_runtime

What this setting controls is how big the cluster has to be to stay writable; if it shrinks below that, the cluster reverts to "read only". Say you have a 3-node cluster and one node dies. You have 2 left, which is still a quorum, and minimum core cluster size at runtime is 2. Those two remaining nodes (one leader, one follower) can still process reads and writes. You're fine. Say ANOTHER failure happens. Now you're below the minimum. The remaining node will still be readable but won't process writes.
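
To make that concrete, here's a tiny Python sketch of the majority rule described above (illustrative helper names, not Neo4j code):

```python
def majority(n):
    """Smallest number of members that counts as a majority of n."""
    return n // 2 + 1

def is_writable(cluster_size, online):
    """The cluster keeps accepting writes while a majority of members are online."""
    return online >= majority(cluster_size)

print(is_writable(3, 2))  # True  -- one failure, 2 of 3 is still a majority
print(is_writable(3, 1))  # False -- second failure, reads only
```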

In general, I wouldn't recommend 4 nodes. These docs (https://neo4j.com/docs/operations-manual/current/clustering/introduction/#causal-clustering-core-servers) give an important formula: M = 2F + 1. Given M core nodes, you can tolerate F failures. If you want to tolerate 1 failure, you only need 3 nodes (example given above). If you want to tolerate 2 failures, you need 5 nodes. So going to 4 nodes doesn't buy you much extra -- in general you should stick to odd numbers of nodes and go to 5 if you need higher levels of HA. For example, in a multi-datacenter setup, you might put 3 in one data center and 2 in the other.
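
The formula is just arithmetic; here's a quick sketch of it (hypothetical helper names, only illustrating M = 2F + 1):

```python
def cores_needed(f):
    # M = 2F + 1: core members needed to tolerate F simultaneous failures
    return 2 * f + 1

def failures_tolerated(m):
    # F = (M - 1) // 2: failures a cluster of M cores can survive
    return (m - 1) // 2

print(cores_needed(1))        # 3
print(cores_needed(2))        # 5
print(failures_tolerated(3))  # 1
print(failures_tolerated(4))  # 1, the 4th node adds no extra fault tolerance
print(failures_tolerated(5))  # 2
```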

"minimum core cluster size at runtime" also has to do with cluster membership, which is related but separate from the number of nodes currently online in the cluster. The nodes that are members of the cluster (offline or not) are calculated into the number needed for quorum, and only these cluster members may take part in consensus operations (including voting in or out cluster members).

When you change the min cluster size at runtime you're telling the cluster that the membership list cannot drop below a certain number. So if you set this to 5, and cluster members started going offline for whatever reason, the membership list would never be able to drop below 5: the nodes may be unreachable, but no vote would take place to vote out those members and reduce the cluster size. And if you lost quorum, only those offline cluster members could be restored to regain it; you wouldn't be able to add new nodes to the cluster.
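
Here's a rough sketch of that distinction, under the simplifying assumption that quorum is a majority of the membership list rather than of the nodes currently online:

```python
def majority(n):
    return n // 2 + 1

def has_quorum(membership_size, online):
    # Quorum is a majority of cluster *members*, whether they are online or not.
    return online >= majority(membership_size)

# 5 members with the minimum set to 5: offline nodes can't be voted out,
# so they keep counting toward the denominator.
print(has_quorum(5, 3))  # True  -- 3 of 5 is a majority
print(has_quorum(5, 2))  # False -- quorum lost; only restoring the original
                         #          offline members can bring it back
```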

There was a related question on this earlier; I think you'll find my answer useful.

The tl;dr is that in nearly all cases the default is fine, you shouldn't need to mess around with this property except in exceptional situations (I don't think I've seen any, working in Neo4j customer support).

Thanks so much for the response.

Now it makes perfect sense to me.

@andrew - I have a query on "minimum core cluster size at runtime". If I'm not wrong, the minimum value we can set is 2; in that case the number of members in the cluster cannot drop below 2, which then decides the voting with the available members, but if we don't have a minimum of 3 nodes in the cluster, voting won't take place.

And does it mean 2f+1 and "minimum core cluster size at runtime" are directly proportional to each other? :P With 2-node failure tolerance the cluster should have 5 nodes and "minimum core cluster size at runtime" set to 3 (which maintains the available members for voting). The same goes for 4 nodes and 1-failure tolerance, as mentioned by @david.allen.

While the minimum number you can set for min core size at runtime is 2, it really doesn't buy you anything over the default of 3. The number required for quorum is 2 members in both cases (since 1 out of 2 is not a majority, it must be 2 out of 2).

If you set min core cluster size at runtime to 2, then if you start with 3 nodes, and node 3 becomes unreachable, the members can vote out that member (we still have quorum, and we're allowed to make that vote because we're still above the min core cluster size at runtime). The cluster is now a 2-node cluster where the majority quorum is 2 nodes. We can still handle writes, we can still vote in any new node. If we lose 1 more node we lose quorum.

The scenario is similar in capability if we use the default min core cluster size at runtime of 3 (though without the ability to vote out a member and scale down the cluster size). If we have a 3-node cluster, and node 3 becomes unreachable, no vote-out takes place (since that would bring us below the min core cluster size at runtime of 3), and majority quorum is 2 nodes (so we still have quorum and can handle writes and vote in any new node). If we lose 1 more node we lose quorum.
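
Here's a rough side-by-side of the two scenarios as a toy model (hypothetical functions, not actual Neo4j logic; it assumes a vote-out only happens when quorum holds and membership wouldn't drop below the configured minimum):

```python
def majority(n):
    return n // 2 + 1

def after_one_failure(membership, min_at_runtime):
    """One member of a 3-node cluster goes offline; vote it out only if quorum
    holds and doing so wouldn't drop membership below the configured minimum."""
    online = membership - 1
    can_vote_out = online >= majority(membership) and (membership - 1) >= min_at_runtime
    new_membership = membership - 1 if can_vote_out else membership
    return new_membership, majority(new_membership)

print(after_one_failure(3, 2))  # (2, 2): member voted out, quorum is still 2
print(after_one_failure(3, 3))  # (3, 2): no vote-out, quorum is still 2
# Either way, one more failure loses quorum; the difference is only in recovery.
```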

The only real difference is the means by which we can recover when we lose quorum.

If we have a cluster size of 2 (minimum 2) and we lose one of those two nodes, we've lost quorum and we need to restore that single offline node to regain it. That's the only option.

Vs when we have a cluster size of 3 (minimum 3, but 1 offline and unable to be voted out because that would drop us below the minimum) and we lose one of those two remaining nodes, we've lost quorum (we have 1 of 3 nodes online) and we need to restore either of those 2 offline nodes to regain it. We have a bit more flexibility on restoration, whether to restore offline node 2 or offline node 3.

As for failures: as long as you keep quorum (for the current cluster membership size) and stay above the min core cluster size at runtime, the cluster will scale down and the number required for quorum will change, which may let it tolerate additional failures without losing quorum.

For example, for a cluster of size 5, we can tolerate 2 simultaneous failures (2f+1 = 5, so f = 2) without losing quorum. If we lost 3 at the same time, we would lose quorum (losing write capability and the ability to add or remove cluster members). But if we lose 2 members, since we still have quorum, that quorum can vote out those members from the cluster, downscaling the cluster to a new membership size of 3. Quorum for a cluster of 3 is 2 nodes, so at that point we can now afford 1 failure without losing quorum (2f + 1 = 3, so f = 1).
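
A toy simulation of that scale-down reasoning (again just illustrating the arithmetic, not actual cluster behaviour):

```python
def majority(n):
    return n // 2 + 1

def fail_and_scale_down(membership, failures):
    """Fail `failures` members at once; if quorum survives (and the configured
    minimum allows it), the survivors vote the failed members out."""
    online = membership - failures
    if online < majority(membership):
        return membership, False   # quorum lost, no vote can take place
    return online, True            # membership shrinks to the survivors

print(fail_and_scale_down(5, 2))  # (3, True): quorum of 3/5 held, now a 3-node cluster
print(fail_and_scale_down(3, 1))  # (2, True): the smaller cluster absorbs 1 more failure
print(fail_and_scale_down(5, 3))  # (5, False): 3 at once loses quorum immediately
```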

Basically when we have a higher (and odd) number of cluster members, the cluster is more robust, because not only can the cluster tolerate more simultaneous failures, it can also scale down as long as a majority quorum of the cluster members is maintained, which provides opportunity to tolerate additional failures.

Looking at a larger case: a cluster of size 7 can tolerate 3 simultaneous failures (losing 4 at once would make it lose quorum), but if we have 3 or fewer failures at a time, or the failures occur in a staggered manner, then as long as quorum is maintained throughout, the cluster will scale down and will be able to tolerate more total failures.
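
Running the 7-node case through the same kind of toy model (staggered failures, assuming the min core cluster size at runtime never blocks the vote-out):

```python
def majority(n):
    return n // 2 + 1

def fail_and_scale_down(membership, failures):
    online = membership - failures
    if online < majority(membership):
        return membership, False   # quorum lost
    return online, True            # survivors vote the failed members out

size = 7
for failures in (3, 1, 1):                  # staggered failures
    size, ok = fail_and_scale_down(size, failures)
    print(size, ok)                         # 4 True, then 3 True, then 2 True
# Five staggered failures absorbed in total, vs only 3 if they all hit at once.
```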

One key thing to keep in mind is that for a node to participate in a raft operation (commit or vote in/out, requiring quorum), all previous raft events must have been committed. So when we have quorum and vote in or out members, a success implicitly guarantees that all previous raft operations (commits specifically) have been committed by the nodes that participated in that event.