Two leaders in cluster after OutOfMemoryError

We run a 3 node causal cluster in Kubernetes, enterprise 3.5.3 version, with the following cluster configuration:

export NEO4J_dbms_connectors_default__advertised__address=10.132.2.${ID}
export NEO4J_causal__clustering_discovery__advertised__address=$(hostname -f):5000
export NEO4J_causal__clustering_transaction__advertised__address=10.132.2.${ID}:600
export NEO4J_causal__clustering_raft__advertised__address=$(hostname -f):7000

We have also configured readiness and liveness probes:

          initialDelaySeconds: 60
            path: /db/manage/server/core/available
            port: 7474
          initialDelaySeconds: 60
            path: /db/manage/server/core/available
            port: 7474

Today we discovered that we had inconsistent data in our test environment.

We have tracked down the root cause to be that the leader node went OutOfMemoryError 12 days ago that lead to the node disabling Hazelcast. The two follower nodes formed a 2 node cluster and elected a new leader, but according to :sysinfo the old leader still believed it was the leader of a 3 node cluster with the two other nodes as followers. The debug log from the faulty node contains some statements that contradicts the :sysinfo output though:

2019-04-12 07:59:19.231+0000 ERROR [o.n.b.r.ExecutorBoltScheduler] Unexpected error during job scheduling for session 'bolt-94767'. java.lang.OutOfMemoryError: Java heap space
java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: Java heap space
	at java.util.concurrent.CompletableFuture.encodeThrowable(
	at java.util.concurrent.CompletableFuture.completeThrowable(
	at java.util.concurrent.CompletableFuture$
	at java.util.concurrent.ThreadPoolExecutor.runWorker(
	at java.util.concurrent.ThreadPoolExecutor$
Caused by: java.lang.OutOfMemoryError: Java heap space
2019-04-12 07:59:19.552+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0x095bb3af, L:/ ! R:/] Java heap space
java.lang.OutOfMemoryError: Java heap space
2019-04-12 07:59:19.553+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=540646, gcTime=0, gcCount=0}
2019-04-12 07:59:19.553+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/]: consumed messages on the worker queue below 100, auto-read is being enabled.
2019-04-12 07:59:19.554+0000 INFO [o.n.c.c.c.s.RaftState] Leader changed from MemberId{b5353fdb} to null
2019-04-12 07:59:19.696+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=314, gcTime=224, gcCount=2}
2019-04-12 07:59:19.718+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/]: client produced 301 messages on the worker queue, auto-read is being disabled.
2019-04-12 07:59:19.719+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/]: consumed messages on the worker queue below 100, auto-read is being enabled.
2019-04-12 07:59:19.851+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{3cf28095}[matchIndex: 1177136, lastSentIndex: 1177136, localAppendIndex: 1177136, mode: PIPELINE]
2019-04-12 07:59:19.851+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{c0c80264}[matchIndex: 1177136, lastSentIndex: 1177136, localAppendIndex: 1177136, mode: PIPELINE]
2019-04-12 07:59:19.852+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Inactive{}
2019-04-12 07:59:19.852+0000 INFO [o.n.c.c.r.RaftReplicator] Lost previous leader 'MemberId{b5353fdb}'. Currently no available leader
2019-04-12 07:59:19.852+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Step down event detected. This topology member, with MemberId MemberId{b5353fdb}, was leader in term 6481, now moving to follower.
2019-04-12 07:59:19.852+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Leader MemberId{b5353fdb} updating leader info for database default and term 6482
2019-04-12 07:59:19.853+0000 INFO [o.n.c.c.c.s.RaftState] First leader elected: MemberId{c0c80264}
2019-04-12 07:59:19.863+0000 INFO [o.n.c.c.r.RaftReplicator] A new leader has been detected: 'MemberId{c0c80264}'
2019-04-12 07:59:20.059+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=116, gcTime=127, gcCount=1}
2019-04-12 07:59:20.090+0000 WARN [c.n.c.d.SslHazelcastCoreTopologyService] Uncaught exception Hazelcast instance is not active!
com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
	at com.hazelcast.spi.impl.operationservice.impl.InvocationRegistry.shutdown(
	at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.shutdownInvocations(
	at com.hazelcast.spi.impl.NodeEngineImpl.shutdown(
	at com.hazelcast.instance.Node.shutdownServices(
	at com.hazelcast.instance.Node.shutdown(
	at com.hazelcast.instance.OutOfMemoryHandlerHelper.tryShutdown(
	at com.hazelcast.instance.DefaultOutOfMemoryHandler.onOutOfMemory(
	at com.hazelcast.instance.OutOfMemoryErrorDispatcher.onOutOfMemory(
	at ------ submitted from ------.(Unknown Source)
	at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(
	at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(
	at com.hazelcast.spi.impl.AbstractInvocationFuture.get(
	at com.hazelcast.multimap.impl.MultiMapProxySupport.invoke(
	at com.hazelcast.multimap.impl.MultiMapProxySupport.getAllInternal(
	at com.hazelcast.multimap.impl.ObjectMultiMapProxy.get(
	at org.neo4j.causalclustering.discovery.HazelcastClusterTopology.toCoreMemberMap(
	at org.neo4j.causalclustering.discovery.HazelcastClusterTopology.getCoreTopology(
	at org.neo4j.causalclustering.discovery.HazelcastCoreTopologyService.refreshCoreTopology(
	at org.neo4j.causalclustering.discovery.HazelcastCoreTopologyService.refreshTopology(
	at org.neo4j.causalclustering.helper.RobustJobSchedulerWrapper.withErrorHandling(
	at org.neo4j.causalclustering.helper.RobustJobSchedulerWrapper.lambda$scheduleRecurring$1(
	at org.neo4j.kernel.impl.scheduler.ScheduledJobHandle.lambda$new$0(
	at org.neo4j.kernel.impl.scheduler.ThreadPool.lambda$submit$0(
	at java.util.concurrent.Executors$
	at java.util.concurrent.ThreadPoolExecutor.runWorker(
	at java.util.concurrent.ThreadPoolExecutor$
2019-04-12 07:59:20.799+0000 WARN [o.n.k.i.p.Procedures] Error during iterate.commit:
2019-04-12 07:59:20.800+0000 WARN [o.n.k.i.p.Procedures] 1 times: java.lang.NullPointerException

The other two nodes logged:

2019-04-12 13:49:12.712+0000 INFO [o.n.c.m.RaftOutbound] No address found for MemberId{b5353fdb}, probably because the member has been shut down.

Also note that http://localhost:7474/db/manage/server/core/available must have returned OK on the faulty node, otherwise Kubernetes would have restarted it.

Last Tx Id was 300980 on the original leader, and 596152 on the other two nodes. After restarting the old leader node, it managed to join the cluster and sync up.

As far as we were able to see from the timestamps in the debug output from the DiagnonsticsManager, the original leader had not received any writes after the OutOfMemoryError, but we don't know how we can confirm that. Would a write to the old leader end up in an error when propagating the write to the new leader?

We are a bit scared that the cluster got into this state. As far as we have experienced with some large queries and updates, Neo4j tries to handle OutOfMemoryError, but in this case it looks like Hazelcast's shutdown handler was invoked. According to the release notes of 3.4, it was disabled in 3.4.0-alpha09, but it is obviously active in 3.5.3.

How can this scenario be prevented? It looks like the only safe thing to do is always restart a Neo4j node if OutOfMemoryError occurs.

Complete debug.log from the faulty node can be provided on request.

Due to the nature of causal clustering, notably quorum commits, a transaction can only commit when the majority of the cluster can also commit. Even in the event that the old leader could still contact the other members of the cluster, the term for the commit would be older than the term being used for the 2-node cluster after the new leader was elected, and would be rejected.

In other words, a single isolated node in the cluster, even if it thinks it's the leader, has no ability to commit any transactions that could go to it, since quorum consensus for the latest leadership term is required for that to happen.

And yes after an OutOfMemoryException restarting is the best way forward.

I'll see if engineering can take a glance and offer any additional comments.

1 Like

Thank you very much for the swift reply. It is reassuring that the faulty leader would not have succeeded any commits. We will take steps to monitor for OutOfMemoryError.

By the way, we have set environment variable NUMBER_OF_CORES to 3 in Kubernetes, per various tutorials and examples, but we cannot find any reference documentation for this.

We have not set NEO4J_causal__clustering_minimum__core__cluster__size__at__runtime, which per documentation defaults to 3, but the 2 remaining nodes formed an operational writeable cluster. This is how we would like the cluster to behave (as 2 still is majority of 3), but I would feel more confident if I knew how the default of 3 is overridden.

I am looking forward to hear from engineering why Hazelcast shut down on the faulty node.

The defaults for the min core cluster size at runtime should be fine.

This is a bit of a confusing topic, but I'll try to summarize.

In causal clusters, you can think of cluster membership like a VIP list, and only cluster members are allowed to take part in consensus operations, like commits. The number of members on the list is used to calculate majority quorum (for 3 members, majority is 2...for 4 or 5 members, majority is 3, etc). It is possible for a node to be a cluster member, but be unreachable or offline, and it still counts as a cluster member when it comes to calculating the majority.

The membership list can increase or decrease as members are added and removed. The thing to remember is that the voting in or out of members requires a majority quorum vote to succeed.

So if you have a 3-node cluster and another node is added, a majority (2) is needed to vote to add that member to the membership list and be counted as a cluster member (and the majority number will be updated to 3). If a member becomes unreachable or shuts down, even though it may be unreachable, a vote must take place first to remove that node's cluster membership, so 3 nodes must be online and able to vote out the member (and the cluster size would decrease to 3, majority becomes 2).

The minimum core cluster size at runtime (default 3) means the membership list cannot shrink below 3 members. If one node of the 3-node cluster leaves or is unreachable, there will not be a vote to remove the member since that would lower the cluster members below the minimum. Therefore the node that left is still counted as a cluster member, even if it is offline.

This is also why when you lose quorum (for example 2 out of 3 cluster members offline) that the only way to regain quorum is to bring back one of those offline members. You cannot add a brand new node to the cluster to recover, as quorum is required to vote in new cluster members.

And just to note, that is all by design in the Raft protocol, as this behavior is critical to maintaining data durability in the cluster. If we allowed brand new cluster members to be added even when we have lost quorum, although it would be a very fast way to regain write capability, it would risk scenarios that can result in losing previous commits to the database.

1 Like

Thank your for the clarification! I understand that I mixed cluster size and quorum majority when reading the documentation here.

We will leave the settings to default. :slight_smile: