We run a three-node Causal Cluster on Kubernetes (Neo4j Enterprise 3.5.3) with the following cluster configuration:
HOSTNAME=$(hostname)
ID=${HOSTNAME#neo4jtest-}
export NEO4J_dbms_connectors_default__advertised__address=10.132.2.${ID}
export NEO4J_causal__clustering_discovery__advertised__address=$(hostname -f):5000
export NEO4J_causal__clustering_transaction__advertised__address=10.132.2.${ID}:600
export NEO4J_causal__clustering_raft__advertised__address=$(hostname -f):7000
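For reference, with the standard Neo4j Docker env-var mapping we understand these to resolve to roughly the following neo4j.conf settings (the <ID> and <pod-fqdn> placeholders stand for the per-pod values above; this is our reading of the mapping, not output from a running instance):
dbms.connectors.default_advertised_address=10.132.2.<ID>
causal_clustering.discovery_advertised_address=<pod-fqdn>:5000
causal_clustering.transaction_advertised_address=10.132.2.<ID>:600
causal_clustering.raft_advertised_address=<pod-fqdn>:7000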
We have also configured readiness and liveness probes:
readinessProbe:
  initialDelaySeconds: 60
  httpGet:
    path: /db/manage/server/core/available
    port: 7474
livenessProbe:
  initialDelaySeconds: 60
  httpGet:
    path: /db/manage/server/core/available
    port: 7474
Today we discovered that we had inconsistent data in our test environment. We have tracked the root cause down to the leader node hitting an OutOfMemoryError 12 days ago, which led to the node shutting down Hazelcast. The two follower nodes formed a two-node cluster and elected a new leader, but according to :sysinfo the old leader still believed it was the leader of a three-node cluster with the two other nodes as followers. The debug log from the faulty node does, however, contain some statements that contradict the :sysinfo output:
2019-04-12 07:59:19.231+0000 ERROR [o.n.b.r.ExecutorBoltScheduler] Unexpected error during job scheduling for session 'bolt-94767'. java.lang.OutOfMemoryError: Java heap space
java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: Java heap space
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:273)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:280)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1592)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: Java heap space
2019-04-12 07:59:19.552+0000 ERROR [o.n.b.t.p.HouseKeeper] Fatal error occurred when handling a client connection: [id: 0x095bb3af, L:/10.48.4.33:7687 ! R:/10.132.0.39:59536] Java heap space
java.lang.OutOfMemoryError: Java heap space
2019-04-12 07:59:19.553+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=540646, gcTime=0, gcCount=0}
2019-04-12 07:59:19.553+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:48578]: consumed messages on the worker queue below 100, auto-read is being enabled.
2019-04-12 07:59:19.554+0000 INFO [o.n.c.c.c.s.RaftState] Leader changed from MemberId{b5353fdb} to null
2019-04-12 07:59:19.696+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=314, gcTime=224, gcCount=2}
2019-04-12 07:59:19.718+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:48578]: client produced 301 messages on the worker queue, auto-read is being disabled.
2019-04-12 07:59:19.719+0000 WARN [o.n.b.r.BoltConnectionReadLimiter] Channel [/127.0.0.1:48578]: consumed messages on the worker queue below 100, auto-read is being enabled.
2019-04-12 07:59:19.851+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{3cf28095}[matchIndex: 1177136, lastSentIndex: 1177136, localAppendIndex: 1177136, mode: PIPELINE]
2019-04-12 07:59:19.851+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{c0c80264}[matchIndex: 1177136, lastSentIndex: 1177136, localAppendIndex: 1177136, mode: PIPELINE]
2019-04-12 07:59:19.852+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Inactive{}
2019-04-12 07:59:19.852+0000 INFO [o.n.c.c.r.RaftReplicator] Lost previous leader 'MemberId{b5353fdb}'. Currently no available leader
2019-04-12 07:59:19.852+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Step down event detected. This topology member, with MemberId MemberId{b5353fdb}, was leader in term 6481, now moving to follower.
2019-04-12 07:59:19.852+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Leader MemberId{b5353fdb} updating leader info for database default and term 6482
2019-04-12 07:59:19.853+0000 INFO [o.n.c.c.c.s.RaftState] First leader elected: MemberId{c0c80264}
2019-04-12 07:59:19.863+0000 INFO [o.n.c.c.r.RaftReplicator] A new leader has been detected: 'MemberId{c0c80264}'
2019-04-12 07:59:20.059+0000 WARN [o.n.k.i.c.VmPauseMonitorComponent] Detected VM stop-the-world pause: {pauseTime=116, gcTime=127, gcCount=1}
2019-04-12 07:59:20.090+0000 WARN [c.n.c.d.SslHazelcastCoreTopologyService] Uncaught exception Hazelcast instance is not active!
com.hazelcast.core.HazelcastInstanceNotActiveException: Hazelcast instance is not active!
at com.hazelcast.spi.impl.operationservice.impl.InvocationRegistry.shutdown(InvocationRegistry.java:182)
at com.hazelcast.spi.impl.operationservice.impl.OperationServiceImpl.shutdownInvocations(OperationServiceImpl.java:496)
at com.hazelcast.spi.impl.NodeEngineImpl.shutdown(NodeEngineImpl.java:434)
at com.hazelcast.instance.Node.shutdownServices(Node.java:456)
at com.hazelcast.instance.Node.shutdown(Node.java:430)
at com.hazelcast.instance.OutOfMemoryHandlerHelper.tryShutdown(OutOfMemoryHandlerHelper.java:61)
at com.hazelcast.instance.DefaultOutOfMemoryHandler.onOutOfMemory(DefaultOutOfMemoryHandler.java:87)
at com.hazelcast.instance.OutOfMemoryErrorDispatcher.onOutOfMemory(OutOfMemoryErrorDispatcher.java:189)
at com.hazelcast.internal.partition.impl.MigrationThread.run(MigrationThread.java:66)
at ------ submitted from ------.(Unknown Source)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolve(InvocationFuture.java:111)
at com.hazelcast.spi.impl.operationservice.impl.InvocationFuture.resolveAndThrow(InvocationFuture.java:74)
at com.hazelcast.spi.impl.AbstractInvocationFuture.get(AbstractInvocationFuture.java:158)
at com.hazelcast.multimap.impl.MultiMapProxySupport.invoke(MultiMapProxySupport.java:252)
at com.hazelcast.multimap.impl.MultiMapProxySupport.getAllInternal(MultiMapProxySupport.java:77)
at com.hazelcast.multimap.impl.ObjectMultiMapProxy.get(ObjectMultiMapProxy.java:115)
at org.neo4j.causalclustering.discovery.HazelcastClusterTopology.toCoreMemberMap(HazelcastClusterTopology.java:356)
at org.neo4j.causalclustering.discovery.HazelcastClusterTopology.getCoreTopology(HazelcastClusterTopology.java:105)
at org.neo4j.causalclustering.discovery.HazelcastCoreTopologyService.refreshCoreTopology(HazelcastCoreTopologyService.java:385)
at org.neo4j.causalclustering.discovery.HazelcastCoreTopologyService.refreshTopology(HazelcastCoreTopologyService.java:375)
at org.neo4j.causalclustering.helper.RobustJobSchedulerWrapper.withErrorHandling(RobustJobSchedulerWrapper.java:51)
at org.neo4j.causalclustering.helper.RobustJobSchedulerWrapper.lambda$scheduleRecurring$1(RobustJobSchedulerWrapper.java:41)
at org.neo4j.kernel.impl.scheduler.ScheduledJobHandle.lambda$new$0(ScheduledJobHandle.java:83)
at org.neo4j.kernel.impl.scheduler.ThreadPool.lambda$submit$0(ThreadPool.java:57)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
2019-04-12 07:59:20.799+0000 WARN [o.n.k.i.p.Procedures] Error during iterate.commit:
2019-04-12 07:59:20.800+0000 WARN [o.n.k.i.p.Procedures] 1 times: java.lang.NullPointerException
The other two nodes logged:
2019-04-12 13:49:12.712+0000 INFO [o.n.c.m.RaftOutbound] No address found for MemberId{b5353fdb}, probably because the member has been shut down.
Also note that http://localhost:7474/db/manage/server/core/available must have kept returning OK on the faulty node, since otherwise Kubernetes would have restarted it.
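For comparison, 3.5 also exposes per-core writable and read-only status endpoints next to available; as far as we understand, writable returns 200 only on the member that currently believes it is the leader, so probing it from each pod might have exposed the disagreement. A quick check would look something like this (the expected status codes in the comments are our understanding of the endpoints, not captured output):
# returns 200 on any core member that considers itself available
curl -i http://localhost:7474/db/manage/server/core/available
# should return 200 only on the member that believes it is the leader, 404 on followers
curl -i http://localhost:7474/db/manage/server/core/writable
# the inverse of writable: 200 on followers, 404 on the leader
curl -i http://localhost:7474/db/manage/server/core/read-only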
Last Tx Id was 300980 on the original leader and 596152 on the other two nodes. After restarting the old leader node, it managed to join the cluster and catch up. As far as we can tell from the timestamps in the debug output from the DiagnosticsManager, the original leader had not received any writes after the OutOfMemoryError, but we don't know how to confirm that. Would a write to the old leader have resulted in an error when it was propagated to the new leader?
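To compare the members in a more scriptable way than :sysinfo, something like the following against each core should work (the address and credentials are placeholders, and we assume the Transactions JMX bean exposes the same LastCommittedTxId that :sysinfo shows as Last Tx Id):
# which role does this member believe it has?
cypher-shell -a bolt://<core-host>:7687 -u neo4j -p <password> "CALL dbms.cluster.role();"
# last committed transaction id as seen by this member
cypher-shell -a bolt://<core-host>:7687 -u neo4j -p <password> "CALL dbms.queryJmx('org.neo4j:instance=kernel#0,name=Transactions') YIELD attributes RETURN attributes.LastCommittedTxId.value;"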
We are a bit worried that the cluster got into this state. In our experience with some large queries and updates, Neo4j tries to survive an OutOfMemoryError, but in this case Hazelcast's own out-of-memory shutdown handler was invoked. According to the 3.4 release notes that handler was disabled in 3.4.0-alpha09, yet it is evidently still active in 3.5.3.
How can this scenario be prevented? It looks like the only safe thing to do is to always restart a Neo4j node if an OutOfMemoryError occurs.
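One option we are considering is to let the JVM terminate on the first OutOfMemoryError so that Kubernetes simply restarts the pod. In our startup script that would presumably be a single extra line (the flag is a standard HotSpot option since JDK 8u92; the variable maps to dbms.jvm.additional):
# make the JVM exit immediately on the first OutOfMemoryError instead of limping on
export NEO4J_dbms_jvm_additional="-XX:+ExitOnOutOfMemoryError"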
The complete debug.log from the faulty node can be provided on request.