
Slow to form cluster after node instance restart

maprokes
Node

Hello,

I have a 3-node cluster running in Kubernetes (GKE), using the official Helm charts. Everything works very well most of the time, but from time to time I run into issues after restarting one of the nodes.
There are 10 graphs in the cluster, all of them relatively small, just a few thousand nodes. Usually it takes 1-2 minutes for a restarted node to come back up and rejoin the cluster.
But sometimes it gets stuck waiting for a snapshot:

[c.n.c.c.s.CoreSnapshotService] [riotrecommendations6e707beadc635df065390062291002a5/bb88d792] Waiting for another raft group member to publish a core state snapshot
[c.n.c.c.s.CoreSnapshotService] [riotrecommendations86111ba0b10ed2d1e895820b6fe6b790/cb2eefac] Waiting for another raft group member to publish a core state snapshot

Sometimes it fixes itself in a few minutes, sometimes it remains the same for hours or until restart. Restarting the node usually helps.
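
To narrow down which graphs are actually stuck, the log can be tallied per database. Below is a minimal sketch in Python, assuming the log lines look like the two samples above (a `[database/memberid]` tag followed by the "Waiting for another raft group member" message). The function name `count_snapshot_waits` is made up for illustration and is not part of any Neo4j tooling.

```python
import re
import sys
from collections import Counter

# Matches the "[<database>/<id>] Waiting for ..." part of a log line.
# The pattern is inferred from the two sample lines above, so treat it
# as an assumption about the log format rather than a specification.
WAIT_PATTERN = re.compile(
    r"\[(?P<db>[^\]/\s]+)/(?P<member>[0-9a-f]+)\]\s+"
    r"Waiting for another raft group member to publish a core state snapshot"
)


def count_snapshot_waits(lines):
    """Return a Counter mapping database name to number of wait messages."""
    counts = Counter()
    for line in lines:
        match = WAIT_PATTERN.search(line)
        if match:
            counts[match.group("db")] += 1
    return counts


if __name__ == "__main__":
    # Reads log lines from stdin and prints the databases that logged
    # the wait message, most frequent first.
    for db, n in count_snapshot_waits(sys.stdin).most_common():
        print(f"{n:6d}  {db}")
```

Saved as, say, `snapshot_waits.py` (the file name is arbitrary), it can be fed the pod log with `kubectl logs <pod> | python snapshot_waits.py`. A database whose count keeps growing between runs is one that is still waiting.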

I'm mostly using default configs from the helm-chart.
Any idea where to look for potential logs/issues?

A separate consequence that I observed: sometimes 8 out of 10 graphs are synced and ready, but the pod is not yet reachable via the service because the readinessProbe fails (it waits for port 7687). So if the restarted node is the leader for one of the ready graphs, it is unroutable within the Kubernetes network because the pod is not in the ready state, which obviously causes trouble for clients trying to write to the database.
