Cluster leader election after election


(Tim Hanssen) #1

Our causal cluster has started holding elections continuously. The leader changes so often that the cluster is effectively in a read-only mode. Before this interruption, the cluster had been running for 7 days without issues or additional elections.

  • Ubuntu 18 LTS
  • Neo4j 3.4.10
  • BOLT (without routing)
  • Causal cluster (3 nodes)

debug log (neo03 (starting state: follower))
https://drive.google.com/file/d/1P-Aop1lzHhYciM6BqpItShSNRXZth0zc/view?usp=sharing

debug log (neo04 (starting state: leader))
https://drive.google.com/file/d/1vm4hNpIZfw4pj2GHCHYw7FRg_ogP6k4k/view?usp=sharing

Any suggestions?


(Alberto De Lazzari) #2

Hi Tim,
just to better understand your issue.
Was the cluster correct the first time you started the instances? I mean, was an election completed successfully within a certain amount of time, resulting in one leader and two followers?

The first time you started the instances, were all of them in an initial "follower" state? (you can check in your log files).
If not, and something went wrong (maybe because of an initial misconfiguration of the instances), please take a look at the "data" subdirectory of your Neo4j home directory. There should be a directory holding the cluster state. You can try deleting this state on each instance and then restarting all the servers.

In general, when starting from scratch, no instance will start as a leader. The proper state workflow should be: follower, candidate, leader. Once a leader is elected, all the other instances become followers.
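To make that role workflow concrete, here is a minimal Python sketch of a Raft-style role state machine (causal clustering is built on the Raft protocol). It is a simplified illustration, not Neo4j's actual implementation, and the event names (`election_timeout`, `won_majority`, `saw_leader`, `saw_higher_term`) are hypothetical labels chosen for this example.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# Allowed role changes in this simplified model.
# Event names are made up for illustration, not taken from Neo4j.
TRANSITIONS = {
    (Role.FOLLOWER, "election_timeout"): Role.CANDIDATE,
    (Role.CANDIDATE, "won_majority"): Role.LEADER,
    (Role.CANDIDATE, "saw_leader"): Role.FOLLOWER,
    (Role.CANDIDATE, "election_timeout"): Role.CANDIDATE,  # retry the election
    (Role.LEADER, "saw_higher_term"): Role.FOLLOWER,
}


def next_role(role, event):
    """Return the role after `event`; unknown events leave the role unchanged."""
    return TRANSITIONS.get((role, event), role)


def replay(events, start=Role.FOLLOWER):
    """Apply events in order and return every role visited, including the start."""
    role = start
    history = [role]
    for event in events:
        role = next_role(role, event)
        history.append(role)
    return history
```

For example, `replay(["election_timeout", "won_majority"])` walks a fresh instance through follower, candidate and leader. Note that there is no direct follower-to-leader transition in this model, which is why an instance that starts out as leader on a clean start would be suspicious.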

I hope this helps.


(Tim Hanssen) #3

Hi Alberto,

Was the cluster correct the first time you started the instances? I mean, was an election completed successfully within a certain amount of time, resulting in one leader and two followers?

Yes, the cluster was running for about 7 days without any issue. The leader did not change in those 7 days.

So I think the first election was done properly. After 4 days we restarted one core server (a follower), and it came back online as a follower without any problem. Everything was fine until last night, when this happened.


(Alberto De Lazzari) #4

Ok, I will check through your logs to see if there is something useful.

Was the restart related to a failure or to maintenance?

Just a correction to what I told you before: when you have to clear out the state of a core member of the cluster, instead of deleting the directory you can use the "neo4j-admin unbind" command. It's the recommended way to do that.

Thanks


(Tim Hanssen) #5

Thanks!

The restart was done for maintenance (backup settings etc.).

After this incident I restarted one follower node, which fixed the issue for now.


(Alberto De Lazzari) #6

Ok, in future, I think the best way to do any maintenance task on an instance is:

  • stop the instance
  • unbind the instance (that is, delete the instance's cluster state using the "neo4j-admin unbind" command)
  • do the tasks you have to perform
  • restart the instance

Once the instance is up and running, it will join the cluster with a correct state.
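The steps above could be scripted roughly as follows. This is only a sketch: it assumes the `neo4j` and `neo4j-admin` commands are on the PATH of the user running it (on a package-based Ubuntu install the service may instead be controlled through systemctl), and the helper names `maintenance_steps` and `run_maintenance` are invented for this example. It defaults to a dry run so that nothing is executed by accident.

```python
import subprocess


def maintenance_steps(task_commands):
    """Return the full command sequence: stop, unbind, your tasks, start."""
    return (
        [["neo4j", "stop"], ["neo4j-admin", "unbind"]]
        + [list(cmd) for cmd in task_commands]
        + [["neo4j", "start"]]
    )


def run_maintenance(task_commands, dry_run=True):
    """Print (dry run) or execute each step in order; return the steps handled."""
    handled = []
    for cmd in maintenance_steps(task_commands):
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True aborts the sequence on the first failing step,
            # so the instance is not restarted after a failed unbind or task.
            subprocess.run(cmd, check=True)
        handled.append(cmd)
    return handled
```

Calling `run_maintenance([["echo", "edit backup settings here"]])` prints the four commands in order without running them. Pass `dry_run=False` only on the instance being maintained, since the unbind step has to run while that instance is stopped.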


(Tim Hanssen) #7

We will, but the restart did not trigger this issue. It happened a few days later.