Unable to set up 3-node cluster

danjou.philippe · October 16, 2020, 3:06pm

Hi, I have tried this 3 times now from scratch, following the documentation. I have 3 nodes in a LAN, no firewalls.

Relevant config lines:

dbms.mode=CORE

# Expected number of Core servers in the cluster at formation
causal_clustering.minimum_core_cluster_size_at_formation=3

# Minimum expected number of Core servers in the cluster at runtime.
causal_clustering.minimum_core_cluster_size_at_runtime=3

# A comma-separated list of the address and port for which to reach all other members of the cluster. It must be>
# host:port format. For each machine in the cluster, the address will usually be the public ip address of that m>
# The port will be the value used in the setting "causal_clustering.discovery_listen_address".
causal_clustering.initial_discovery_members=10.4.0.100:5000,10.4.0.101:5000,10.4.0.102:5000

# Host and port to bind the cluster member discovery management communication.
# This is the setting to add to the collection of address in causal_clustering.initial_core_cluster_members.
# Use 0.0.0.0 to bind to any network interface on the machine. If you want to only use a specific interface
# (such as a private ip address on AWS, for example) then use that ip address instead.
# If you don't know what value to use here, use this machines ip address.
causal_clustering.discovery_listen_address=10.4.0.100:5000

# Network interface and port for the transaction shipping server to listen on.
# Please note that it is also possible to run the backup client against this port so always limit access to it v>
# firewall and configure an ssl policy. If you want to allow for messages to be read from
# any network on this machine, use 0.0.0.0. If you want to constrain communication to a specific network address
# (such as a private ip on AWS, for example) then use that ip address instead.
# If you don't know what value to use here, use this machines ip address.
causal_clustering.transaction_listen_address=10.4.0.100:6000

# Network interface and port for the RAFT server to listen on. If you want to allow for messages to be read from
# any network on this machine, use 0.0.0.0. If you want to constrain communication to a specific network address
# (such as a private ip on AWS, for example) then use that ip address instead.
# If you don't know what value to use here, use this machines ip address.
causal_clustering.raft_listen_address=10.4.0.100:7000
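
A quick way to rule out address typos in a config like the one above is to check it mechanically. The following is only a sketch: `parse_conf` and `check_core_conf` are hypothetical helper names, not part of Neo4j, and the checks cover only the settings quoted in this post. It cannot detect a store ID mismatch; it only confirms that the discovery address and cluster sizes are consistent with the member list.

```python
def parse_conf(text):
    """Parse neo4j.conf-style 'key=value' lines, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def check_core_conf(settings):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    raw = settings.get("causal_clustering.initial_discovery_members", "")
    members = [m.strip() for m in raw.split(",") if m.strip()]
    listen = settings.get("causal_clustering.discovery_listen_address", "")
    host, _, port = listen.rpartition(":")

    if host == "0.0.0.0":
        # A wildcard bind matches any interface, so only the port can be compared.
        if not any(m.endswith(":" + port) for m in members):
            problems.append("no discovery member uses port " + port)
    elif listen not in members:
        problems.append(listen + " is not listed in initial_discovery_members")

    for key in ("causal_clustering.minimum_core_cluster_size_at_formation",
                "causal_clustering.minimum_core_cluster_size_at_runtime"):
        # Settings that are absent are skipped rather than assumed.
        if key in settings and int(settings[key]) > len(members):
            problems.append(key + " exceeds the number of listed members")
    return problems


# The settings from this post, as configured on the first node.
NODE1_CONF = """
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3
causal_clustering.initial_discovery_members=10.4.0.100:5000,10.4.0.101:5000,10.4.0.102:5000
causal_clustering.discovery_listen_address=10.4.0.100:5000
"""

print(check_core_conf(parse_conf(NODE1_CONF)))
```

For the values posted here the check finds nothing to complain about, which suggests the addresses themselves are not the cause.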

I always end up with the following in logs:

Node 1

2020-10-16 14:57:26.258+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Election timeout triggered
2020-10-16 14:57:26.258+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Pre-election started with: PreVote.Request from MemberId{c98a8c4f} {term=1, candidate=MemberId{c98a8c4f}, lastAppended=1, lastLogTerm=1} and members: [MemberId{ed3a7be7}, MemberId{c98a8c4f}, MemberId{9195c6a9}]
2020-10-16 14:57:31.074+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Election timeout triggered
2020-10-16 14:57:31.074+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Pre-election started with: PreVote.Request from MemberId{c98a8c4f} {term=1, candidate=MemberId{c98a8c4f}, lastAppended=1, lastLogTerm=1} and members: [MemberId{ed3a7be7}, MemberId{c98a8c4f}, MemberId{9195c6a9}]
2020-10-16 14:57:31.939+0000 INFO [c.n.c.c.s.CommandApplicationProcess] [neo4j] Pausing due to snapshot request (count = 1)
2020-10-16 14:57:31.939+0000 INFO [c.n.c.c.s.CommandApplicationProcess] [neo4j] Resuming after snapshot request (count = 0)
2020-10-16 14:57:31.942+0000 INFO [c.n.c.c.s.CommandApplicationProcess] [neo4j] Pausing due to snapshot request (count = 1)
2020-10-16 14:57:31.942+0000 INFO [c.n.c.c.s.CommandApplicationProcess] [neo4j] Resuming after snapshot request (count = 0)
2020-10-16 14:57:36.177+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Election timeout triggered
2020-10-16 14:57:36.177+0000 INFO [c.n.c.c.c.RaftMachine] [neo4j] Pre-election started with: PreVote.Request from MemberId{c98a8c4f} {term=1, candidate=MemberId{c98a8c4f}, lastAppended=1, lastLogTerm=1} and members: [MemberId{ed3a7be7}, MemberId{c98a8c4f}, MemberId{9195c6a9}]

Node 2

2020-10-16 14:57:08.579+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:18.583+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:28.586+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:31.941+0000 INFO [c.n.c.c.s.s.SnapshotDownloader] [neo4j] Downloading snapshot from core server at 10.4.0.100:6000
2020-10-16 14:57:31.944+0000 ERROR [c.n.c.c.s.s.StoreDownloader] [neo4j] Store copy failed due to store ID mismatch
2020-10-16 14:57:38.589+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot

Node 3

2020-10-16 14:57:01.938+0000 ERROR [c.n.c.c.s.s.StoreDownloader] [neo4j] Store copy failed due to store ID mismatch
2020-10-16 14:57:08.736+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:18.739+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:28.742+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:31.938+0000 INFO [c.n.c.c.s.s.SnapshotDownloader] [neo4j] Downloading snapshot from core server at 10.4.0.100:6000
2020-10-16 14:57:31.940+0000 ERROR [c.n.c.c.s.s.StoreDownloader] [neo4j] Store copy failed due to store ID mismatch
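
With three nodes logging in parallel, it can help to collapse each log into counts per message so the repeating pattern stands out. This is a minimal sketch that assumes the line layout seen in the excerpts above (timestamp, level, logger in brackets, optional database name in brackets, message); `summarize` is a hypothetical helper, not a Neo4j tool.

```python
import re
from collections import Counter

# Layout assumed from the excerpts in this thread:
# <date> <time> <LEVEL> [<logger>] [<database>] <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+ \S+) (?P<level>[A-Z]+)\s+\[(?P<logger>[^\]]+)\] "
    r"(?:\[(?P<db>[^\]]+)\] )?(?P<msg>.*)$"
)


def summarize(log_text):
    """Count log lines per (level, message), ignoring lines that do not match."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            counts[(match["level"], match["msg"])] += 1
    return counts


# Three lines copied from the Node 3 excerpt above.
NODE3_LOG = """
2020-10-16 14:57:01.938+0000 ERROR [c.n.c.c.s.s.StoreDownloader] [neo4j] Store copy failed due to store ID mismatch
2020-10-16 14:57:08.736+0000 INFO [c.n.c.c.s.CoreSnapshotService] [neo4j] Waiting for another raft group member to publish a core state snapshot
2020-10-16 14:57:31.940+0000 ERROR [c.n.c.c.s.s.StoreDownloader] [neo4j] Store copy failed due to store ID mismatch
"""

for (level, msg), n in summarize(NODE3_LOG).most_common():
    print(n, level, msg)
```

The ERROR entries are the ones to look at first; the INFO lines about waiting for a snapshot are a consequence of the failed store copy.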

What am I doing wrong here? This should be easy and straightforward.
I also tried the unbind step with neo4j-admin, but it didn't change anything. It shouldn't be required on a fresh install anyway, should it?

Thanks for the help

david_allen · October 18, 2020, 1:17pm

The store ID mismatch is the problem.

Stop each server, run neo4j-admin unbind on each server, and then restart and it should be fixed.

The issue is that the cluster members think they have different databases, and they won't join and communicate if they have a "split brain".
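
The stop, unbind, restart sequence described above could be scripted roughly as follows, to be run on each server in turn. This is a sketch under assumptions: it presumes a Neo4j 4.x install where `neo4j` and `neo4j-admin` are on the PATH and the server is controlled with `neo4j stop` / `neo4j start` (a service-managed install would stop and start it differently). `unbind_steps` and `run_steps` are hypothetical helper names. By default nothing is executed; the commands are only printed.

```python
import subprocess


def unbind_steps(neo4j_bin="neo4j", admin_bin="neo4j-admin"):
    """Commands for one server: stop it, drop its cluster state, start it again."""
    return [
        [neo4j_bin, "stop"],
        [admin_bin, "unbind"],
        [neo4j_bin, "start"],
    ]


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print("$ " + " ".join(cmd))
        if not dry_run:
            # check=True aborts the sequence if a step fails.
            subprocess.run(cmd, check=True)


# Dry run: shows what would be executed on this server.
run_steps(unbind_steps())
```

All servers should be stopped before any of them is restarted, so in practice the stop and unbind steps would be run on every node first and the start step afterwards.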

danjou.philippe · October 18, 2020, 2:48pm

Yes, I did this multiple times already. It doesn't help (try it yourself).

But I found another post via Google where someone said to delete everything in the database dirs, so I did, and now it seems to work.

Update your Documentation! It's wrong!
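
For anyone repeating the "delete everything in the database dirs" step, a cautious version might look like the sketch below. Assumptions: which directories to clear is not stated in this post; on a Neo4j 4.x install they are probably `data/databases` and `data/transactions` under the Neo4j home, but verify that for your setup. `clear_dir_contents` is a hypothetical helper. It defaults to a dry run that only lists what would be removed, and the demo runs against a throwaway temp directory, not a real install. Stop Neo4j first, and only do this on a node whose data you can afford to lose.

```python
import shutil
import tempfile
from pathlib import Path


def clear_dir_contents(dirs, dry_run=True):
    """Return the entries inside each directory; delete them if dry_run is False.

    The directories themselves are kept, and missing directories are skipped.
    """
    entries = []
    for directory in map(Path, dirs):
        if not directory.is_dir():
            continue
        for entry in sorted(directory.iterdir()):
            entries.append(entry)
            if dry_run:
                continue
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()
    return entries


# Demo on a temporary directory standing in for a database directory.
with tempfile.TemporaryDirectory() as tmp:
    fake_databases = Path(tmp) / "databases"
    (fake_databases / "neo4j").mkdir(parents=True)
    (fake_databases / "store_lock").write_text("")
    for entry in clear_dir_contents([fake_databases]):
        print("would remove:", entry.name)
```

Calling the function with `dry_run=False` performs the actual deletion, so it is worth reading the dry-run listing first.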

harvey_nguyen · August 17, 2021, 10:20am

It seems you are right. I have the same issue, and neo4j-admin unbind doesn't work.

Dongho · February 8, 2022, 11:33am

Thank God, you saved my night!

Dongho · February 8, 2022, 11:48am

Ah, but I couldn't access the "neo4j" database. So I deleted the data folder and started again, and it succeeded.

bishnu12 · February 15, 2022, 10:13pm

Not sure about others, but @david_allen's suggestion,

"Stop each server, run neo4j-admin unbind on each server, and then restart and it should be fixed."

worked for me.

Thanks
