Causal Cluster not forming


(Michael) #1

I am trying to form a causal cluster of 2 cores, but not having any luck.

  • Security groups and NACLs are wide open.
  • Tried DNS discovery, and am now trying LIST with the public IPs of the other cores. (port 5000)
  • I have signed certificates and a cluster SSL policy defined.
  • I have pinged each server from the other and there is connectivity.
  • nmap reports Neo4j is listening on port 5000.
  • each core has its public ip as the advertised address
  • I have 3 instances trying to form a cluster with a minimum core count of 2
  • 2 of the servers where built by hand, the third was a clone of the first and then had its neo4j.conf updated

The 3 cores startup just fine and then they all wait on the "Attempting to connect to the other cluster members before continuing..." line.

I am at a loss of what to try next. Any help would be appreciated!

Edit:

I just tried again with two neo4j cores with the following conf:

dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=10.0.0.169
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=2
causal_clustering.minimum_core_cluster_size_at_runtime=2
causal_clustering.initial_discovery_members=10.0.0.187:5000,10.0.0.169:5000
causal_clustering.discovery_listen_address=0.0.0.0:5000
causal_clustering.transaction_listen_address=0.0.0.0:6000
causal_clustering.raft_listen_address=0.0.0.0:7000
causal_clustering.raft_advertised_address=10.0.0.169:7000
causal_clustering.transaction_advertised_address=10.0.0.169:6000

The only thing different with the second core's conf is the different ip addresses.
There is no ssl policy.

Both machines then hang on the following log ouput:

Oct 17 13:15:38 ip-10-0-0-187 neo4j[5279]: 2018-10-17 13:15:38.253+0000 INFO  ======== Neo4j 3.4.9 ========
Oct 17 13:15:38 ip-10-0-0-187 neo4j[5279]: 2018-10-17 13:15:38.312+0000 INFO  Starting...
Oct 17 13:15:41 ip-10-0-0-187 neo4j[5279]: 2018-10-17 13:15:41.833+0000 INFO  Initiating metrics...
Oct 17 13:15:42 ip-10-0-0-187 neo4j[5279]: 2018-10-17 13:15:42.175+0000 INFO  My connection info: [
Oct 17 13:15:42 ip-10-0-0-187 neo4j[5279]:         Discovery:   listen=0.0.0.0:5000, advertised=10.0.0.187:5000,
Oct 17 13:15:42 ip-10-0-0-187 neo4j[5279]:         Transaction: listen=0.0.0.0:6000, advertised=10.0.0.187:6000,
Oct 17 13:15:42 ip-10-0-0-187 neo4j[5279]:         Raft:        listen=0.0.0.0:7000, advertised=10.0.0.187:7000,
Oct 17 13:15:42 ip-10-0-0-187 neo4j[5279]:         Client Connector Addresses: bolt://10.0.0.187:7687,http://10.0.0.187:7474,https://10.0.0.187:7473
Oct 17 13:15:42 ip-10-0-0-187 neo4j[5279]: ]
Oct 17 13:15:42 ip-10-0-0-187 neo4j[5279]: 2018-10-17 13:15:42.177+0000 INFO  Discovering cluster with initial members: [10.0.0.187:5000, 10.0.0.169:5000]

I don't understand why they aren't forming a cluster. Any help?


(M. David Allen) #2

Please check your debug.log files, which are usually in /var/log/neo4j, and grep them for stuff like "cluster". At the bottom you'll find various messages related to members joining or leaving the cluster. Basically, you need to take a look at what each of the machines is seeing in this debug.log file to figure out whether the issue is that they're not finding one another, or if the issue is something else.

There are several possible causes and things you might want to try to remedy common problems:

  • Stop neo4j and run neo4j-admin unbind on each machine to remove cluster state (there are various ways this can get out of whack). This will not affect data in the database, just cluster state (nodes will forget their buddies and reform fresh)
  • Port 5000 isn't enough, you need 6000 and 7000 too (see relevant conf file entries)

If those "quick fixes" don't do it for you, the solution is going to be starting with clean cluster state (after neo4j-admin unbind) and then taking a look at log dumps of debug.log from all three nodes. Try those easy steps first and then past follow-up logs if they don't resolve it.


(Michael) #3

Still nothing.

Tried building a small 2 CORE cluster and getting identical results. did neo4j-admin unbind on both.

log from core1:

Oct 17 18:25:17 ip-10-0-0-187 systemd[1]: Started Neo4j Graph Database.Oct 17 18:25:18 ip-10-0-0-187 neo4j[6581]: Active database: graph.dbOct 17 18:25:18 ip-10-0-0-187 neo4j[6581]: Directories in use:Oct 17 18:25:18 ip-10-0-0-187 neo4j[6581]:   home:         /var/lib/neo4jOct 17 18:25:18 ip-10-0-0-187 neo4j[6581]:   config:       /etc/neo4jOct 17 18:25:18 ip-10-0-0-187 neo4j[6581]:   logs:         /var/log/neo4jOct 17 18:25:18 ip-10-0-0-187 neo4j[6581]:   plugins:      /var/lib/neo4j/pluginsOct 17 18:25:18 ip-10-0-0-187 neo4j[6581]:   import:       /var/lib/neo4j/importOct 17 18:25:18 ip-10-0-0-187 neo4j[6581]:   data:         /var/lib/neo4j/dataOct 17 18:25:18 ip-10-0-0-187 neo4j[6581]:   certificates: /var/lib/neo4j/certificatesOct 17 18:25:18 ip-10-0-0-187 neo4j[6581]:   run:          /var/run/neo4jOct 17 18:25:18 ip-10-0-0-187 neo4j[6581]: Starting Neo4j.Oct 17 18:25:18 ip-10-0-0-187 neo4j[6581]: 2018-10-17 18:25:18.868+0000 INFO  ======== Neo4j 3.4.9 ========2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -12018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.585+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: []2018-10-17 18:34:16.769+0000 INFO [o.n.c.n.Server] raft-server: bound to 0.0.0.0:70002018-10-17 18:34:16.783+0000 INFO [o.n.c.i.ClusterBinder] Waiting for 2 members. Currently discovered 0 members: {}.2018-10-17 18:34:18.293+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /10.0.0.187:7000 remote /10.0.0.169:559302018-10-17 18:34:21.800+0000 INFO [o.n.c.i.ClusterBinder] Waiting for 2 members. Currently discovered 0 members: {}.2018-10-17 18:34:25.582+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Cluster discovery service started2018-10-17 18:34:16.581+0000 INFO [o.n.c.c.c.l.s.SegmentedRaftLog] log started with recovered state State{prevIndex=-1, prevTerm=-1, appendIndex=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -1                                                                                                                                                                                                                               ]}}, {memberId=MemberId{36d05839}, inf2018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -12018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.585+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: []2018-10-17 18:34:16.769+0000 INFO [o.n.c.n.Server] raft-server: bound to 0.0.0.0:7000                                                                                                                                                                                                                                          73, groups=[]}}]}2018-10-17 18:34:16.783+0000 INFO [o.n.c.i.ClusterBinder] Waiting for 2 members. Currently discovered 0 members: {}.2018-10-17 18:34:18.293+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /10.0.0.187:7000 remote /10.0.0.169:559302018-10-17 18:34:21.800+0000 INFO [o.n.c.i.ClusterBinder] Waiting for 2 members. Currently discovered 0 members: {}.                                                                                                                                                                                                           ]}}], removed=[]}2018-10-17 18:34:16.581+0000 INFO [o.n.c.c.c.l.s.SegmentedRaftLog] log started with recovered state State{prevIndex=-1, prevTerm=-1, appendIndex=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -12018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -12018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.581+0000 INFO [o.n.c.c.c.l.s.SegmentedRaftLog] log started with recovered state State{prevIndex=-1, prevTerm=-1, appendIndex=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -12018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -12018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.585+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: []2018-10-17 18:34:16.769+0000 INFO [o.n.c.n.Server] raft-server: bound to 0.0.0.0:70002018-10-17 18:34:16.783+0000 INFO [o.n.c.i.ClusterBinder] Waiting for 2 members. Currently discovered 0 members: {}.2018-10-17 18:34:16.581+0000 INFO [o.n.c.c.c.l.s.SegmentedRaftLog] log started with recovered state State{prevIndex=-1, prevTerm=-1, appendIndex=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -12018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -12018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.581+0000 INFO [o.n.c.c.c.l.s.SegmentedRaftLog] log started with recovered state State{prevIndex=-1, prevTerm=-1, appendIndex=-1}
2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}
2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -1
2018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}
2018-10-17 18:34:16.581+0000 INFO [o.n.c.c.c.l.s.SegmentedRaftLog] log started with recovered state State{prevIndex=-1, prevTerm=-1, appendIndex=-1}
2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}
2018-10-17 18:34:16.582+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -1
2018-10-17 18:34:16.584+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}2018-10-17 18:34:16.585+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: []
2018-10-17 18:34:16.769+0000 INFO [o.n.c.n.Server] raft-server: bound to 0.0.0.0:7000
2018-10-17 18:34:16.783+0000 INFO [o.n.c.i.ClusterBinder] Waiting for 2 members. Currently discovered 0 members: {}.
2018-10-17 18:34:18.293+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /10.0.0.187:7000 remote /10.0.0.169:55930
2018-10-17 18:34:21.800+0000 INFO [o.n.c.i.ClusterBinder] Waiting for 2 members. Currently discovered 0 members: {}.2018-10-17 18:34:25.582+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Cluster discovery service started
2018-10-17 18:34:25.595+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core topology changed {added=[{memberId=MemberId{2da5b542}, info=CoreServerInfo{raftServer=10.0.0.187:7000, catchupServer=10.0.0.187:6000, clientConnectorAddresses=bolt://10.0.0.187:7687,http://10.0.0.187:7474,https://10.0.0.187:7473, groups=[]}}, {memberId=MemberId{36d05839}, info=CoreServerInfo{raftServer=10.0.0.169:7000, catchupServer=10.0.0.169:6000, clientConnectorAddresses=bolt://10.0.0.169:7687,http://10.0.0.169:7474,https://10.0.0.169:7473, groups=[]}}], removed=[]}
2018-10-17 18:34:25.595+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: [MemberId{36d05839}, MemberId{2da5b542}]2018-10-17 18:34:25.607+0000 INFO [o.n.c.i.ClusterBinder] Bound to cluster: ClusterId{uuid=9c62e550-1c06-42a1-86d6-c2d028dab3fb}
2018-10-17 18:35:33.963+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core member removed MembershipEvent {member=Member [10.0.0.169]:5000 - 6e9e38e4-e52e-4988-8896-10a25d5004e7,type=removed}
2018-10-17 18:35:33.966+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core topology changed {added=[], removed=[{memberId=MemberId{36d05839}, info=CoreServerInfo{raftServer=10.0.0.169:7000, catchupServer=10.0.0.169:6000, clientConnectorAddresses=bolt://10.0.0.169:7687,http://10.0.0.169:7474,https://10.0.0.169:7473, groups=[]}}]}
2018-10-17 18:35:33.966+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: [MemberId{2da5b542}]
2018-10-17 18:35:41.091+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core member added MembershipEvent {member=Member [10.0.0.169]:5000 - 409ffd39-9d7f-4c84-8392-757d75aa70e3,type=added}
2018-10-17 18:35:41.093+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core topology changed {added=[{memberId=MemberId{36d05839}, info=CoreServerInfo{raftServer=10.0.0.169:7000, catchupServer=10.0.0.169:6000, clientConnectorAddresses=bolt://10.0.0.169:7687,http://10.0.0.169:7474,https://10.0.0.169:7473, groups=[]}}], removed=[]}
2018-10-17 18:35:41.093+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: [MemberId{36d05839}, MemberId{2da5b542}]

log from core2:

2018-10-17 18:25:26.490+0000 INFO [o.n.k.i.i.l.NativeLabelScanStore] Rebuilding label index, this may take a while2018-10-17 18:25:26.521+0000 INFO [o.n.k.i.i.l.NativeLabelScanStore] Label index rebuilt (roughly 0 nodes)2018-10-17 18:25:26.544+0000 INFO [o.n.k.i.DatabaseHealth] Database health set to OK2018-10-17 18:25:26.837+0000 INFO [o.n.c.c.s.m.t.LastCommittedIndexFinder] Last transaction id in metadata store 22018-10-17 18:25:26.839+0000 INFO [o.n.c.c.s.m.t.LastCommittedIndexFinder] Start id of last committed transaction in transaction log 12018-10-17 18:25:26.839+0000 INFO [o.n.c.c.s.m.t.LastCommittedIndexFinder] Last committed transaction id in transaction log 22018-10-17 18:25:26.839+0000 INFO [o.n.c.c.s.m.t.LastCommittedIndexFinder] Last committed consensus log index committed into tx log -12018-10-17 18:25:26.839+0000 INFO [o.n.c.c.s.m.t.ReplicatedTransactionStateMachine] Updated lastCommittedIndex to -12018-10-17 18:25:26.840+0000 INFO [o.n.c.c.s.m.t.ReplicatedTokenStateMachine] (Label) Updated lastCommittedIndex to -1
2018-10-17 18:25:26.840+0000 INFO [o.n.c.c.s.m.t.ReplicatedTokenStateMachine] (RelationshipType) Updated lastCommittedIndex to -12018-10-17 18:25:26.841+0000 INFO [o.n.c.c.s.m.t.ReplicatedTokenStateMachine] (PropertyKey) Updated lastCommittedIndex to -12018-10-17 18:25:26.841+0000 INFO [o.n.c.c.s.CommandApplicationProcess] Restoring last applied index to 02018-10-17 18:25:26.841+0000 INFO [o.n.c.c.s.CommandApplicationProcess] Resuming after startup (count = 0)2018-10-17 18:25:26.852+0000 INFO [o.n.c.n.Server] catchup-server: bound to 0.0.0.0:60002018-10-17 18:25:26.856+0000 INFO [o.n.c.n.Server] backup-server: bound to 127.0.0.1:63622018-10-17 18:25:36.108+0000 INFO [o.n.c.c.c.RaftMachine] Election timeout triggered2018-10-17 18:25:36.110+0000 INFO [o.n.c.c.c.RaftMachine] Election started with vote request: Vote.Request from MemberId{36d05839} {term=1, candidate=MemberId{36d05839}, lastAppended=0, lastLogTerm=0} and members: [MemberId{78abaec6}, MemberId{36d05839}]
2018-10-17 18:25:36.110+0000 INFO [o.n.c.c.c.RaftMachine] Moving to CANDIDATE state after successfully starting election2018-10-17 18:25:36.119+0000 INFO [o.n.c.m.SenderService] Creating channel to: [10.0.0.187:7000]2018-10-17 18:25:36.133+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Scheduling handshake (and timeout) local null remote null2018-10-17 18:25:36.153+0000 INFO [o.n.c.m.SenderService] Connected: [id: 0xbb6666ed, L:/10.0.0.169:55592 - R:/10.0.0.187:7000]2018-10-17 18:25:36.160+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Initiating handshake local /10.0.0.169:55592 remote /10.0.0.187:70002018-10-17 18:25:36.237+0000 INFO [o.n.c.p.h.HandshakeClientInitializer] Installing: ProtocolStack{applicationProtocol=RAFT_1, modifierProtocols=[]}2018-10-17 18:25:36.273+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /10.0.0.169:7000 remote /10.0.0.187:405762018-10-17 18:25:36.318+0000 INFO [o.n.c.c.c.RaftMachine] Moving to LEADER state at term 1 (I am MemberId{36d05839}), voted for by [MemberId{78abaec6}]
2018-10-17 18:25:36.318+0000 INFO [o.n.c.c.c.s.RaftState] First leader elected: MemberId{36d05839}2018-10-17 18:25:36.336+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Starting log shipper: MemberId{78abaec6}[matchIndex: -1, lastSentIndex: 0, localAppendIndex: 1, mode: MISMATCH]
2018-10-17 18:25:36.344+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Idle{}2018-10-17 18:25:36.345+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Leader MemberId{36d05839} updating leader info for database default and term 12018-10-17 18:25:37.384+0000 INFO [o.n.c.p.h.HandshakeServerInitializer] Installing handshake server local /10.0.0.169:6000 remote /10.0.0.187:428142018-10-17 18:25:54.698+0000 INFO [o.n.c.c.c.RaftMachine] Moving to FOLLOWER state after not receiving heartbeat responses in this election timeout period. Heartbeats received: []2018-10-17 18:25:54.698+0000 INFO [o.n.c.c.c.s.RaftState] Leader changed from MemberId{36d05839} to null2018-10-17 18:25:54.698+0000 INFO [o.n.c.c.c.s.RaftLogShipper] Stopping log shipper MemberId{78abaec6}[matchIndex: -1, lastSentIndex: 1, localAppendIndex: 1, mode: MISMATCH]2018-10-17 18:25:54.699+0000 INFO [o.n.c.c.c.m.RaftMembershipChanger] Inactive{}2018-10-17 18:25:54.699+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Step down event detected. This topology member, with MemberId MemberId{36d05839}, was leader in term 1, now moving to follower.2018-10-17 18:25:54.858+0000 INFO [o.n.c.c.c.m.MembershipWaiter] Leader commit unknown
2018-10-17 18:26:03.354+0000 INFO [o.n.c.c.c.RaftMachine] Election timeout triggered
2018-10-17 18:26:03.355+0000 INFO [o.n.c.c.c.RaftMachine] Election started with vote request: Vote.Request from MemberId{36d05839} {term=2, candidate=MemberId{36d05839}, lastAppended=1, lastLogTerm=1} and members: [MemberId{78abaec6}, MemberId{36d05839}]
2018-10-17 18:26:03.355+0000 INFO [o.n.c.c.c.RaftMachine] Moving to CANDIDATE state after successfully starting election
2018-10-17 18:26:15.792+0000 INFO [o.n.c.c.c.RaftMachine] Failed to get elected. Got votes from: []
2018-10-17 18:26:15.793+0000 INFO [o.n.c.c.c.RaftMachine] Election started with vote request: Vote.Request from MemberId{36d05839} {term=3, candidate=MemberId{36d05839}, lastAppended=1, lastLogTerm=1} and members: [MemberId{78abaec6}, MemberId{36d05839}]
2018-10-17 18:26:22.859+0000 INFO [o.n.c.c.c.m.MembershipWaiter] Leader commit unknown
2018-10-17 18:26:29.484+0000 INFO [o.n.c.c.c.RaftMachine] Failed to get elected. Got votes from: []
2018-10-17 18:26:29.485+0000 INFO [o.n.c.c.c.RaftMachine] Election started with vote request: Vote.Request from MemberId{36d05839} {term=4, candidate=MemberId{36d05839}, lastAppended=1, lastLogTerm=1} and members: [MemberId{78abaec6}, MemberId{36d05839}]
2018-10-17 18:26:40.941+0000 INFO [o.n.c.c.c.RaftMachine] Failed to get elected. Got votes from: []
2018-10-17 18:26:40.941+0000 INFO [o.n.c.c.c.RaftMachine] Election started with vote request: Vote.Request from MemberId{36d05839} {term=5, candidate=MemberId{36d05839}, lastAppended=1, lastLogTerm=1} and members: [MemberId{78abaec6}, MemberId{36d05839}]

These instances are the simplest possible install. Just java and neo4j-enterprise.


(M. David Allen) #4

Can you check through these settings?

https://neo4j.com/docs/operations-manual/current/clustering/settings/

A causal cluster of 2 is an odd size, they're typically minimum three. Note that 2 is possible but may clash with the default values of causal_clustering.minimum_core_cluster_size_at_formation and causal_clustering.minimum_core_cluster_size_at_runtime

In your logs, core 1 says:

Waiting for 2 members - I'm not 100% sure if this means 2 others, or 2 total.


(Michael) #5

The cluster of 2 thing is just to prove I know how to stand up a cluster. My prod env will be 3 cores on 3 different AZs with 3 readers on the same 3 AZs. The readers will all be in an autoscale group.

For some reason when i pasted the log into the reply in condensed a bunch of entries on one line. Both cores do know of the other. When they boot up, STDOUT shows them list out the core members.

I can only assume my neo4j.conf file is missing something, but for the life of me I haven't found anything different.

I looked through the settings reference. I have everything applicable except
causal_clustering.discovery_type
which defaults to LIST, as can be seen in the STDOUT when neo4j lists the cluster members. Both my cores are aware of the other.

My first post contains a block of config settings that I append to the end of the default neo4j.conf file.

Thank you for looking into this.


(Michael) #6

So after posting the above I added that conf line I claimed defaulted to LIST to both my cores. It worked. That's what I was missing this whole time. On my honor, I thought I had tried that but I guess not. My original attempts where to use DNS and after a time, times, and half a time of trouble shooting I started getting ride of conf options like ssl. I also changed to LIST but obviously missed that line. After looking back on all my images, I see that causal_clustering.discovery_type doesn't appear to be in the default neo4j.conf file. Whenever I started a new image, I didn't put that line in thinking the commented out lines had the options I needed. My mistake.

To further make this troubleshooting difficult, Neo4j outputs to STDOUT that it knows the other cluster members because it give you their IP addresses.

Thanks for the tip. I can move on now.