Causal cluster - Neo4j not running but it is?

Hi everyone! I'm having an issue with a 3-instance cluster I'm trying to create. According to the logs the cluster is created correctly and the neo4j processes are running, but none of the members accept HTTP or Bolt connections.
These are the last log entries, which look the same on all the members (apart from the IPs):

2019-05-23 21:00:12.413+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Cluster discovery service starting
2019-05-23 21:00:12.438+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] My connection info: [
        Discovery:   listen=0.0.0.0:5000, advertised=172.31.0.78:5000,
        Transaction: listen=0.0.0.0:6000, advertised=172.31.0.78:6000, 
        Raft:        listen=0.0.0.0:7000, advertised=172.31.0.78:7000, 
        Client Connector Addresses: bolt://172.31.0.78:7687,http://172.31.0.78:7474,https://172.31.0.78:7473
]
2019-05-23 21:00:12.438+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Discovering other core members in initial members set: [172.31.0.76:5000, 172.31.0.77:5000, 172.31.0.78:5000]
2019-05-23 21:00:12.482+0000 INFO [o.n.c.c.c.l.s.SegmentedRaftLog] log started with recovered state State{prevIndex=-1, prevTerm=-1, appendIndex=-1}
2019-05-23 21:00:12.482+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}
2019-05-23 21:00:12.483+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -1
2019-05-23 21:00:12.484+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}
2019-05-23 21:00:12.484+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: []
2019-05-23 21:00:12.557+0000 INFO [o.n.c.n.Server] raft-server: bound to 0.0.0.0:7000
2019-05-23 21:00:21.146+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Cluster discovery service started
2019-05-23 21:00:21.173+0000 INFO [o.n.c.d.CoreMonitor] Bound to cluster with id f2123d79-01c7-4bdf-a118-b96ebe5dc762
2019-05-23 21:00:21.320+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core topology changed {added=[{memberId=MemberId{16939e8d}, info=CoreServerInfo{raftServer=172.31.0.76:7000, catchupServer=172.31.0.76:6000, clientConnectorAddresses=bolt://172.31.0.76:7687,http://172.31.0.76:7474,https://172.31.0.76:7473, groups=[], database=default, refuseToBeLeader=false}}, {memberId=MemberId{3c8b291f}, info=CoreServerInfo{raftServer=172.31.0.78:7000, catchupServer=172.31.0.78:6000, clientConnectorAddresses=bolt://172.31.0.78:7687,http://172.31.0.78:7474,https://172.31.0.78:7473, groups=[], database=default, refuseToBeLeader=false}}], removed=[]}
2019-05-23 21:00:21.320+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: [MemberId{16939e8d}, MemberId{3c8b291f}]
2019-05-23 21:00:21.328+0000 INFO [o.n.c.d.CoreMonitor] Discovered core member at 172.31.0.76:5000
2019-05-23 21:00:26.074+0000 INFO [o.n.c.d.CoreMonitor] Discovered core member at 172.31.0.77:5000
2019-05-23 21:00:26.077+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core topology changed {added=[{memberId=MemberId{1a32017e}, info=CoreServerInfo{raftServer=172.31.0.77:7000, catchupServer=172.31.0.77:6000, clientConnectorAddresses=bolt://172.31.0.77:7687,http://172.31.0.77:7474,https://172.31.0.77:7473, groups=[], database=default, refuseToBeLeader=false}}], removed=[]}
2019-05-23 21:00:26.077+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: [MemberId{16939e8d}, MemberId{1a32017e}, MemberId{3c8b291f}]

These are the cluster configs:

dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=172.31.0.78
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3
causal_clustering.initial_discovery_members=172.31.0.76:5000,172.31.0.77:5000,172.31.0.78:5000

When I run cypher-shell in any of the instances:

# cypher-shell
Connection refused

Neo4j status in all the members says it's not running:

# neo4j status
Neo4j is not running

And yet I can see in all the instances that neo4j IS running. What else can I check?
Thanks! Any help is appreciated!

Hi there,

From one member of the cluster, are you able to telnet to another cluster member on the bolt port?

Note: this won't be usable, but you'll see if you can make the connection.

i.e.,
telnet 172.31.0.78 7687

Does that connect?

I'm trying to see whether a firewall on each machine is blocking outside connections. The members certainly appear to be able to connect to port 5000 on each machine.

Cheers,
-Ryan
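
The telnet test above can be repeated for every member and port in one go. This is only a rough sketch: the host and port lists are copied from the log excerpt and config earlier in the thread, `check_port` and `check_cluster` are made-up helper names, and it assumes bash (for its /dev/tcp feature) and the coreutils `timeout` command are available.

```shell
#!/usr/bin/env bash
# Members and ports as they appear in the thread's log excerpt (assumed still accurate).
hosts="172.31.0.76 172.31.0.77 172.31.0.78"
ports="5000 6000 7000 7687 7474"

# Try to open a TCP connection to host:port, giving up after 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so no telnet or nc is required.
check_port() {
  local host=$1 port=$2
  if timeout 3 bash -c "exec 3<>/dev/tcp/${host}/${port}" 2>/dev/null; then
    echo "${host}:${port} open"
  else
    echo "${host}:${port} closed or filtered"
    return 1
  fi
}

# Probe every port on every member; keep going after a failed probe.
check_cluster() {
  local h p
  for h in $hosts; do
    for p in $ports; do
      check_port "$h" "$p" || true
    done
  done
}
```

Running `check_cluster` from each member in turn shows which direction fails. A port that is refused even from localhost points at the service not listening rather than at a firewall.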

Hi Ryan, I wasn't able to telnet to the bolt port. The cause was not a firewall blocking the requests, though: no service was running on that port.

  • From the cluster machines, I tried to telnet even to localhost on the bolt port and the connection was refused.
  • On one of the machines I started a telnet listener on the bolt port (it let me, even though neo4j is supposed to be running and therefore holding that port), and I was able to connect from the other machines, so the port is reachable.

This leads me to believe that somehow neo4j is not listening on the bolt port. I've tried restarting but still no luck.

Any idea where I could look next?

Thanks!

Hi Cesar,

The instances won't start accepting requests (listening on the Bolt port) until all the clustering communication is working and the cluster has formed. I suspect a networking or configuration issue is preventing the cluster from forming and communicating correctly. Can you attach/post the debug.log from all three instances?

Kind Regards,

Dave
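
Before posting the full debug.log, it can help to pull out just the cluster-formation lines and compare them across members. A minimal sketch: the three message strings are the ones visible in the log excerpt at the top of this thread, `formation_summary` is a hypothetical helper name, and the log path in the usage line is an assumption about the install layout.

```shell
#!/usr/bin/env bash
# Print the most recent cluster-formation messages from a Neo4j log file.
# The patterns are taken from the excerpt in this thread; other versions may word them differently.
formation_summary() {
  local log=$1
  grep -E 'Bound to cluster with id|Discovered core member at|Target membership:' "$log" | tail -n 20
}

# Example (path is an assumption, adjust to your installation):
#   formation_summary /var/log/neo4j/debug.log
```

If every member reports the same cluster id and a target membership with all three MemberIds, discovery has worked and the problem is more likely in a later step.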

Hi David, these are the debug.log files for the 3 instances.

I think for now I'll just try to recreate the whole cluster from the beginning, re-importing data and all. It's a 1.3 TB database, so it's going to take a while :confused:

Thanks for taking the time to look into this.

Cesar

Hi Cesar,

Can you please share your active neo4j.conf entries? I am facing the same issue: cluster formation completed, but the HTTP and Bolt listeners were never started.

lsof -i -P|grep neo4j

java 163405 neo4j 295u IPv4 896640 0t0 TCP xd1c:5000 (LISTEN)
java 163405 neo4j 346u IPv4 896659 0t0 TCP xd1c:7000 (LISTEN)
java 163405 neo4j 368u IPv4 895767 0t0 TCP xd1c:56647->xd1c4834665n4ja:5000 (ESTABLISHED)
java 163405 neo4j 369u IPv4 895864 0t0 TCP xd1c4834665n4jb:5000->xd1c4834665n4jc:52004 (ESTABLISHED)

I expected port 7474 to be in the "LISTEN" state.
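
The same check can be made for all the relevant ports at once by filtering the lsof output. A small sketch, assuming the port numbers from the first post (5000, 6000, 7000, 7687, 7474) and the `host:port (LISTEN)` layout shown above; `missing_listeners` is a made-up helper name.

```shell
#!/usr/bin/env bash
# Read `lsof -i -P` output on stdin and print each expected port
# (given as arguments) that has no socket in the LISTEN state.
missing_listeners() {
  local input port
  input=$(cat)
  for port in "$@"; do
    if ! printf '%s\n' "$input" | grep -E ":${port} \(LISTEN\)" >/dev/null; then
      echo "$port"
    fi
  done
}

# Example:
#   lsof -i -P | grep neo4j | missing_listeners 5000 6000 7000 7687 7474
```

Fed the four lines above, this would report 6000, 7687 and 7474 as missing, so in that output the transaction port is not listening either, not only HTTP.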

Hi David, I'm really sorry I can't help you. We are not using clustering any more, and this was so long ago that I can't remember exactly how I solved the problem. I do remember that I solved it, and I think it was related to the firewall: more than just one port had to be opened. There were something like 3 or 4 additional ports that needed to be open for the clustering to work. But don't take my word for it :speak_no_evil:

Hi Cesar

The issue was resolved long ago. The main problem was that the graph databases on the cluster nodes were not in sync from the initial setup. Thanks for your reply.