Causal cluster - Neo4j not running but it is?

cesar · May 23, 2019, 9:21pm

Hi everyone! I'm having an issue with a cluster I'm trying to create with 3 instances. According to the logs the cluster forms correctly, and the neo4j processes are running, but none of the members accept HTTP or Bolt connections.
These are the last log entries from all the members (only the IPs differ):

2019-05-23 21:00:12.413+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Cluster discovery service starting
2019-05-23 21:00:12.438+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] My connection info: [
        Discovery:   listen=0.0.0.0:5000, advertised=172.31.0.78:5000,
        Transaction: listen=0.0.0.0:6000, advertised=172.31.0.78:6000, 
        Raft:        listen=0.0.0.0:7000, advertised=172.31.0.78:7000, 
        Client Connector Addresses: bolt://172.31.0.78:7687,http://172.31.0.78:7474,https://172.31.0.78:7473
]
2019-05-23 21:00:12.438+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Discovering other core members in initial members set: [172.31.0.76:5000, 172.31.0.77:5000, 172.31.0.78:5000]
2019-05-23 21:00:12.482+0000 INFO [o.n.c.c.c.l.s.SegmentedRaftLog] log started with recovered state State{prevIndex=-1, prevTerm=-1, appendIndex=-1}
2019-05-23 21:00:12.482+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state before recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}
2019-05-23 21:00:12.483+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Recovering from: -1 to: -1
2019-05-23 21:00:12.484+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Membership state after recovery: RaftMembershipState{committed=null, appended=null, ordinal=-1}
2019-05-23 21:00:12.484+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: []
2019-05-23 21:00:12.557+0000 INFO [o.n.c.n.Server] raft-server: bound to 0.0.0.0:7000
2019-05-23 21:00:21.146+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Cluster discovery service started
2019-05-23 21:00:21.173+0000 INFO [o.n.c.d.CoreMonitor] Bound to cluster with id f2123d79-01c7-4bdf-a118-b96ebe5dc762
2019-05-23 21:00:21.320+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core topology changed {added=[{memberId=MemberId{16939e8d}, info=CoreServerInfo{raftServer=172.31.0.76:7000, catchupServer=172.31.0.76:6000, clientConnectorAddresses=bolt://172.31.0.76:7687,http://172.31.0.76:7474,https://172.31.0.76:7473, groups=[], database=default, refuseToBeLeader=false}}, {memberId=MemberId{3c8b291f}, info=CoreServerInfo{raftServer=172.31.0.78:7000, catchupServer=172.31.0.78:6000, clientConnectorAddresses=bolt://172.31.0.78:7687,http://172.31.0.78:7474,https://172.31.0.78:7473, groups=[], database=default, refuseToBeLeader=false}}], removed=[]}
2019-05-23 21:00:21.320+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: [MemberId{16939e8d}, MemberId{3c8b291f}]
2019-05-23 21:00:21.328+0000 INFO [o.n.c.d.CoreMonitor] Discovered core member at 172.31.0.76:5000
2019-05-23 21:00:26.074+0000 INFO [o.n.c.d.CoreMonitor] Discovered core member at 172.31.0.77:5000
2019-05-23 21:00:26.077+0000 INFO [c.n.c.d.SslHazelcastCoreTopologyService] Core topology changed {added=[{memberId=MemberId{1a32017e}, info=CoreServerInfo{raftServer=172.31.0.77:7000, catchupServer=172.31.0.77:6000, clientConnectorAddresses=bolt://172.31.0.77:7687,http://172.31.0.77:7474,https://172.31.0.77:7473, groups=[], database=default, refuseToBeLeader=false}}], removed=[]}
2019-05-23 21:00:26.077+0000 INFO [o.n.c.c.c.m.RaftMembershipManager] Target membership: [MemberId{16939e8d}, MemberId{1a32017e}, MemberId{3c8b291f}]

These are the cluster configs:

dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=172.31.0.78
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3
causal_clustering.initial_discovery_members=172.31.0.76:5000,172.31.0.77:5000,172.31.0.78:5000

When I run cypher-shell in any of the instances:

# cypher-shell
Connection refused

Neo4j status in all the members says it's not running:

# neo4j status
Neo4j is not running

And yet I can see in all the instances that neo4j IS running. What else can I check?
Thanks! Any help is appreciated!

ryan.boyd · May 24, 2019, 10:27pm

Hi there,

From one member of the cluster, are you able to telnet to another cluster member on the bolt port?

Note: this won't be usable, but you'll see if you can make the connection.

i.e.,
telnet 172.31.0.78 7687

Does that connect?

I'm trying to see whether a firewall on each machine is blocking outside connections. The members certainly appear to be able to connect to port 5000 on each machine.

Cheers,
-Ryan
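
The telnet check above can be repeated for every port the log in the first post lists. A minimal Python sketch, assuming those port numbers (5000, 6000, 7000, 7474, 7687); the helper names `port_open` and `check_ports` are hypothetical and not part of any Neo4j tooling:

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts alike.
        return False


def check_ports(host, ports, timeout=2.0):
    """Map each port to True (connection accepted) or False (not reachable)."""
    return {port: port_open(host, port, timeout) for port in ports}
```

For example, `check_ports("172.31.0.78", [5000, 6000, 7000, 7474, 7687])` run from another member returns a dict of port to True or False. As written, a False result does not distinguish a firewall from a service that is simply not listening.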

cesar · May 26, 2019, 10:08am

Hi Ryan, I wasn't able to telnet to the Bolt port, though not because a firewall was blocking the requests: there was no service listening on that port.

From the cluster machines, I even tried to telnet to localhost on the Bolt port, and the connection was refused.
On one of the machines I set up telnet to listen on the Bolt port (this worked, even though neo4j is supposed to be running and should therefore already hold the port), and I was then able to telnet to it from the other machines, so the port is reachable.

This leads me to believe that somehow neo4j is not listening on the bolt port. I've tried restarting but still no luck.

Any idea where I could look next?

Thanks!

david.shiposh · May 27, 2019, 2:21pm

Hi Cesar,

The instances won't start accepting requests (listening on the Bolt port) until all the clustering communication is working and the cluster has formed. I suspect that a networking or configuration issue is preventing the cluster from forming and communicating correctly. Can you attach or post the debug.log from all three instances?

Kind Regards,

Dave
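
One rough way to see from debug.log whether every core member has joined is to read the last "Target membership" line, in the format shown in the first post. A minimal Python sketch, assuming that log format; the function name is hypothetical. Note that in the first post's log all three members appear in the target membership and the Bolt port was still not listening, so a full list does not by itself show that the cluster is serving requests:

```python
import re

# Patterns follow the log lines quoted in the first post of this thread.
TARGET_RE = re.compile(r"Target membership: \[(.*?)\]")
MEMBER_RE = re.compile(r"MemberId\{([0-9a-f]+)\}")


def last_target_membership(log_text):
    """Return member ids from the last 'Target membership' line, or None."""
    matches = TARGET_RE.findall(log_text)
    if not matches:
        return None
    return MEMBER_RE.findall(matches[-1])
```

Comparing the length of the returned list with the expected number of core members (3 in this thread) gives a quick first check before reading the rest of the log.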

cesar · May 28, 2019, 9:40am

Hi David, these are the debug.log files for the 3 instances.

I think for now I'll just try to recreate the whole cluster from the beginning, re-importing data and all. It's a 1.3 TB database, so it's going to take a while.

Thanks for taking the time to look into this.

Cesar

david.kuku · October 25, 2019, 9:50am

Hi Cesar,

Can you please share your active neo4j.conf entries? I am facing the same issue: cluster formation completed, but the HTTP and Bolt listeners did not start.

lsof -i -P|grep neo4j

java 163405 neo4j 295u IPv4 896640 0t0 TCP xd1c:5000 (LISTEN)
java 163405 neo4j 346u IPv4 896659 0t0 TCP xd1c:7000 (LISTEN)
java 163405 neo4j 368u IPv4 895767 0t0 TCP xd1c:56647->xd1c4834665n4ja:5000 (ESTABLISHED)
java 163405 neo4j 369u IPv4 895864 0t0 TCP xd1c4834665n4jb:5000->xd1c4834665n4jc:52004 (ESTABLISHED)

I expected port 7474 to be in the LISTEN state.
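
The lsof output above can also be compared against a list of expected ports automatically. A minimal Python sketch, assuming lsof output in the format quoted above and the port numbers from the first post's log (5000, 6000, 7000, 7474, 7687); the function names are hypothetical:

```python
import re

# Matches lines such as "... TCP xd1c:5000 (LISTEN)" and captures the port.
LISTEN_RE = re.compile(r"TCP\s+\S+:(\d+)\s+\(LISTEN\)")


def listening_ports(lsof_output):
    """Return the set of TCP ports that lsof reports in the LISTEN state."""
    return {int(m.group(1)) for m in LISTEN_RE.finditer(lsof_output)}


def missing_ports(lsof_output, expected):
    """Return the expected ports that are not listening, in ascending order."""
    return sorted(set(expected) - listening_ports(lsof_output))
```

Applied to the four lsof lines above with the expected ports 5000, 6000, 7000, 7474 and 7687, `missing_ports` returns `[6000, 7474, 7687]`: only the discovery and Raft ports are listening.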

cesar · October 25, 2019, 2:29pm

Hi David, I'm really sorry I can't help you. We are not using clustering any more, and this was so long ago that I can't remember exactly how I solved the problem. I do remember that I solved it, and I think it was related to the firewall: it was necessary to open more than just one port. There were 3 or 4 additional ports that needed to be open for the clustering to work. But don't take my word for it.

david.kuku · December 4, 2019, 2:38pm

Hi Cesar

The issue has long been resolved. The main cause was that the graph databases on the cluster nodes were not in sync from the initial setup. Thanks for your reply.
