Cluster hanging on attempt to connect to the other cluster members

I am trying to run Neo4j Enterprise 3.1.2 as a cluster of 3 machines, each configured as a CORE instance. I changed the following entries in the neo4j.conf file:

dbms.connectors.default_listen_address=0.0.0.0
dbms.connectors.default_advertised_address=IP_SERVER1
dbms.mode=CORE
causal_clustering.expected_core_cluster_size=3
causal_clustering.initial_discovery_members=IP_SERVER1:5000,IP_SERVER2:5000,IP_SERVER3:5000
dbms.memory.heap.initial_size=10g
dbms.memory.heap.max_size=10g
dbms.memory.pagecache.size=11g
dbms.security.procedures.unrestricted=algo.*

dbms.connector.bolt.enabled=true
dbms.connector.bolt.listen_address=:7687

dbms.connector.http.enabled=true
dbms.connector.http.listen_address=:7474

dbms.connector.https.enabled=true
dbms.connector.https.listen_address=:7473

However, the cluster never starts, and the log file contains only the following lines. Port 5000 is open and accessible from the other servers.

2019-05-22 10:36:32.988+0000 INFO Starting...
2019-05-22 10:36:34.787+0000 INFO Bolt enabled on 0.0.0.0:7687.
2019-05-22 10:36:34.804+0000 INFO Initiating metrics...
2019-05-22 10:36:35.000+0000 INFO My connection info: [
Discovery: listen=0.0.0.0:5000, advertised=IP_SERVER1:5000,
Transaction: listen=0.0.0.0:6000, advertised=IP_SERVER1:6000,
Raft: listen=0.0.0.0:7000, advertised=IP_SERVER1:7000,
Client Connector Addresses: bolt://IP_SERVER1:7687,http://IP_SERVER1:7474,https://IP_SERVER1:7473
]
2019-05-22 10:36:35.001+0000 INFO Discovering cluster with initial members: [IP_SERVER1:5000, IP_SERVER2:5000, IP_SERVER3:5000]
2019-05-22 10:36:35.001+0000 INFO Attempting to connect to the other cluster members before continuing...

Hi There,

A quick note - 3.1.2 is quite dated and if you're just getting started, I would recommend starting with 3.5.6.

As for the issue, this is usually due to a network and/or configuration issue. Make sure that in addition to port 5000, the instances in the cluster can also communicate over ports 6000 and 7000. Are there any details/messages in the debug.log?
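One way to check this from each machine is a small TCP probe. The following is only a minimal sketch: the function names are made up for illustration, and the port list simply mirrors the three ports mentioned above. Note that a failed connection can also mean that nothing is listening on that port yet, not only that a firewall is blocking it.

```python
import socket

# Ports named in this thread: discovery (5000), transaction (6000), raft (7000).
CLUSTER_PORTS = (5000, 6000, 7000)


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def check_members(hosts, ports=CLUSTER_PORTS, timeout=3.0):
    """Map each (host, port) pair to whether it accepted a TCP connection."""
    return {
        (host, port): port_is_reachable(host, port, timeout)
        for host in hosts
        for port in ports
    }
```

For example, on server 1 you could call `check_members(["IP_SERVER2", "IP_SERVER3"])` with the real addresses substituted in, then repeat from the other two machines so that every direction is covered.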

Kind Regards,
Dave

Hi David,

I can only use version 3.1.2 because that is what was offered to me under an academic licence.

Regarding the ports, this is not a communication issue because all of them are open (5000, 6000 and 7000). It is not a configuration issue either, and there is nothing more in the logs besides what I attached in my previous post.

I more or less solved the issue by following the procedure below, but it is very strange to me why this worked. Furthermore, there is no information in the logs or in the debug log that explains the reason for the failure (the error only shows a timeout because the cores could not be reached during the discovery process).

Steps to finally start the cluster:

  1. Stop Neo4j on each core server.
  2. Unbind each of the core servers (neo4j-admin unbind).
  3. Delete the graph folder (graph.db).
  4. Delete the cluster folder if the unbind command did not delete it.
  5. Start each core instance again in cluster mode.

If the cluster still does not start, the following actions can be performed after step 3:
a) Start each core server individually in single mode (change the relevant setting in neo4j.conf).
b) Stop the Neo4j service.
c) Delete the graph folder again and continue with step 5 above.
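For reference, steps 1 to 5 can be written down as a dry-run plan for one core server. This is only a sketch: `recovery_plan` is a hypothetical helper, the `data/databases/graph.db` and `data/cluster-state` locations are assumed defaults that may differ on a customised install, and nothing here is executed. Keep in mind that deleting graph.db discards the data in that store.

```python
from pathlib import Path


def recovery_plan(neo4j_home):
    """Return the recovery steps as (description, command) pairs; runs nothing."""
    home = Path(neo4j_home)
    # Assumed default locations; adjust if the data directory was relocated.
    graph_db = home / "data" / "databases" / "graph.db"
    cluster_state = home / "data" / "cluster-state"
    return [
        ("Stop Neo4j", ["neo4j", "stop"]),
        ("Unbind from the cluster", ["neo4j-admin", "unbind"]),
        ("Delete the graph folder", ["rm", "-rf", str(graph_db)]),
        ("Delete the cluster folder if unbind left it", ["rm", "-rf", str(cluster_state)]),
        ("Start Neo4j in cluster mode", ["neo4j", "start"]),
    ]
```

Printing the pairs returned by `recovery_plan("/path/to/neo4j")` gives a checklist to review on each core server before running anything by hand.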

Does Neo4j provide any more detailed logging? Applying such a workaround without knowing the cause of the error makes it impossible to maintain a production environment...