Hi there,
I followed the instructions from here: https://neo4j.com/docs/operations-manual/4.4/kubernetes/quickstart-cluster/server-setup/ and deployed a cluster of three core members using Terraform.
Used helm-charts: https://github.com/neo4j/helm-charts/releases/tag/4.4.10
Used neo4j version: Neo4j 4.4.11
The code structure is as follows:
module/neo4j:
-main.tf
-variables.tf
--core-1/main.tf
--core-1/variables.tf
--core-1/core-1.values.yaml
--core-2/main.tf
--core-2/variables.tf
--core-2/core-2.values.yaml
--core-3/main.tf
--core-3/variables.tf
--core-3/core-3.values.yaml
So the root main.tf instantiates one module per core member. Nothing special, nothing fancy.
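To make the layout concrete, here is a simplified sketch of how the files relate. It is an illustration under assumptions, not the exact code: the module and release names are placeholders, the chart name `neo4j-cluster-core` and the repository URL are assumed from the Neo4j 4.4 Helm documentation, and the contents of the values files are omitted.

```hcl
# module/neo4j/main.tf (sketch): one module block per core member.
# Module names are hypothetical placeholders.
module "core_1" {
  source = "./core-1"
}

module "core_2" {
  source = "./core-2"
}

module "core_3" {
  source = "./core-3"
}

# module/neo4j/core-1/main.tf (sketch): core-2 and core-3 are analogous.
# Release name, repository, and chart are assumptions, not copied from the real files.
resource "helm_release" "core_1" {
  name       = "core-1"
  repository = "https://helm.neo4j.com/neo4j"
  chart      = "neo4j-cluster-core"
  version    = "4.4.10"

  # Per-member settings live in the values file next to this main.tf.
  values = [file("${path.module}/core-1.values.yaml")]
}
```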
The problem I am facing: the deployment succeeds only about 1 in 10 times. Whenever it fails, it is because the Terraform helm_release of one or two core members times out, stating: 'Secret "neo4j-cluster-auth" exists'.
The logs of the one (or two) members that did get deployed show that startup failed because the cluster is missing members. (initialDelaySeconds is configured for each core member, and I have also tried increasing it as a test.)
2022-11-17 08:59:22.738+0000 ERROR Failed to start Neo4j on 0.0.0.0:7474.
java.lang.RuntimeException: Error starting Neo4j database server at /var/lib/neo4j/data/databases
at org.neo4j.graphdb.facade.DatabaseManagementServiceFactory.startDatabaseServer(DatabaseManagementServiceFactory.java:227) ~[neo4j-4.4.11.jar:4.4.11]
at org.neo4j.graphdb.facade.DatabaseManagementServiceFactory.build(DatabaseManagementServiceFactory.java:180) ~[neo4j-4.4.11.jar:4.4.11]
at com.neo4j.causalclustering.core.CoreGraphDatabase.createManagementService(CoreGraphDatabase.java:38) ~[neo4j-causal-clustering-4.4.11.jar:4.4.11]
at com.neo4j.causalclustering.core.CoreGraphDatabase.<init>(CoreGraphDatabase.java:30) ~[neo4j-causal-clustering-4.4.11.jar:4.4.11]
at com.neo4j.server.enterprise.EnterpriseManagementServiceFactory.createManagementService(EnterpriseManagementServiceFactory.java:34) ~[neo4j-enterprise-4.4.11.jar:4.4.11]
at com.neo4j.server.enterprise.EnterpriseBootstrapper.createNeo(EnterpriseBootstrapper.java:20) ~[neo4j-enterprise-4.4.11.jar:4.4.11]
at org.neo4j.server.NeoBootstrapper.start(NeoBootstrapper.java:142) [neo4j-4.4.11.jar:4.4.11]
at org.neo4j.server.NeoBootstrapper.start(NeoBootstrapper.java:95) [neo4j-4.4.11.jar:4.4.11]
at com.neo4j.server.enterprise.EnterpriseEntryPoint.main(EnterpriseEntryPoint.java:24) [neo4j-enterprise-4.4.11.jar:4.4.11]
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Component 'com.neo4j.dbms.ClusteredDbmsReconcilerModule@5c2ae7d7' was successfully initialized, but failed to start. Please see the attached cause exception "Failed to join or bootstrap a raft group with id RaftGroupId{00000000} and members RaftMembersSnapshot{raftGroupId=Not yet published, raftMembersSnapshot={ServerId{c72f54d8}=Published as : RaftMemberId{c72f54d8}}} in time. Please restart the cluster. Clue: not enough cores found".
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.start(LifeSupport.java:463) ~[neo4j-common-4.4.11.jar:4.4.11]
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:110) ~[neo4j-common-4.4.11.jar:4.4.11]
at org.neo4j.graphdb.facade.DatabaseManagementServiceFactory.startDatabaseServer(DatabaseManagementServiceFactory.java:218) ~[neo4j-4.4.11.jar:4.4.11]
... 8 more
Hope someone can help! Thanks a lot,
Cheers Theresa