Running a neo4j cluster on Amazon EKS kubernetes

kubernetes
aws
cluster
js-driver

(Toolenaar) #1

Hi, I am trying to run a Neo4j cluster on Amazon EKS using the following guide => https://github.com/neo4j-contrib/kubernetes-neo4j

The guide was perfect and I was able to set up the cluster without any problems. After the setup I am able to connect to the database using kubectl, or by setting up a proxy to the DB (as described in the guide); everything works fine.

However, when I try to connect to the DB from my Node API, I run into problems. I have also tried to set up an external load balancer so I could connect to the DB from outside my Kubernetes cluster, but without any success.

I am using the following URL to connect to the DB from my Node.js API using the JavaScript driver: "bolt+routing://neo4j-core-0.neo4j.default.svc.cluster.local:7687" (I have also tried a long list of variations, including static IPs etc., all giving the same error).

Error: "Could not perform discovery. No routing servers available. Known routing table: RoutingTable[expirationTime=0, routers=[], readers=[], writers=[]]"

I am using the following code to initialize the driver

const driver = neo4j.driver(process.env.NEO4J_ENDPOINT, neo4j.auth.basic(process.env.NEO4J_USER, process.env.NEO4J_PASSWORD));
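As a side note, when the driver's settings come from environment variables it is easy for a misspelled name (e.g. NEO4J_PASSWWORD) to silently pass `undefined` to the driver. A minimal sketch of reading the config defensively (the env var names follow this thread; the helper itself is hypothetical, not part of the poster's code):

```javascript
// Sketch: collect driver settings from the environment and fail fast
// when any are missing, rather than handing undefined to the driver.
function neo4jConfig(env) {
  const required = ['NEO4J_ENDPOINT', 'NEO4J_USER', 'NEO4J_PASSWORD'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    uri: env.NEO4J_ENDPOINT,
    user: env.NEO4J_USER,
    password: env.NEO4J_PASSWORD,
  };
}

// Usage (requires a reachable database and the neo4j-driver package):
// const { uri, user, password } = neo4jConfig(process.env);
// const driver = neo4j.driver(uri, neo4j.auth.basic(user, password));
```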

I am very new to both Neo4j and Kubernetes, so I am not sure if this is an issue with how I am using the JS driver or with how Kubernetes is set up.

I have been stuck on this for a couple of days now. Any help/advice is very appreciated.


(M. David Allen) #2

Please have a look at this link; while it's about Neo4j on Google Kubernetes Engine, the same concepts apply to your case on AWS EKS (see in particular the "Limitations" section in that link).

The issue you're running into is that each node gets a private internal Kubernetes DNS address (neo4j-core-0.neo4j.default.svc.cluster.local). When your bolt+routing client connects to the cluster, the cluster members tell it that this is their address (check the default_advertised_address setting in neo4j inside your pod). That address is perfectly good inside of Kubernetes, but it's not resolvable outside of Kubernetes.

You have two options for how to resolve:

  1. Advertise a good DNS address to the outside world, and then set up the Kubernetes services needed to route traffic from outside the cluster to the right pod (you can use NodePorts, for example, to do this in k8s)
  2. Use bolt+routing only inside of kubernetes, and use a straight bolt client (not bolt+routing) external to kubernetes.
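For option 1, the knob on the Neo4j side is the advertised address. A sketch of the relevant Neo4j 3.x-style settings (the DNS name here is a placeholder you would point at your externally routed service, not a real endpoint):

```
# neo4j.conf (placeholder host): the address cluster members hand out
# to bolt+routing clients must be resolvable by those clients.
dbms.connectors.default_advertised_address=neo4j.example.com
dbms.connector.bolt.advertised_address=neo4j.example.com:7687
```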

Hope this helps.


(Toolenaar) #3

Thanks for the quick reply. Some more information about my setup: I am trying to connect from my Node.js API, which runs in another pod within the same Kubernetes cluster but on another node. Does this mean I should be able to connect to the Neo4j cluster from inside Kubernetes (which is what I have been trying), or does this mean I have to expose the Neo4j cluster to the outside world and use that address?

I have been trying to do the latter, but I am unable to get a connection when I set up a NodePort (I also tried a load balancer just to test, but can't resolve the IP address with that either).

apiVersion: v1
kind: Service
metadata:
  name: neo4j
  labels:
    app: neo4j
    component: core
spec:
  clusterIP: None
  ports:
    - port: 7474
      targetPort: 7474
      name: browser
    - port: 6362
      targetPort: 6362
      name: backup
  selector:
    app: neo4j
    component: core
---
apiVersion: v1
kind: Service
metadata:
  name: neo4j-external
  labels:
    app: neo4j
    component: core
spec:
  type: NodePort
  ports:
    - port: 7474
      targetPort: 7474
      name: browser-external
    - port: 7687
      targetPort: 7687
      name: bolt-external
  selector:
    app: neo4j
    component: core

(Toolenaar) #4

I am now trying to install Neo4j using the Helm package (https://github.com/helm/charts/tree/master/stable/neo4j)

I use the following command to install it onto AWS EKS:

helm install --name neo4j-helm stable/neo4j --set acceptLicenseAgreement=yes --set neo4jPassword=mypw --set core.numberOfServers=2 --set readReplica.numberOfServers=1 --set core.persistentVolume.storageClass=gp2 --set core.persistentVolume.size=50Gi

However, my core pods are stuck in a state where they cannot connect to each other: "Attempting to connect to the other cluster members before continuing..."

Any idea what might be going wrong here? Or where I can look to figure out what is going wrong?


(M. David Allen) #5

Are these things deployed into the same kubernetes namespace?

In terms of the servers not being able to discover the other members, this is typically because traffic is blocked on port 5000, 6000, or 7000. By default this should just work out of the box, but I'm not familiar with the networking specifics of EKS, so if there are special considerations there I'm afraid I can't help. But that's a place to look into.

For the JS app, first try just a bolt driver (bolt:// instead of bolt+routing://) and see if that works. Follow up with the details of trying that inside the Kubernetes cluster. It would be good to narrow down whether the issue is that you can't talk to the cluster at all, or whether it's an issue with routing.
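To make that experiment easy to flip, one sketch of a small helper that picks the scheme per environment (the host names and the helper itself are hypothetical, for illustration only):

```javascript
// Sketch: choose the URI scheme depending on where the app runs.
// bolt+routing:// only works where the cluster's advertised addresses
// resolve (e.g. inside Kubernetes); plain bolt:// hits one server directly.
function neo4jUri({ insideCluster, internalHost, externalHost, port = 7687 }) {
  return insideCluster
    ? `bolt+routing://${internalHost}:${port}`
    : `bolt://${externalHost}:${port}`;
}

// Example:
// neo4jUri({ insideCluster: true,
//            internalHost: 'neo4j.default.svc.cluster.local',
//            externalHost: 'my-elb.example.com' })
// → 'bolt+routing://neo4j.default.svc.cluster.local:7687'
```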

Also, did you make any modifications to the helm chart?


(Toolenaar) #6

You are correct that it probably has nothing to do with the setup of Neo4j, but with the way my Node.js API tries to connect to it.

I have now tried to set up a single instance of Neo4j instead of a cluster (which is also working correctly). I have opened this up externally at bolt://adfb11b6ed9c211e8a1230602c0c3d18-669561371.eu-west-1.elb.amazonaws.com:7687. But when I use this I get the following error:

{ Neo4jError: Failed to establish connection in 5000ms
    at captureStacktrace (/usr/src/app/node_modules/neo4j-driver/lib/v1/result.js:200:15)
    at new Result (/usr/src/app/node_modules/neo4j-driver/lib/v1/result.js:73:19)
    at Session._run (/usr/src/app/node_modules/neo4j-driver/lib/v1/session.js:173:14)
    at Session.run (/usr/src/app/node_modules/neo4j-driver/lib/v1/session.js:154:19)
    at DatabaseInitializer.<anonymous> (/usr/src/app/dist/db/DatabaseInitializer.js:57:66)
    at step (/usr/src/app/dist/db/DatabaseInitializer.js:32:23)
    at Object.next (/usr/src/app/dist/db/DatabaseInitializer.js:13:53)
    at /usr/src/app/dist/db/DatabaseInitializer.js:7:71
    at new Promise (<anonymous>)
    at __awaiter (/usr/src/app/dist/db/DatabaseInitializer.js:3:12) code: 'ServiceUnavailable', name: 'Neo4jError' }

But the silly thing is: if I try to connect to that external endpoint locally from my development environment, everything works fine.


(Toolenaar) #7

Sigh... I finally got fed up and reinstalled the entire cluster. And now everything is working fine. I still don't know what went wrong; KubeDNS probably got messed up in the previous installation. Thanks for the help though. I really appreciate the feedback and the Neo4j community.

Note: I was only able to run a single instance. Setting up the Helm package still gets stuck at

Attempting to connect to the other cluster members before continuing...

(Toolenaar) #8

Running into all other kinds of misery now.

0/4 nodes are available: 1 node(s) had no available volume zone, 3 node(s) didn't match node selector.

It seems my persistent volumes were not created in the same availability zone as the node on which Neo4j's pod was running.

Fixed this by using the following StorageClass (set the zone to the zone you want to use):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: eu-west-1c
reclaimPolicy: Retain
mountOptions:
  - debug
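An alternative worth noting (an assumption on my part: it requires a recent enough Kubernetes release and the same aws-ebs provisioner): instead of pinning the zone, delay volume binding until a pod is scheduled, so the volume gets provisioned in whatever AZ the pod lands in. A sketch:

```yaml
# Sketch: let the scheduler place the pod first, then provision
# the EBS volume in that pod's availability zone.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2-topology-aware
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
```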

In the end I am only able to get it to work using a single Neo4j instance exposed externally through a LoadBalancer. I am not really happy about this, but with my current expertise I am not able to find another solution :frowning:


(M. David Allen) #9

Sorry that I can't help more with this. In terms of the Helm failures, and in particular your note about the volumes getting created in a different AZ, this strongly points to some unknown weirdness in the way EKS is operating. Creating persistent volumes in a different AZ doesn't really make much sense -- I suppose you could do it, but putting a volume far away from your nodes in network terms would make performance suffer; you usually want your storage very close to your compute.

It should be said that in the Kubernetes world, persistent volume claims and stateful sets are kind of the new kids on the block, and as the various cloud vendors create their own distributions of Kubernetes (such as EKS), there are no absolute guarantees they all work the same.

As for the cluster not forming, the only way to tell what's happening there is to shell into the containers and inspect the debug.log file, which may be in /var/log/neo4j on the system.

The debug.log file has a series of events in it that signal cluster lifecycle events. When a cluster starts, it has to contact its buddies, elect a leader for the cluster, and then run transaction catch-up to make sure they all have the same dataset. Generally, by looking at all 3 debug.log files, you can figure out what's going off the rails. Either it will fail to discover a buddy (strongly indicating a network problem between nodes), it will fail to elect a leader (various causes possible), or it will fail to synchronize data (again, various causes possible). My bet for your situation would be failure to discover the buddies -- in part because you're reporting this separate issue with persistent volumes being in the wrong AZ -- I can't see your system, but I have a hunch EKS is putting your pods in weird places and not guaranteeing a network path between them.

Other than giving you that as a general area to look into, I'm not sure what else to suggest because I'm not that familiar with the vagaries of EKS. The Helm chart has been tested with Google GKE, minikube installed on local machines, and homebrew Kubernetes, but not with EKS -- it's quite believable, though, that Amazon-specific EKS config may be required.