Running a neo4j cluster on Amazon EKS kubernetes

kubernetes
aws
cluster
js-driver

(Toolenaar) #1

Hi, I am trying to run a Neo4j cluster on Amazon EKS using the following guide => https://github.com/neo4j-contrib/kubernetes-neo4j

The guide was perfect and I was able to set up the cluster without any problems. After the setup I am able to connect to the database using kubectl, or by setting up a proxy to the DB (as described in the guide); everything works fine.

However, when I try to connect to the DB from my Node API, I run into problems. I have also tried to set up an external load balancer so I could connect to the DB from outside my Kubernetes cluster, but without any success.

I am using the following URL to connect to the DB from my Node.js API using the JavaScript driver: "bolt+routing://neo4j-core-0.neo4j.default.svc.cluster.local:7687" (I have also tried a long list of variations, including static IPs etc., all giving the same error).

Error: "Could not perform discovery. No routing servers available. Known routing table: RoutingTable[expirationTime=0, routers=[], readers=[], writers=[]]"

I am using the following code to initialize the driver

const driver = neo4j.driver(process.env.NEO4J_ENDPOINT, neo4j.auth.basic(process.env.NEO4J_USER, process.env.NEO4J_PASSWORD));
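As a side note, when the driver's settings come from environment variables it is easy for a misspelled name (e.g. NEO4J_PASSWWORD) to silently pass `undefined` to the driver. A minimal sketch of reading the config defensively (the env var names follow this thread; the helper itself is hypothetical, not part of the poster's code):

```javascript
// Sketch: collect driver settings from the environment and fail fast
// when any are missing, rather than handing undefined to the driver.
function neo4jConfig(env) {
  const required = ['NEO4J_ENDPOINT', 'NEO4J_USER', 'NEO4J_PASSWORD'];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(', ')}`);
  }
  return {
    uri: env.NEO4J_ENDPOINT,
    user: env.NEO4J_USER,
    password: env.NEO4J_PASSWORD,
  };
}

// Usage (requires a reachable database and the neo4j-driver package):
// const { uri, user, password } = neo4jConfig(process.env);
// const driver = neo4j.driver(uri, neo4j.auth.basic(user, password));
```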

I am very new to both Neo4j and Kubernetes, so I am not sure if this is an issue with how I am using the JS driver or with how Kubernetes is set up.

I have been stuck on this for a couple of days now. Any help/advice is very appreciated.


(M. David Allen) #2

Please have a look at this link; while it's about Neo4j on Google Kubernetes Engine, the same concepts apply to your case on AWS EKS (see in particular the "Limitations" section in that link).

The issue you're running into is that each node gets a private internal Kubernetes DNS address (neo4j-core-0.neo4j.default.svc.cluster.local). When your bolt+routing client connects to the cluster, the cluster members tell it that this is their address (check the default_advertised_address setting in neo4j inside your pod). That address is perfectly good inside of Kubernetes, but it's not resolvable outside of Kubernetes.

You have two options for how to resolve:

  1. Advertise a good DNS address to the outside world, and then set up the Kubernetes services needed to route traffic from outside the cluster to the right pod (you can use NodePorts, for example, to do this in k8s)
  2. Use bolt+routing only inside of kubernetes, and use a straight bolt client (not bolt+routing) external to kubernetes.
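For option 1, the knob on the Neo4j side is the advertised address. A sketch of the relevant Neo4j 3.x-style settings (the DNS name here is a placeholder you would point at your externally routed service, not a real endpoint):

```
# neo4j.conf (placeholder host): the address cluster members hand out
# to bolt+routing clients must be resolvable by those clients.
dbms.connectors.default_advertised_address=neo4j.example.com
dbms.connector.bolt.advertised_address=neo4j.example.com:7687
```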

Hope this helps.


(Toolenaar) #3

Thanks for the quick reply. Some more information about my setup: I am trying to connect from my Node.js API, which runs in another pod within the same Kubernetes cluster but on another node. Does this mean I should be able to connect to the Neo4j cluster from inside Kubernetes (which is what I have been trying), or does this mean I have to expose the Neo4j cluster to the outside world and use that address?

I have been trying to do the latter, but I am unable to get a connection when I set up a NodePort (I also tried a load balancer just to test, but can't resolve the IP address with that either).

apiVersion: v1
kind: Service
metadata:
  name: neo4j
  labels:
    app: neo4j
    component: core
spec:
  clusterIP: None
  ports:
    - port: 7474
      targetPort: 7474
      name: browser
    - port: 6362
      targetPort: 6362
      name: backup
  selector:
    app: neo4j
    component: core
---
apiVersion: v1
kind: Service
metadata:
  name: neo4j-external
  labels:
    app: neo4j
    component: core
spec:
  type: NodePort
  ports:
    - port: 7474
      targetPort: 7474
      name: browser-external
    - port: 7687
      targetPort: 7687
      name: bolt-external
  selector:
    app: neo4j
    component: core

(Toolenaar) #4

I am now trying to install Neo4j using the Helm package (https://github.com/helm/charts/tree/master/stable/neo4j)

I use the following command to install it onto AWS EKS:

helm install --name neo4j-helm stable/neo4j --set acceptLicenseAgreement=yes --set neo4jPassword=mypw --set core.numberOfServers=2 --set readReplica.numberOfServers=1 --set core.persistentVolume.storageClass=gp2 --set core.persistentVolume.size=50Gi

However, my core pods are stuck in a state where they cannot connect to each other: "Attempting to connect to the other cluster members before continuing..."

Any idea what might be going wrong here? Or where I can look to figure out what is going wrong?


(M. David Allen) #5

Are these things deployed into the same kubernetes namespace?

In terms of the servers not being able to discover the other members, this is typically because traffic is blocked on port 5000, 6000, or 7000. By default this should just work out of the box, but I'm not familiar with the networking specifics of EKS, so if there are special considerations there I'm afraid I can't help. But that's a place to look into.

For the JS app, first try just a bolt driver (bolt:// instead of bolt+routing://) and see if that works. Follow up with the details of trying that inside the Kubernetes cluster. It would be good to narrow down whether the issue is that you can't talk to the cluster at all, or whether it's an issue with routing.
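To make that experiment easy to flip, one sketch of a small helper that picks the scheme per environment (the host names and the helper itself are hypothetical, for illustration only):

```javascript
// Sketch: choose the URI scheme depending on where the app runs.
// bolt+routing:// only works where the cluster's advertised addresses
// resolve (e.g. inside Kubernetes); plain bolt:// hits one server directly.
function neo4jUri({ insideCluster, internalHost, externalHost, port = 7687 }) {
  return insideCluster
    ? `bolt+routing://${internalHost}:${port}`
    : `bolt://${externalHost}:${port}`;
}

// Example:
// neo4jUri({ insideCluster: true,
//            internalHost: 'neo4j.default.svc.cluster.local',
//            externalHost: 'my-elb.example.com' })
// → 'bolt+routing://neo4j.default.svc.cluster.local:7687'
```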

Also, did you make any modifications to the helm chart?


(Toolenaar) #6

You are correct that it probably has nothing to do with the setup of Neo4j, but with the way my Node.js API tries to connect to it.

I have now tried to set up a single instance of Neo4j instead of a cluster (which is also working correctly). I have opened this up externally at bolt://adfb11b6ed9c211e8a1230602c0c3d18-669561371.eu-west-1.elb.amazonaws.com:7687. But when I use this I get the following error:

{ Neo4jError: Failed to establish connection in 5000ms
    at captureStacktrace (/usr/src/app/node_modules/neo4j-driver/lib/v1/result.js:200:15)
    at new Result (/usr/src/app/node_modules/neo4j-driver/lib/v1/result.js:73:19)
    at Session._run (/usr/src/app/node_modules/neo4j-driver/lib/v1/session.js:173:14)
    at Session.run (/usr/src/app/node_modules/neo4j-driver/lib/v1/session.js:154:19)
    at DatabaseInitializer.<anonymous> (/usr/src/app/dist/db/DatabaseInitializer.js:57:66)
    at step (/usr/src/app/dist/db/DatabaseInitializer.js:32:23)
    at Object.next (/usr/src/app/dist/db/DatabaseInitializer.js:13:53)
    at /usr/src/app/dist/db/DatabaseInitializer.js:7:71
    at new Promise (<anonymous>)
    at __awaiter (/usr/src/app/dist/db/DatabaseInitializer.js:3:12) code: 'ServiceUnavailable', name: 'Neo4jError' }

But the silly thing is: if I try to connect to that external endpoint locally from my development environment, everything works fine.


(Toolenaar) #7

Sigh... I finally got fed up and reinstalled the entire cluster. And now everything is working fine. I still don't know what went wrong; KubeDNS probably got messed up in the previous installation. Thanks for the help though. I really appreciate the feedback and the Neo4j community.

Note: I was only able to run a single instance. Setting up the Helm package still gets stuck at

Attempting to connect to the other cluster members before continuing...

(Toolenaar) #8

Running into all other kinds of misery now.

0/4 nodes are available: 1 node(s) had no available volume zone, 3 node(s) didn't match node selector.

It seems my persistent volumes were not created in the same availability zone as the node on which Neo4j's pod was running.

Fixed this by using the following StorageClass (set the zone to the zone you want to use):

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  zone: eu-west-1c
reclaimPolicy: Retain
mountOptions:
  - debug
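An alternative worth noting (an assumption on my part: it requires a recent enough Kubernetes release and the same aws-ebs provisioner): instead of pinning the zone, delay volume binding until a pod is scheduled, so the volume gets provisioned in whatever AZ the pod lands in. A sketch:

```yaml
# Sketch: let the scheduler place the pod first, then provision
# the EBS volume in that pod's availability zone.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: gp2-topology-aware
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
volumeBindingMode: WaitForFirstConsumer
reclaimPolicy: Retain
```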

In the end I am only able to get it to work using a single Neo4j instance exposed externally through a LoadBalancer. I am not really happy about this, but with my current expertise I am not able to find another solution :frowning:


(M. David Allen) #9

Sorry that I can't help more with this. In terms of the Helm failures, and in particular your note about the volumes getting created in a different AZ, this strongly points to some unknown weirdness in the way EKS is operating. Creating persistent volumes in a different AZ doesn't really make much sense -- I suppose you could do it, but putting a volume far away from your nodes in network terms would make performance suffer; you usually want your storage very close to your compute.

It should be said that in the Kubernetes world, persistent volume claims and stateful sets are kind of the new kids on the block, and as the various cloud vendors create their own distributions of Kubernetes (such as EKS), there are no absolute guarantees they all work the same.

As for the cluster not forming, the only way to tell what's happening there is to shell into the containers and inspect the debug.log file, which may be in /var/log/neo4j on the system.

The debug.log file has a series of events in it that signal cluster lifecycle events. When a cluster starts, it has to contact its buddies, elect a leader for the cluster, and then run transaction catch-up to make sure they all have the same dataset. Generally, by looking at all 3 debug.log files, you can figure out what's going off the rails. Either it will fail to discover a buddy (strongly indicating a network problem between nodes), it will fail to elect a leader (various causes possible), or it will fail to synchronize data (again, various causes possible). My bet for your situation would be failure to discover the buddies -- in part because you're reporting this separate issue with persistent volumes being in the wrong AZ -- I can't see your system, but I have a hunch EKS is putting your pods in weird places and not guaranteeing a network path between them.

Other than giving you that as a general area to look into, I'm not sure what else to suggest because I'm not that familiar with the vagaries of EKS. The Helm chart has been tested with Google GKE, minikube installed on local machines, and homebrew Kubernetes, but not with EKS -- it's quite believable, though, that Amazon-specific EKS config may be required.