Cannot connect to cluster using k8s Ingress

I'm trying to set up a cluster for Neo4j version 4.0.1. I started with the official stable helm chart (more specifically, I used an open PR, [stable/neo4j] Neo4j fixes by srilumpa · Pull Request #20942 · helm/charts · GitHub, because I'm on the latest version of K8s).

The first issue I experienced (which is not present in version 3.x) is the one described in Unable to run neo4j v4 casual cluster with docker-compose or docker swarm.

I managed to fix that by replacing all occurrences of $(hostname -f) with $(hostname -I | awk '{print $1}').
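
Concretely, inside the chart's core StatefulSet startup command the change looks roughly like this (a sketch; the `ADVERTISED_IP` variable and the exact config key shown follow the Neo4j docker env-var convention and are my rendering of it):

```shell
# Before: advertise the pod's FQDN, which failed to resolve in my setup:
#   export NEO4J_dbms_default__advertised__address="$(hostname -f)"

# After: advertise the pod's first IP address instead.
ADVERTISED_IP=$(hostname -I | awk '{print $1}')
export NEO4J_dbms_default__advertised__address="$ADVERTISED_IP"
```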

Once that was fixed, I then tried to connect to the bolt server using the neo4j:// scheme. However, I was getting:

{ Neo4jError: Could not perform discovery. No routing servers available. Known routing table: RoutingTable[database=default database, expirationTime=0, currentTime=1577183669630, routers=, readers=, writers=]

I tried the suggestion from this post Neo4jError: Could not perform discovery. No routing servers available - #2 by david.allen but it didn't fix the issue.

To give more details about the setup, I'm using an Ingress LB that forwards requests to a new service I created:

apiVersion: v1
kind: Service
metadata:
  name: neo4j-external-access
  labels:
    app.kubernetes.io/managed-by: {{ .Release.Service | quote }}
    app.kubernetes.io/instance: {{ .Release.Name | quote }}
    helm.sh/chart: "{{ .Chart.Name }}-{{ .Chart.Version }}"
    app.kubernetes.io/name: {{ template "neo4j.name" . }}
    app.kubernetes.io/component: core
spec:
  type: ClusterIP
  ports:
    - name: http
      port: 7474
      targetPort: 7474
    - name: bolt
      port: 7687
      targetPort: 7687
  selector:
    app.kubernetes.io/name: {{ template "neo4j.name" . }}
    app.kubernetes.io/instance: {{ .Release.Name | quote }}
    app.kubernetes.io/component: core

and the Ingress is as follows:

  - host: browser.domain.app
    http:
      paths:
      - backend:
          serviceName: neo4j-external-access
          servicePort: 7474
  - host: bolt.domain.app
    http:
      paths:
      - backend:
          serviceName: neo4j-external-access
          servicePort: 7687

Finally, I'm using the following JS code (I use neo4j-driver v4.0.1) to connect to the bolt server:

    const driver = neo4j.driver(
      'neo4j://bolt.domain.app:443',
      neo4j.auth.basic('neo4j', 'password'),
      config
    );
 

I tested the connection using the browser explorer and I still got the same error. The only way I could connect was using bolt:// instead of neo4j://.

I searched for this issue and also read this article, Neo4j Considerations in Orchestration Environments | by David Allen | Neo4j Developer Blog | Medium, but I can't find a solution.

Update

I had some progress which I want to share with you.

I ended up setting a unique bolt_advertised_address for each of the pods. I did that by updating the StatefulSet manifest in the neo4j helm chart like this:

SET_INDEX=${HOSTNAME##*-}
export NEO4J_dbms_connector_bolt_advertised__address="bolt-$SET_INDEX.domain.app:443"
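
To spell out what that expansion does: StatefulSet pods get hostnames ending in their ordinal, and `${HOSTNAME##*-}` strips everything up to the last dash, so each pod advertises its own external name. A quick sketch (the hostname shown is illustrative; the real one depends on your release name):

```shell
# Illustrative pod hostname; real StatefulSet hostnames end in the ordinal.
HOSTNAME="neo4j-neo4j-core-2"

SET_INDEX=${HOSTNAME##*-}                      # -> "2"
ADVERTISED="bolt-$SET_INDEX.domain.app:443"    # -> "bolt-2.domain.app:443"
```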

I also updated the ingress settings so it includes the above addresses.
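
For reference, the extra ingress rules looked roughly like this (a sketch: the host names follow the pattern above, but the per-pod backend service names are illustrative; how you route each bolt-N host to the matching pod depends on your setup):

```yaml
  - host: bolt-0.domain.app
    http:
      paths:
      - backend:
          serviceName: neo4j-core-0   # illustrative per-pod service
          servicePort: 7687
  - host: bolt-1.domain.app
    http:
      paths:
      - backend:
          serviceName: neo4j-core-1   # illustrative per-pod service
          servicePort: 7687
  - host: bolt-2.domain.app
    http:
      paths:
      - backend:
          serviceName: neo4j-core-2   # illustrative per-pod service
          servicePort: 7687
```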

I finally managed to connect via the browser explorer using neo4j://bolt-0.domain.app:443.

However, I still get the same error when I'm trying to connect via the neo4j-driver.

Cheers,
Pavlos

The "considerations in orchestration environments" article talks a lot about what I think you're running into here.

When you use a client to connect to neo4j 4 using the neo4j:// scheme, then it will try to "cluster route" queries by default. The way this actually works is that the client gets a "routing table" from the server. The server responds with what cluster members are there, and what their roles in the cluster are (leader, follower, etc). You can check this manually by using browser and calling CALL dbms.cluster.overview();. (This isn't exactly what the driver does but for our purposes here it's very close and shows you similar information)

A key thing here is those bolt advertised addresses and what comes back in the response to that cluster overview call. The driver figures out who exists in the cluster, and then routes queries. When you're inside of kubernetes, it's very easy to advertise an address that isn't externally routable. This is generally the cause of the errors you're seeing.

Without seeing the specifics of your neo4j pod configuration, it's hard to tell. But that's where I'd look. Use browser, CALL dbms.cluster.overview() and pay particular attention to the addresses it reports back for the cluster members. If those addresses aren't routable by your client external to kubernetes, then the client (using the neo4j:// connection scheme) has no chance to do the right thing.

HTH

Hi David,

Thank you for the prompt response.

There are a few things mentioned that are not very obvious to someone who doesn't know the exact inner workings of the cluster setup.

I still can't find answers to very simple questions:

  1. If I don't use NEO4J_dbms_connector_bolt_advertised__address, then I can't connect through the Neo4j browser using the neo4j:// scheme.

  2. If I use a unique NEO4J_dbms_connector_bolt_advertised__address for each of the three pods I deploy within the StatefulSet, then I can connect through the Neo4j browser using the neo4j:// scheme, but I can't connect via the JS driver.

  3. The way this actually works is that the client gets a "routing table" from the server.

This is not quite clear to me. What do you mean by server? There are three nodes (1 leader, 2 followers), and I use a valid public domain name to access the service through a k8s ingress setup. Who is sending that routing table: the leader, or any node within the cluster?

  4. When you're inside of kubernetes, it's very easy to advertise an address that isn't externally routable

I'm not really inside kubernetes. As I showed above, I have created a new service (neo4j-external-access), and there is an Ingress that allows external access to the pods of the StatefulSet.

  5. Without seeing the specifics of your neo4j pod configuration, it's hard to tell

As I said, I'm using the official helm chart, and more specifically a PR that hasn't been merged yet: [stable/neo4j] Neo4j fixes by srilumpa · Pull Request #20942 · helm/charts · GitHub.

  6. CALL dbms.cluster.overview() returns
╒═════╤═════════════╤═════════════════════════════════════════════════════════════════════════╕
│"ttl"│"server.role"│"server.addresses"                                                       │
╞═════╪═════════════╪═════════════════════════════════════════════════════════════════════════╡
│300  │"WRITE"      │["bolt-2.domain.app:443"]                                                │
├─────┼─────────────┼─────────────────────────────────────────────────────────────────────────┤
│300  │"READ"       │["bolt-0.domain.app:443","bolt-1.domain.app:443"]                        │
├─────┼─────────────┼─────────────────────────────────────────────────────────────────────────┤
│300  │"ROUTE"      │["bolt-0.domain.app:443","bolt-1.domain.app:443","bolt-2.domain.app:443"]│
└─────┴─────────────┴─────────────────────────────────────────────────────────────────────────┘

If you ignore the exact domain names (I'm hiding the ones I'm actually using), this is the routing table I get back.

Let's take that routing table as an example, and let's take as an example client a python program running outside of Kubernetes. You give it neo4j://whatever as a connection point. It is able to make a connection through your ingress. It gets this routing table. It then attempts to set up connections to the addresses "bolt-0.domain.app:443", "bolt-1.domain.app:443" and "bolt-2.domain.app:443".

Notice that for the python program external to kubernetes, those addresses aren't routable, as those DNS names are only valid inside kubernetes. Because of this, your python program using the neo4j:// scheme will probably fail to connect: while it contacted the initial member through the kubernetes ingress, it can't contact any of the other members, because they don't have valid routable addresses (from the perspective of outside kubernetes).

Hi David,

Thank you again for taking the time to respond to my questions.

I will try to visually show what my setup looks like.

As you can see, there are A records in the external DNS for bolt-0.domain.app, bolt-1.domain.app and bolt-2.domain.app that point to the Ingress. So from a network point of view, those addresses are routable to the internal Pods via the Ingress and the neo4j-external-access service.

Basically, those addresses aren't valid only inside the kubernetes cluster; they are also visible to the outside world. I think a good indication of that is that I manage to connect via :7474/browser, which successfully creates a wss connection to any of those three URLs.

OK, I finally managed to make it work. The setup was correct; the only issue was with the HTTP port 443. I had to expose TCP port 7687 instead to allow incoming connections.

FYI, I followed these instructions: Exposing TCP and UDP services - NGINX Ingress Controller.
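
Concretely, with ingress-nginx this comes down to an entry in the controller's tcp-services ConfigMap (a sketch: the ConfigMap and namespace names match a default ingress-nginx install, and the default/neo4j-external-access target is from my setup; the controller also has to be started with --tcp-services-configmap and expose port 7687 on its own Service):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx
data:
  # format is <external port>: "<namespace>/<service>:<service port>"
  "7687": "default/neo4j-external-access:7687"
```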

I also updated the advertised bolt address:

SET_INDEX=${HOSTNAME##*-}
export NEO4J_dbms_connector_bolt_advertised__address="bolt-$SET_INDEX.app:7687"

Now I can successfully connect to neo4j://bolt-0.app:7687 via the JS driver.


Hi, I am running into the exact same issue, would you be able to post your ingress configuration that you finally ended up with that worked?


Since this thread, there have been major updates to helm support, including external exposure. Please follow the directions in this repo:

Do you guys plan on adding documentation for using ingresses as well?

The current helm chart doesn't use/support the nginx-ingress approach, so basically... no. But to be honest, I'm not sure I understand the motivation/desire/need for that approach over what we already have in place.

So I think it's a good example of something that should be opened as an issue on the repo linked above, where we could discuss that more and see what people are needing/doing.

I have a similar issue to ppoliani's. First of all, for simplicity reasons I have chosen to deploy a standalone neo4j instance with the aid of the neo4j/neo4j-standalone helm chart, version 4.4.4 (enterprise).
I am trying to access, from outside the Kubernetes cluster, a neo4j standalone instance deployed in k8s via an nginx ingress. When hitting in my browser the host that has been declared as the ingress for the neo4j standalone service, which has the form http://<neo4j_service_name>.<customized-cluster-domain>, I get the neo4j visual interface,

but when trying to log in, I get a timeout error. This happens with the default setup in the neo4j-enterprise.conf file (located inside the helm folder). Moreover, I connected to the neo4j standalone pod via another pod to check the "routing table", and the output I get is

neo4j@neo4j> CALL dbms.cluster.overview();
+--------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | addresses                                          | databases                           | groups |
+--------------------------------------------------------------------------------------------------------------------------------------------+
| "5447190b-d57b-4c54-b7f9-3df46f86817e" | ["bolt://localhost:7687", "http://localhost:7474"] | {neo4j: "LEADER", system: "LEADER"} | []     |
+--------------------------------------------------------------------------------------------------------------------------------------------+

I made some changes to neo4j-enterprise.conf, more specifically:

# With default configuration Neo4j only accepts local connections.
# To accept non-local connections, uncomment this line:
dbms.default_listen_address=0.0.0.0
# The address at which this server can be reached by its clients. This may be the server's IP address or DNS name, or
# it may be the address of a reverse proxy which sits in front of the server. This setting may be overridden for
# individual connectors below.
dbms.default_advertised_address="<neo4j_service_name>.<customized-cluster-domain>"
# Bolt connector
dbms.connector.bolt.enabled=true
#dbms.connector.bolt.tls_level=DISABLED
dbms.connector.bolt.listen_address=:7687
dbms.connector.bolt.advertised_address=:7687

# HTTP Connector. There can be zero or one HTTP connectors.
dbms.connector.http.enabled=true
dbms.connector.http.listen_address=:7474
dbms.connector.http.advertised_address=:7474

And now the "routing table" looks like this:

neo4j@neo4j> CALL dbms.cluster.overview();
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| id                                     | addresses                                                                                                                                                                      | databases                           | groups |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| "5447190b-d57b-4c54-b7f9-3df46f86817e" | ["http://<neo4j_service_name>.<customized-cluster-domain>:7687", "http://<neo4j_service_name>.<customized-cluster-domain>:7474"] | {neo4j: "LEADER", system: "LEADER"} | []     |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

But again, I get the neo4j interface in my browser but cannot log in with the correct username/password; I continue to get timeout errors with both the neo4j and bolt connectors. Both the bolt and neo4j ports are declared in the nginx ingress configuration.
Any ideas?

By the way, I ran into the same problem when exposing the deployment with a NodePort.
P.S.: I don't have the possibility of using a LoadBalancer.

Thank you in advance for your time!

I finally managed to deploy a neo4j cluster as per the documentation instructions (Quickstart: Deploy a cluster - Operations Manual) and was finally able to use a LoadBalancer. Some attention should be given to the LoadBalancer service selector, which must match the pods' labels. But the service is not stable, meaning that I get many disconnections. In my browser's developer console I get messages such as:

 Firefox can’t establish a connection to the server at ws://<external-ip>:7687/. bolt-worker-906e5d4626b98178efa6.js:34:102812
The connection to ws://<external-ip>:7687/ was interrupted while the page was loading. bolt-worker-906e5d4626b98178efa6.js:34:102812

Could this mean that I have to configure TLS in the neo4j cluster deployment? I am not getting such errors in the neo4j standalone deployment.

I finally used a LoadBalancer, but in the neo4j cluster deployment I get a lot of disconnections.

How did you resolve it? I am having issues exposing neo4j externally.