Startup probe failed: dial tcp <pod_ip_address>:7687: connect: connection refused

Hi Folks,

I hope someone can point me in the right direction to find the root cause of this error. I have Neo4j running on an Azure Kubernetes (AKS) cluster, installed via Helm with minimal custom configuration, using an Azure disk like this:

```yaml
volumes:
  data:
    mode: "volume"
    volume:
      azureDisk:
        diskName: "neo4j-disk-stn"
        diskURI: "/subscriptions/696xxx-xxx-xxx6-xx-xxxx/resourceGroups/MC_rg-kube-xxx-westeu-02_kube-xxx-02_westeu/providers/Microsoft.Compute/disks/neo4j-disk-stn"
        kind: Managed
```
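For reference, this is roughly how I compare what Azure reports for the managed disk with what Kubernetes thinks the bound PV/PVC capacity is after a resize (resource-group and disk names here are from my config above; yours will differ):

```shell
# Check the size and state Azure reports for the managed disk.
az disk show \
  --resource-group MC_rg-kube-xxx-westeu-02_kube-xxx-02_westeu \
  --name neo4j-disk-stn \
  --query "{name:name, sizeGb:diskSizeGb, state:diskState}" -o table

# Compare with the capacity Kubernetes has recorded on the PV/PVC.
kubectl get pv,pvc -n neo4j-ee-stn -o wide
```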

Devs are using it for evaluation, and today I spent hours troubleshooting, trying to get it back up and running after the devs resized the azureDisk to 256 GB.

The pod is in a crash loop and doesn't start at all; it fails with the following events:

```
Events:
  Type     Reason     Age                    From               Message
  ----     ------     ----                   ----               -------
  Normal   Scheduled  24m                    default-scheduler  Successfully assigned neo4j-ee-stn/neo4j-ee-stn-release-0 to aks-neo4j-33193140-vmss00002g
  Warning  Unhealthy  23m (x6 over 24m)      kubelet            Startup probe failed: dial tcp 10.244.0.15:7687: connect: connection refused
  Normal   Pulled     23m (x4 over 24m)      kubelet            Container image "neo4j:4.4.5-enterprise" already present on machine
  Normal   Created    23m (x4 over 24m)      kubelet            Created container neo4j
  Normal   Started    23m (x4 over 24m)      kubelet            Started container neo4j
  Warning  BackOff    4m16s (x100 over 24m)  kubelet            Back-off restarting failed container
```
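For anyone hitting the same events, these are the commands I used to dig further (pod, namespace, and container names are from my setup; adjust as needed):

```shell
# Full pod status, including volume attach/mount events at the bottom.
kubectl describe pod neo4j-ee-stn-release-0 -n neo4j-ee-stn

# Logs from the current container, and from the previous crashed one.
kubectl logs neo4j-ee-stn-release-0 -n neo4j-ee-stn -c neo4j
kubectl logs neo4j-ee-stn-release-0 -n neo4j-ee-stn -c neo4j --previous
```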

SSL is disabled, and I have also explicitly increased the startupProbe failureThreshold.
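For completeness, the probe override can be applied like this; I believe the Neo4j chart exposes `startupProbe.failureThreshold` and `startupProbe.periodSeconds`, but verify the exact value paths against your chart version's values.yaml (release and chart names here are inferred from my pod name):

```shell
# Give Neo4j more time to start before the startup probe gives up.
helm upgrade neo4j-ee-stn-release neo4j/neo4j-standalone \
  -n neo4j-ee-stn --reuse-values \
  --set startupProbe.failureThreshold=1000 \
  --set startupProbe.periodSeconds=5
```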

Any ideas how to tackle this one please?

Thanks!

Emil

Hello @emil

I did some searching and found a Stack Overflow question about a similar Kubernetes error.
Here is what I was able to find!
https://stackoverflow.com/questions/61303668/kubernetes-readiness-probe-failed-dial-tcp-10-244-0-105000-connect-connectio
I hope this helps!

Hi @TrevorS,

I appreciate your response.

It turned out the Neo4j pod could not mount the newly resized disk; that's why it was failing.

The startup probe was failing correctly, but I didn't suspect the disk could be the issue (despite the pod's volume-mount errors) until I tried to mount it on a VM and got more specific mount errors. I thought it was more likely something in the Neo4j config, but my assumption was wrong.

neo4j pod volume mount error:

```
Warning  FailedMount  81s  kubelet
Unable to attach or mount volumes:
unmounted volumes=[data], unattached volumes=[neo4j-conf data kube-api-access-fmwws]: timed out waiting for the condition
```

Trying to mount the azure disk under ubuntu:

```
mount: /neo4j: wrong fs type, bad option, bad superblock on /dev/sdd, missing codepage or helper program, or other error.
```

Trying to fix the disk under ubuntu shows:

```
fsck from util-linux 2.34
e2fsck 1.45.5 (07-Jan-2020)
ext2fs_open2: Bad magic number in super-block
fsck.ext2: Superblock invalid, trying backup blocks...
/dev/sdd: recovering journal
fsck.ext2: unable to set superblock flags on /dev/sdd

/dev/sdd: ***** FILE SYSTEM WAS MODIFIED *****

/dev/sdd: ********** WARNING: Filesystem still has errors **********
```

I hope this helps someone else with a similar issue. I would, however, be interested to hear if someone has experience with disk tools that could help recover from this kind of Azure disk failure.
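In case someone wants to attempt recovery before replacing the disk, this is the sort of sequence I would try next. It is only a sketch: snapshot first, `/dev/sdd` is the device from my fsck run above, and the backup-superblock offset depends on how the filesystem was created.

```shell
# Always snapshot the managed disk before any repair attempt.
az snapshot create \
  --resource-group MC_rg-kube-xxx-westeu-02_kube-xxx-02_westeu \
  --name neo4j-disk-stn-backup \
  --source neo4j-disk-stn

# List backup superblock locations without writing anything (-n = dry run);
# this only reports the right offsets if run with the original mkfs parameters.
mke2fs -n /dev/sdd

# Retry the check using one of the reported backup superblocks, e.g. 32768.
e2fsck -b 32768 /dev/sdd
```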

Obviously the solution was to switch to a new Azure disk and to run `helm upgrade ...`
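Concretely, the fix looked something like this (release and chart names are inferred from my pod name; substitute your own):

```shell
# values.yaml now points at the replacement disk's diskName/diskURI;
# rolling the release picks up the new volume.
helm upgrade neo4j-ee-stn-release neo4j/neo4j-standalone \
  -n neo4j-ee-stn -f values.yaml
```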

Thanks again!

Emil