Hello.
I’m facing a pretty critical issue, and I’m fairly sure it is a Neo4j bug.
There is already an existing GitHub issue for it, "Discovery binds on docker0" (Issue #12221 in neo4j/neo4j), and I have added more findings from my setup there.
Long story short:
I’m trying to set up a Neo4j Causal Cluster on AWS ECS with awsvpc networking. With awsvpc networking, each task gets a separate network interface (with its own IP from the VPC) attached to the Docker container.
Here are the problems I’m facing:
- Setting `causal_clustering.discovery_listen_address` doesn’t work properly. When it is set to `0.0.0.0:5000`, the connection info in the logs shows the expected `Discovery: listen=0.0.0.0:5000, advertised=10.10.1.128:5000`, but in practice running `lsof -i tcp | grep neo | grep LISTEN` shows that, while the other ports listen properly on `*:7000` or `*:6000`, the Discovery port is still bound to a specific IP (as stated in the issue), for example: `java 6 neo4j 233u IPv4 427591 0t0 TCP 169.254.172.28:5000 (LISTEN)`
- When the Discovery port is not bound properly to `*:5000`, it looks like it gets bound to a random available network interface. In my case the ECS containers have two interfaces (excluding loopback): `Interface ecs-eth0: address: 169.254.172.x` and `Interface eth0: address: 10.10.x.x` (this data comes from the `debug.log` file), where `ecs-eth0` is some internal ECS interface I don’t care about and `eth0` is the one that should handle communication. The problem is that when Neo4j binds the listen port to `eth0`, everything works fine; when it binds to `ecs-eth0`, port 5000 is unreachable and the node can’t join the cluster. This happens completely at RANDOM; see more logs in my comment on the GitHub issue (Issue #12221 in neo4j/neo4j).
- Setting `causal_clustering.discovery_listen_address=10.10.x.x:5000` doesn’t work either. Even when I can see the correct bind info from `lsof` (`TCP ip-10-10-1-72.ec2.internal:5000 (LISTEN)`) and in the connection info log (`Discovery: listen=10.10.1.72:5000, advertised=10.10.1.72:5000`), it still doesn’t work: the nodes simply don’t discover each other, even though I can access port 5000 with `nc`, and it is a complete mystery to me why.
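
For anyone who wants to compare nodes quickly, the `lsof` check above can be scripted. This is only a rough sketch, not anything Neo4j ships: the function names `listen_sockets` and `discovery_bindings` are made up for illustration, and the second sample line is an invented example in the `*:7000` form described above (only the first line is real output from my setup). It reports which address a given port is bound to, so a Discovery port bound to `169.254.172.x` instead of `*` or `10.10.x.x` stands out.

```python
import re

# Matches the NAME column of a listening TCP socket in `lsof -i tcp` output,
# e.g. "TCP 169.254.172.28:5000 (LISTEN)" or "TCP *:7000 (LISTEN)".
LISTEN_RE = re.compile(r"TCP\s+(?P<host>\S+):(?P<port>\d+)\s+\(LISTEN\)")


def listen_sockets(lsof_output):
    """Return a list of (host, port) pairs, one per LISTEN line."""
    sockets = []
    for line in lsof_output.splitlines():
        match = LISTEN_RE.search(line)
        if match:
            sockets.append((match.group("host"), int(match.group("port"))))
    return sockets


def discovery_bindings(lsof_output, port=5000):
    """Return the hosts the given port is bound to ('*' means all interfaces)."""
    return [host for host, p in listen_sockets(lsof_output) if p == port]


if __name__ == "__main__":
    # First line: real output from the post. Second line: invented sample
    # in the "*:7000" form, included only to show a wildcard binding.
    sample = (
        "java 6 neo4j 233u IPv4 427591 0t0 TCP 169.254.172.28:5000 (LISTEN)\n"
        "java 6 neo4j 240u IPv4 427600 0t0 TCP *:7000 (LISTEN)\n"
    )
    for host in discovery_bindings(sample, port=5000):
        kind = "all interfaces" if host == "*" else "a specific address"
        print(f"port 5000 is bound to {host} ({kind})")
```

On a real node, the `sample` string would be replaced with the captured output of `lsof -i tcp | grep neo | grep LISTEN`.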
Thanks!