Hello.
I’m facing a pretty critical issue and I’m fairly sure that this is a Neo4j bug.
There is already an existing GitHub issue for it (Discovery binds on docker0 · Issue #12221 · neo4j/neo4j · GitHub), and I have added more findings from my setup in a comment there.
Long story short:
I’m trying to set up a Neo4j Causal Cluster on AWS ECS with `awsvpc` networking. `awsvpc` networking provides a separate network interface (with its own IP from the VPC) attached to the Docker container.
Here are the problems I’m facing:
- Setting `causal_clustering.discovery_listen_address` doesn’t work properly. When it is set to `0.0.0.0:5000`, the connection info in the logs shows the expected `Discovery: listen=0.0.0.0:5000, advertised=10.10.1.128:5000`, but in practice, running `lsof -i tcp | grep neo | grep LISTEN` shows that while the other ports listen properly on `*:7000` or `*:6000`, the Discovery port is still bound to a specific IP (as stated in the issue), for example: `java 6 neo4j 233u IPv4 427591 0t0 TCP 169.254.172.28:5000 (LISTEN)`
- When the Discovery port is not bound properly to `*:5000`, it looks like it is bound to a random available network interface. In my case the ECS containers have two interfaces (excluding loopback): `Interface ecs-eth0: address: 169.254.172.x` and `Interface eth0: address: 10.10.x.x` (this data comes from the `debug.log` file), where `ecs-eth0` is some internal ECS interface I don’t care about and `eth0` is the one that should handle communication. The problem is that when Neo4j binds the listen port to `eth0`, everything works fine; when it binds to `ecs-eth0`, port 5000 is unreachable and the node can’t join the cluster. And this happens completely at RANDOM; see more logs in the GitHub issue comment (Discovery binds on docker0 · Issue #12221 · neo4j/neo4j · GitHub).
- Setting `causal_clustering.discovery_listen_address=10.10.x.x:5000` doesn’t work either. Even when I can see the correct bind info from `lsof`: `TCP ip-10-10-1-72.ec2.internal:5000 (LISTEN)` and in the connection info log `Discovery: listen=10.10.1.72:5000, advertised=10.10.1.72:5000`, it still doesn’t work (the nodes simply don’t discover each other, even though I can access port 5000 with `nc`), and it is a complete mystery to me why.
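For anyone who wants to reproduce the bind check, here is a minimal Python sketch that classifies the LISTEN lines printed by the `lsof` command above. The function name `classify_listener` and the regular expression are purely illustrative (they are not part of Neo4j or `lsof`), and the sketch assumes IPv4 output shaped like the samples in this post.

```python
import ipaddress
import re

# Matches the tail of an `lsof -i tcp` LISTEN line, e.g. "TCP *:7000 (LISTEN)".
# Assumption: IPv4 or hostname output only; bracketed IPv6 is not handled.
LISTEN_RE = re.compile(r"TCP\s+(\S+):(\d+)\s+\(LISTEN\)")


def classify_listener(lsof_line):
    """Return (host, port, kind) for a LISTEN line, or None if it is not one."""
    match = LISTEN_RE.search(lsof_line)
    if match is None:
        return None
    host, port = match.group(1), int(match.group(2))
    if host == "*":
        # Bound to all interfaces, which is what 0.0.0.0 should produce.
        return host, port, "wildcard"
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # lsof may print a resolved hostname instead of a numeric address.
        return host, port, "hostname"
    # 169.254.0.0/16 is link-local, like the ecs-eth0 address in this post.
    kind = "link-local" if address.is_link_local else "specific"
    return host, port, kind


if __name__ == "__main__":
    sample = "java 6 neo4j 233u IPv4 427591 0t0 TCP 169.254.172.28:5000 (LISTEN)"
    print(classify_listener(sample))
```

Feeding it each line of the `lsof` output makes it easy to spot a node whose port 5000 ended up on the link-local `169.254.172.x` address instead of the wildcard or the `10.10.x.x` address.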
Thanks!