Swarm overlay network does not route IP addresses (without userns-remap) after restarting nodes #34165

@umegaya

Description

I use a 3-node swarm with 3 managers on AWS; each node was created by docker-machine (ami-87b917e4).
After restarting the nodes, some containers cannot communicate with each other via IP address or service name.

Steps to reproduce the issue:

  1. Create the network:
docker network create --driver overlay --subnet 10.0.150.0/24 prod-nw
  2. Create 4 backend services and 1 frontend service; the frontend runs in global mode. Note that each container has at least one publish setting (I omit the fluent logger settings for simplicity):
docker service create --name backend-1 --replicas 1 --with-registry-auth --network prod-nw --publish 8200:8082 $(backend-1-image)
docker service create --name backend-2 --replicas 1 --with-registry-auth --network prod-nw --publish 8201:8082 $(backend-2-image)
docker service create --name backend-3 --replicas 1 --with-registry-auth --network prod-nw --publish 8100:8082 $(backend-3-image)
docker service create --name backend-4 --replicas 1 --with-registry-auth --network prod-nw --publish 8101:8082 $(backend-4-image)
docker service create --name frontend --mode global --publish mode=host,published=50051,target=50051 --with-registry-auth --network prod-nw --publish mode=host,published=8082,target=8082 $(frontend-image)
  3. After restarting the nodes, try to connect to the other services via the DNS entry/VIP.
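
For the last step, it can help to know which VIP a service name should resolve to. The following is a minimal sketch, assuming it runs on a manager node with the docker CLI available; `strip_cidr` and `service_vips` are hypothetical helper names, not part of Docker. `docker service inspect` reports each VIP with its prefix length, so the helper strips that suffix.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

# strip_cidr ADDR/PREFIX
# Prints the address without its prefix length.
strip_cidr() {
  printf '%s\n' "${1%/*}"
}

# service_vips SERVICE
# Prints one VIP per line for the given swarm service (one per attached
# network). Must run where `docker service inspect` works (a manager node).
service_vips() {
  local addr
  for addr in $(docker service inspect \
      --format '{{range .Endpoint.VirtualIPs}}{{.Addr}} {{end}}' "$1"); do
    strip_cidr "$addr"
  done
}

# Example (on a manager): service_vips backend-2
```

From inside a container attached to prod-nw, resolving `backend-2` should return the VIP, while `tasks.backend-2` should return the individual task IPs; a tool such as `getent hosts` can show both, assuming the image ships it.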

Describe the results you received:

  • Each container had the following IPs on prod-nw:
backend-1:  10.0.150.30
backend-2: 10.0.150.12
backend-3: 10.0.150.32
backend-4: 10.0.150.17
frontend-1: 10.0.150.15
frontend-2: 10.0.150.4
frontend-3: 10.0.150.9
  • Most connectivity works well, except:
frontend-1 <-> backend-4
frontend-2 <-> backend-2
backend-2 -> frontend-3 (weird, because the connection from frontend-3 to backend-2 seems to be established)
  • When connectivity is lost, even connecting by direct IP fails with the following errors:
    • No route to host at 10.0.150.12 (backend-2 -> frontend-3)
    $ telnet 10.0.150.9 50051
    Trying 10.0.150.9...
    telnet: Unable to connect to remote host: No route to host
    $ netstat -an | grep ESTABLISHED # report connection established
    tcp        0      0 10.0.150.12:50051       10.0.150.9:53242        ESTABLISHED
    tcp        0      0 10.0.150.12:50051       10.0.150.15:55472       ESTABLISHED
    
    • Connection timed out at 10.0.150.17 (backend-4 -> frontend-1)
    $ telnet 10.0.150.15 50051
    Trying 10.0.150.15...
    telnet: Unable to connect to remote host: Connection timed out
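
The manual telnet checks above can be repeated for every address in one pass. This is a rough sketch, assuming the container has bash (for its `/dev/tcp` redirection) and a `timeout` command; `probe` and `probe_all` are hypothetical names. It reports any failure as "unreachable", so it does not distinguish "No route to host" from a timeout.

```shell
#!/usr/bin/env bash
# Sketch only: function names are hypothetical.

# probe HOST PORT [TIMEOUT_SECONDS]
# Tries to open a TCP connection and prints "HOST:PORT open" or
# "HOST:PORT unreachable". Returns 0 only if the connection succeeded.
probe() {
  local host=$1 port=$2 secs=${3:-3}
  if timeout "$secs" bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$host:$port open"
  else
    echo "$host:$port unreachable"
    return 1
  fi
}

# probe_all PORT HOST...
# Probes every host on the same port and prints a final failure count.
probe_all() {
  local port=$1 failed=0 host
  shift
  for host in "$@"; do
    probe "$host" "$port" || failed=$((failed + 1))
  done
  echo "failed: $failed"
}

# Example, run inside a backend container against the frontend tasks:
#   probe_all 50051 10.0.150.15 10.0.150.4 10.0.150.9
```

Running the same call from each container gives a full picture of which directions fail.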
    

Describe the results you expected:
I expected to be able to connect to the service using the VIP created for the service and route accordingly.

Additional information you deem important (e.g. issue happens only occasionally):
It's similar to #26106, but with a few differences, so it was suggested that I create a new issue:

  • Using AWS Docker instances created by docker-machine (Ubuntu 16.04 LTS)
  • I do not explicitly specify the userns-remap setting (I'm not sure whether it is set implicitly)
  • Connecting fails not only by container name but also by direct IP (No route to host)
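
Whether userns-remap is active can be checked from the daemon's reported security options, which include a `userns` entry when remapping is enabled. The snippet below is a small sketch under that assumption; `userns_enabled` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch only: the helper name is hypothetical.

# userns_enabled SECURITY_OPTIONS
# Succeeds if the security-options string mentions "userns".
userns_enabled() {
  case "$1" in
    *userns*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example, run on each node:
#   opts=$(docker info --format '{{.SecurityOptions}}')
#   if userns_enabled "$opts"; then echo "userns-remap: on"; else echo "userns-remap: off"; fi
```

The `docker info` output in this report lists only apparmor and seccomp under Security Options, which suggests remapping is not enabled on that node, though the other nodes would need the same check.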

Output of docker version:

Client:
 Version:      17.03.1-ce
 API version:  1.27
 Go version:   go1.7.5
 Git commit:   c6d412e
 Built:        Mon Mar 27 17:14:09 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.05.0-ce
 API version:  1.29 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   89658be
 Built:        Thu May  4 22:10:54 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

Containers: 55
 Running: 7
 Paused: 0
 Stopped: 48
Images: 74
Server Version: 17.05.0-ce
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 281
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: active
 NodeID: kcpanuat85bztrvktep186fg8
 Is Manager: true
 ClusterID: qclswzn5foalbgmlkhh2e95i6
 Managers: 3
 Nodes: 3
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 3
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
 Node Address: 172.32.11.239
 Manager Addresses:
  172.32.11.239:2377
  172.32.11.40:2377
  172.32.2.28:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 9048e5e50717ea4497b757314bad98ea3763c145
runc version: 9c2d8d184e5da67c95d601382adf14862e4f2228
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-79-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 15.67GiB
Name: swarm-master
ID: YAOV:4AKS:YOJL:GKDF:HHTV:XW24:ZMOI:M7HU:7T2Q:E5PZ:5KW4:45FI
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
 provider=amazonec2
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):
AWS, 3-node swarm, 3 managers, each node created by docker-machine (ami-87b917e4)
