
Requests hang or timeout when adding nodes to cluster (AWS EKS + NLB) #6828

Closed
@deliahu

Description

NGINX Ingress controller version: v0.43.0 (linked from the deployment guide)

Kubernetes version: 1.18.9

Environment:

  • Cloud provider configuration: AWS EKS (via eksctl)

What happened:

~4 minutes after a new node joins the cluster, requests to a service that existed before the node was added hang for 1-3 minutes, or fail with one of the following errors:

Python (requests 2.24.0):

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 144, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 83, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

Go (version 1.14.7):

Get "http://a0ddaf25bbd5e40388621bf6afe33c17-b1542fdcbd989302.elb.us-west-2.amazonaws.com": dial tcp 54.201.91.39:80: i/o timeout

No pod autoscaling is involved: there is a single pod running on a single node, and the problem occurs when a new node is added, without any new pods being requested or scheduled.

What you expected to happen:

Requests to the existing service should not hang or time out when nodes join the cluster.

How to reproduce it:

# create a cluster with 1 node (uses k8s 1.18 by default)
eksctl create cluster --region us-west-2 --name test --node-type t3.small --nodes 1 --nodes-max 2 --nodegroup-name ng

# install ingress-nginx
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.43.0/deploy/static/provider/aws/deploy.yaml

# wait a minute, then deploy a sample deployment + service + ingress (see below for app.yaml):
kubectl apply -f app.yaml

# wait a minute, and then get the load balancer's hostname:
endpoint="http://$(kubectl get service -n ingress-nginx ingress-nginx-controller -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"

# after the load balancer is ready (5-10 min), check that the endpoint works (the response will include "Hello world"):
curl $endpoint

# start a script which makes ~2 requests per second:
while true; do time curl --silent $endpoint >/dev/null; sleep 0.5; done
# or using Python (see below for load.py):
python3 load.py $endpoint
# or using Go (see below for load.go):
go run load.go $endpoint

# in another terminal window, add a node to the cluster:
eksctl scale nodegroup --region us-west-2 --cluster test --name ng --nodes 2

# after the eksctl command returns, you can watch aws-node and kube-proxy get created on the new node:
watch kubectl get pods --namespace kube-system

After the new aws-node and kube-proxy pods show an AGE of ~3-8 minutes, the request script hangs, usually for 60-260 seconds with curl and Python (often it seems to be around 130 seconds) and ~10 seconds with Go.
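
To confirm that the new aws-node and kube-proxy pods are the ones running on the newly added node, the same watch can include the node column (just an observation aid; not needed for the reproduction):

# -o wide adds a NODE column to the pod listing
watch kubectl get pods --namespace kube-system -o wide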

Here are the files I referenced above:

# app.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: api
          image: tiangolo/uvicorn-gunicorn-fastapi:python3.7
          ports:
            - containerPort: 80

---
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: my-api
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - http:
        paths:
          - path: /
            backend:
              serviceName: my-api
              servicePort: 80

# load.py

import requests
import sys
import time

url = sys.argv[1]
print("status code  request time")

while True:
    t = time.time()
    response = requests.get(url)
    print(f"{response.status_code}          {time.time() - t}s")
    time.sleep(0.5)

// load.go

package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	url := os.Args[1]
	fmt.Println("status code  request time")

	for {
		t := time.Now()
		response, err := http.Get(url)
		if err != nil {
			panic(err)
		}
		response.Body.Close()
		fmt.Printf("%d          %s\n", response.StatusCode, time.Since(t))
		time.Sleep(500 * time.Millisecond)
	}
}

Anything else we need to know:

  • If I modify the controller Service in the default ingress-nginx configuration by removing the service.beta.kubernetes.io/aws-load-balancer-type annotation (thereby using a classic ELB), and/or by removing externalTrafficPolicy (thereby defaulting to Cluster), I do not observe this problem (a sketch of these edits follows this list). However, I would like to use an NLB to support VPC Links in API Gateway, and I would like to use externalTrafficPolicy: Local to preserve the client IP address and avoid the extra hop. Also, based on the deployment docs and the default configuration, NLB + externalTrafficPolicy: Local seems to be the recommended approach.

  • The impact is less severe when running the same load script in Go: requests hang for ~10 seconds (at the same point in time when the requests from Python or curl hang for 1-3 minutes). Sometimes the request instead returns with the error pasted above.
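
A minimal sketch of the two edits mentioned above, using the Service name and namespace from the default deploy.yaml: the externalTrafficPolicy change can be applied to the live controller Service with a patch, while the load-balancer-type annotation I removed by editing deploy.yaml before the initial kubectl apply, since (as far as I can tell) the load balancer type is only honored when the Service is first created.

# revert the controller Service to the default traffic policy (Cluster)
kubectl patch service ingress-nginx-controller -n ingress-nginx \
  --type merge -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'

# for the classic ELB variant, delete the service.beta.kubernetes.io/aws-load-balancer-type
# annotation from the ingress-nginx-controller Service in deploy.yaml before running the
# initial kubectl apply from the reproduction steps above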

We have been wrestling with this for the past few days, so I'd love to hear anyone's thoughts on whether this is a bug in ingress-nginx, a bug somewhere else, or a mistake I made (unlikely since I just used the defaults for everything).

Thank you!

/kind bug
