
Requests hang or timeout when adding nodes to cluster (AWS EKS + NLB) #6828

Closed
@deliahu

Description

NGINX Ingress controller version: v0.43.0 (linked from the deployment guide)

Kubernetes version: 1.18.9

Environment:

  • Cloud provider configuration: AWS EKS (via eksctl)

What happened:

~4 minutes after a new node joins the cluster, requests to a service that existed before the node was added hang for 1-3 minutes, or fail with one of the following errors:

Python (requests 2.24.0):

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 144, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 83, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out

Go (version 1.14.7):

Get "http://a0ddaf25bbd5e40388621bf6afe33c17-b1542fdcbd989302.elb.us-west-2.amazonaws.com": dial tcp 54.201.91.39:80: i/o timeout

No pod autoscaling is involved: there is a single pod running on a single node, and the problem occurs when a new node is added, without any new pods being requested or scheduled.

What you expected to happen:

Requests to the existing service should not hang or time out when nodes join the cluster.

How to reproduce it:

# create a cluster with 1 node (uses k8s 1.18 by default)
eksctl create cluster --region us-west-2 --name test --node-type t3.small --nodes 1 --nodes-max 2 --nodegroup-name ng

# install ingress-nginx
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.43.0/deploy/static/provider/aws/deploy.yaml

# wait a minute, then deploy a sample deployment + service + ingress (see below for app.yaml):
kubectl apply -f app.yaml

# wait a minute, and then get the load balancer's hostname:
endpoint="http://$(kubectl get service -n ingress-nginx ingress-nginx-controller -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"

# after the load balancer is ready (5-10 min), check that the endpoint works (the response will include "Hello world"):
curl $endpoint

# start a script which makes ~2 requests per second:
while true; do time curl --silent $endpoint >/dev/null; sleep 0.5; done
# or using Python (see below for load.py):
python3 load.py $endpoint
# or using Go (see below for load.go):
go run load.go $endpoint

# in another terminal window, add a node to the cluster:
eksctl scale nodegroup --region us-west-2 --cluster test --name ng --nodes 2

# after the eksctl command returns, you can watch aws-node and kube-proxy get created on the new node:
watch kubectl get pods --namespace kube-system

After the new aws-node and kube-proxy pods show an AGE of ~3-8 minutes, the request script hangs, usually for 60-260 seconds with curl and Python (often it seems to be around 130 seconds) and ~10 seconds with Go.
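
To confirm that the new aws-node and kube-proxy pods are the ones running on the newly added node, the same watch can include the node column (just an observation aid; not needed for the reproduction):

# -o wide adds a NODE column to the pod listing
watch kubectl get pods --namespace kube-system -o wide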

Here are the files I referenced above:

# app.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: api
          image: tiangolo/uvicorn-gunicorn-fastapi:python3.7
          ports:
            - containerPort: 80

---
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80

---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: my-api
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - http:
        paths:
          - path: /
            backend:
              serviceName: my-api
              servicePort: 80

# load.py

import requests
import sys
import time

url = sys.argv[1]
print("status code  request time")

while True:
    t = time.time()
    response = requests.get(url)
    print(f"{response.status_code}          {time.time() - t}s")
    time.sleep(0.5)

// load.go

package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	url := os.Args[1]
	fmt.Println("status code  request time")

	for {
		t := time.Now()
		response, err := http.Get(url)
		if err != nil {
			panic(err)
		}
		response.Body.Close()
		fmt.Printf("%d          %s\n", response.StatusCode, time.Since(t))
		time.Sleep(500 * time.Millisecond)
	}
}

Anything else we need to know:

  • If I modify the controller Service in the default ingress-nginx configuration by removing the service.beta.kubernetes.io/aws-load-balancer-type annotation (thereby using a classic ELB), and/or by removing externalTrafficPolicy (thereby defaulting to Cluster), I do not observe this problem (a sketch of these edits follows this list). However, I would like to use an NLB to support VPC Links in API Gateway, and I would like to use externalTrafficPolicy: Local to preserve the client IP address and avoid the extra hop. Also, based on the deployment docs and the default configuration, NLB + externalTrafficPolicy: Local seems to be the recommended approach.

  • The impact is less severe when running the same load script in Go: requests hang for ~10 seconds (at the same point in time when the requests from Python or curl hang for 1-3 minutes). Sometimes the request instead returns with the error pasted above.
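
A minimal sketch of the two edits mentioned above, using the Service name and namespace from the default deploy.yaml: the externalTrafficPolicy change can be applied to the live controller Service with a patch, while the load-balancer-type annotation I removed by editing deploy.yaml before the initial kubectl apply, since (as far as I can tell) the load balancer type is only honored when the Service is first created.

# revert the controller Service to the default traffic policy (Cluster)
kubectl patch service ingress-nginx-controller -n ingress-nginx \
  --type merge -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'

# for the classic ELB variant, delete the service.beta.kubernetes.io/aws-load-balancer-type
# annotation from the ingress-nginx-controller Service in deploy.yaml before running the
# initial kubectl apply from the reproduction steps above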

We have been wrestling with this for the past few days, so I'd love to hear anyone's thoughts on whether this is a bug in ingress-nginx, a bug somewhere else, or a mistake I made (unlikely since I just used the defaults for everything).

Thank you!

/kind bug
