Description
NGINX Ingress controller version: v0.43.0 (linked from the deployment guide)
Kubernetes version: 1.18.9
Environment:
- Cloud provider configuration: AWS EKS (via eksctl)
What happened:
~4 minutes after a new node joins the cluster, requests to a service (which existed before the node was added) get stuck for 1-3 minutes, or exit with one of the following errors:
Python (requests==v2.24.0):
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/urllib3/connection.py", line 144, in _new_conn
    (self.host, self.port), self.timeout, **extra_kw)
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 83, in create_connection
    raise err
  File "/usr/lib/python3/dist-packages/urllib3/util/connection.py", line 73, in create_connection
    sock.connect(sa)
TimeoutError: [Errno 110] Connection timed out
Go (version 1.14.7):
Get "http://a0ddaf25bbd5e40388621bf6afe33c17-b1542fdcbd989302.elb.us-west-2.amazonaws.com": dial tcp 54.201.91.39:80: i/o timeout
No pod autoscaling is involved; there was one pod running on one node, and this happens when a new node is added, without any new pods being requested or scheduled.
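A quick way to confirm this (using the labels from app.yaml below; the pod stays on the original node the whole time):
kubectl get nodes
kubectl get pods -l app=my-api -o wide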
What you expected to happen:
Requests to the existing service should not hang or timeout when nodes join the cluster.
How to reproduce it:
# create a cluster with 1 node (uses k8s 1.18 by default)
eksctl create cluster --region us-west-2 --name test --node-type t3.small --nodes 1 --nodes-max 2 --nodegroup-name ng
# install ingress-nginx
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/controller-v0.43.0/deploy/static/provider/aws/deploy.yaml
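# (optional) instead of a fixed wait, this should block until the controller pod is ready
# (label selector assumed from the upstream ingress-nginx manifests):
kubectl wait --namespace ingress-nginx --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller --timeout=120s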
# wait a minute, then deploy a sample deployment + service + ingress (see below for app.yaml):
kubectl apply -f app.yaml
# wait a minute, and then get the load balancer's hostname:
endpoint="http://$(kubectl get service -n ingress-nginx ingress-nginx-controller -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')"
# after the load balancer is ready (5-10 min), check that the endpoint works (the response will include "Hello world"):
curl $endpoint
# start a script which makes ~2 requests per second:
while true; do time curl --silent $endpoint >/dev/null; sleep 0.5; done
# or using Python (see below for load.py):
python3 load.py $endpoint
# or using Go (see below for load.go):
go run load.go $endpoint
# in another terminal window, add a node to the cluster:
eksctl scale nodegroup --region us-west-2 --cluster test --name ng --nodes 2
# after the eksctl command returns, you can watch aws-node and kube-proxy get created on the new node:
watch kubectl get pods --namespace kube-system
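# (optional) in a third terminal, watch for the new node becoming Ready,
# to correlate the hang with the moment the node joins:
kubectl get nodes --watch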
After the new aws-node and kube-proxy pods show an AGE of ~3-8 minutes, the request script hangs, usually for 60-260 seconds for curl and Python (often it seems to be 130 seconds), and 10 seconds for Go.
Here are the files I referenced above:
# app.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-api
  template:
    metadata:
      labels:
        app: my-api
    spec:
      containers:
        - name: api
          image: tiangolo/uvicorn-gunicorn-fastapi:python3.7
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: my-api
spec:
  selector:
    app: my-api
  ports:
    - protocol: TCP
      port: 80
      targetPort: 80
---
apiVersion: networking.k8s.io/v1beta1
kind: Ingress
metadata:
  name: my-api
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - http:
        paths:
          - path: /
            backend:
              serviceName: my-api
              servicePort: 80
# load.py
import requests
import sys
import time

url = sys.argv[1]
print("status code request time")
while True:
    t = time.time()
    response = requests.get(url)
    print(f"{response.status_code} {time.time() - t}s")
    time.sleep(0.5)
// load.go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	url := os.Args[1]
	fmt.Println("status code request time")
	for {
		t := time.Now()
		response, err := http.Get(url)
		if err != nil {
			panic(err)
		}
		response.Body.Close()
		fmt.Printf("%d %s\n", response.StatusCode, time.Since(t))
		time.Sleep(500 * time.Millisecond)
	}
}
Anything else we need to know:
- If I modify the service in the default ingress-nginx configuration by removing the service.beta.kubernetes.io/aws-load-balancer-type annotation (thereby using the classic ELB), and/or by removing externalTrafficPolicy (thereby defaulting to Cluster), then I do not observe this problem. However, I would like to use an NLB to support VPC Links in API Gateway, and I would like to use externalTrafficPolicy: Local to preserve the client IP address and avoid the extra hop. Also, based on the deployment docs and default configuration, NLB + externalTrafficPolicy: Local seems to be the recommended approach. (See the patch sketch after this list.)
- The impact is less extreme when running the same load script from Go; requests hang for ~10 seconds (at the same point in time when the requests from Python or curl hang for 1-3 minutes). Sometimes it returns with the error I pasted above.
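For reference, the externalTrafficPolicy half of that workaround can also be tried on a running cluster with a patch along these lines (a sketch, not something I've verified end-to-end; I believe the load balancer type annotation is only honored when the Service is created, so that part needs a recreate rather than a patch):
# sketch: fall back to the default externalTrafficPolicy on the live controller Service
kubectl patch service ingress-nginx-controller --namespace ingress-nginx \
  -p '{"spec": {"externalTrafficPolicy": "Cluster"}}'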
We have been wrestling with this for the past few days, so I'd love to hear anyone's thoughts on whether this is a bug in ingress-nginx, a bug somewhere else, or a mistake I made (which seems unlikely, since I just used the defaults for everything).
Thank you!
/kind bug