
Regular crash of nginx ingress controller pods after upgrade to 0.20.0 image #3457

Closed
EcaterinaGr opened this issue Nov 22, 2018 · 28 comments · Fixed by #3684

Comments

@EcaterinaGr

EcaterinaGr commented Nov 22, 2018

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

NGINX Ingress controller version:
0.20.0

Kubernetes version (use kubectl version):
1.9.6

Environment:
Production

  • Cloud provider or hardware configuration:
    Azure
  • OS (e.g. from /etc/os-release):
    Ubuntu 16.04
  • Kernel (e.g. uname -a):
Linux k8s-master-81594228-1 4.13.0-1012-azure #15-Ubuntu SMP Thu Mar 8 10:47:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    acs engine, ansible, terraform
  • Others:

What happened:
After upgrading the NGINX ingress controller from 0.15.0 to 0.20.0, the nginx ingress controller pods regularly crash after several timeout messages on the liveness probe. The nginx ingress controller pods run on separate VMs from all other pods. We need version 0.20.0 because we want to set use-forwarded-headers: "false" in the nginx ConfigMap to avoid the security vulnerability (a user forging the headers to bypass the nginx whitelist); a sketch of that change is shown below.
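
For reference, a rough sketch of the ConfigMap change mentioned above, assuming the controller ConfigMap is named nginx-configuration in the ingress-nginx namespace (both names are assumptions, adjust to your setup):

# Sketch only: set use-forwarded-headers in the controller ConfigMap
kubectl patch configmap nginx-configuration -n ingress-nginx \
  --type merge -p '{"data":{"use-forwarded-headers":"false"}}'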

What you expected to happen:
Stable behavior of nginx ingress controller pods as in version 0.15.0.

How to reproduce it (as minimally and precisely as possible):
Update the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.15.0 to
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0 on an existing nginx ingress controller deployment, for example as shown below.
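
For example, assuming a deployment and container both named nginx-ingress-controller in the ingress-nginx namespace (names are assumptions), the image can be swapped with:

# Sketch only: swap the controller image on an existing deployment
kubectl set image deployment/nginx-ingress-controller \
  nginx-ingress-controller=quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0 \
  -n ingress-nginx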

Anything else we need to know:
Logs from the events:
2018-11-21 17:24:25 +0100 CET 2018-11-21 17:23:05 +0100 CET 6 nginx-ingress-controller-7d47db4569-9bxtz.1569303cf3aebbba Pod spec.containers{nginx-ingress-controller} Warning Unhealthy kubelet, k8s-dmz-81594228-0 Liveness probe failed: Get http://xx.xx.xx.xx:10254/healthz:: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-11-21 17:24:26 +0100 CET 2018-11-21 17:24:26 +0100 CET 1 nginx-ingress-controller-7d47db4569-9bxtz.1569304fae92655c Pod spec.containers{nginx-ingress-controller} Normal Killing kubelet, k8s-dmz-81594228-0 Killing container with id docker://nginx-ingress-controller:Container failed liveness probe.. Container will be killed and recreated.

We have tried increasing the timeoutSeconds on the liveness probe to 4s and also adding - --enable-dynamic-configuration=false to the nginx deployment (a sketch follows below).
With this configuration the number of timeouts decreased, but once the apps on the platform put a certain load on the controller, the timeouts become regular again.
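
A sketch of the liveness-probe change, assuming the controller is the first container in a deployment named nginx-ingress-controller in the ingress-nginx namespace (names and container index are assumptions); the --enable-dynamic-configuration=false flag was added to the container args of the same deployment:

# Sketch only: raise the liveness probe timeout to 4s
kubectl patch deployment nginx-ingress-controller -n ingress-nginx --type json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 4}]'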

Logs from the nginx pods in debug mode with a 3s timeout:
{"log":"E1121 13:30:34.808413 5 controller.go:232] Error getting ConfigMap "kube-system/udp-services": no object matching key "kube-system/udp-services" in local store\n","stream":"stderr","time":"2018-11-21T13:30:34.818557076Z"}
{"log":"I1121 13:30:37.500168 5 main.go:158] Received SIGTERM, shutting down\n","stream":"stderr","time":"2018-11-21T13:30:37.501123038Z"}
{"log":"I1121 13:30:37.500229 5 nginx.go:340] Shutting down controller queues\n","stream":"stderr","time":"2018-11-21T13:30:37.501167238Z"}
{"log":"I1121 13:30:37.500276 5 nginx.go:348] Stopping NGINX process\n","stream":"stderr","time":"2018-11-21T13:30:37.501203538Z"}

@aledbf
Member

aledbf commented Nov 22, 2018

@EcaterinaGr please post the start of the log. Maybe you are being affected by Azure/AKS#435; you should follow that issue.

@EcaterinaGr
Author

EcaterinaGr commented Nov 22, 2018

@aledbf, are you asking for the start of the events log or of the pod log itself?
Actually, before the SIGTERM we only see the normal request logs on the controllers and no other errors. The SIGTERM is sent by the kubelet (as we can see from the events), but we cannot identify the reason.
We are not using AKS; we use acs-engine, and nginx worked well on our platform with the previous version 0.15.0.

Grabbed more logs from the docker container itself before it was killed:
2018/11/21 19:30:15 [alert] 2187#2187: *84403 open socket #47 left in connection 19
2018/11/21 19:30:15 [alert] 2187#2187: aborting
W1121 19:30:22.655014 6 controller.go:773] service utm-demo/todoapi-svc-demo does not have any active endpoints
W1121 19:30:22.655191 6 controller.go:773] service aviation-data-prod/rawstorage-service does not have any active endpoints
W1121 19:30:22.655466 6 controller.go:773] service atfm-dev/innove-gateway-service does not have any active endpoints
W1121 19:30:22.657725 6 controller.go:773] service utm-dev/utm-backend-api-svc-test does not have any active endpoints
W1121 19:30:22.658510 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.658665 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.658768 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.659129 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
E1121 19:30:22.660605 6 controller.go:221] unexpected error reading configmap kube-system/tcp-services: configmap kube-system/tcp-services was not found
E1121 19:30:22.660670 6 controller.go:221] unexpected error reading configmap kube-system/udp-services: configmap kube-system/udp-services was not found
I1121 19:30:25.914652 6 main.go:127] Received SIGTERM, shutting down
I1121 19:30:25.914678 6 nginx.go:362] shutting down controller queues
I1121 19:30:25.914709 6 status.go:119] updating status of Ingress rules (remove)
I1121 19:30:25.954696 6 nginx.go:370] stopping NGINX process...
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1400
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1400
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1416
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1416
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:8990
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:8990
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:9006
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:9006
2018/11/21 19:30:25 [notice] 2401#2401: signal process started

@Demonsthere

Hi,
I wanted to note that I had the same constant crashing with 0.20 on GCP. I updated to 0.21 just now and it is better, but the crashes still occur. This time I saw in the logs:

I1123 10:25:15.605676       9 backend_ssl.go:189] Updating local copy of SSL certificate "default/prow-tls" with missing intermediate CA certs
I1123 10:29:47.579872       9 main.go:163] Received SIGTERM, shutting down
I1123 10:29:47.579901       9 nginx.go:343] Shutting down controller queues
I1123 10:29:47.579917       9 status.go:200] updating status of Ingress rules (remove)

@fntlnz
Contributor

fntlnz commented Nov 29, 2018

Related to this: #2833

@EcaterinaGr
Author

Hello,

Any updates regarding the issue?

@Flowkap

Flowkap commented Dec 21, 2018

As soon as we add certificates (created the same way as on our dev clusters), nginx starts to crash.
As long as one instance is alive it actually works, but it does not take long for all instances to finally crash :/
We do not observe this on Kubernetes 1.11.3, but we do on 1.12.3.

@katanyoussef

Is there a solution for this bug? Thanks.

@Globegitter
Contributor

Globegitter commented Jan 18, 2019

We have been experiencing similar issues where we see:

  Warning  Unhealthy  8m5s (x12 over 20m)    kubelet, ip-xxx-eu-central-1.compute.internal  Readiness probe failed: Get http://100.96.5.100:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  8m1s (x11 over 20m)    kubelet, ip-xxx.eu-central-1.compute.internal  Liveness probe failed: Get http://100.96.5.100:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

much more frequently for some of our nginx pods than we did in the past, and it is leading to public-facing errors; sometimes it manifests as certificate errors (we do TLS termination with nginx). I don't know exactly when it started, but we first noticed it on 0.20.0, and now on 0.22.0 we are still seeing these issues.

Our probe timeout is set to 1s, and we also do not see any errors in the nginx controller itself. As of now we have no idea why this is happening.

We are running on AWS with a kops cluster on k8s 1.11.6.

@ElvinEfendi
Member

@Globegitter what does memory and CPU utilization look like during that time? It could be network related too.

If it is feasible, you can also try downgrading to the earlier ingress version that worked well for you, to confirm that the problem is caused by ingress-nginx.

@Globegitter
Contributor

Globegitter commented Jan 18, 2019

@ElvinEfendi You mean specifically for these nginx pods? Or node/cluster wide? We still have to investigate further, and we have been serving quite a bit more traffic than we did a month ago. This issue just seemed interestingly fitting, but it could very well be a coincidence.

Unfortunately we are using a 0.20.0-specific feature, so downgrading is going to be a bit more difficult, but depending on how this develops we will definitely do some investigation on that next week.

@ElvinEfendi
Member

You mean specifically of these nginx pods

I guess both. I'm suggesting that to make sure the timeout is not happening because the pod is too busy doing other things and therefore cannot respond to health checks in time.

@Globegitter
Contributor

@ElvinEfendi yeah, that is a good pointer: in the last two hours or so we have only had one pod crashing a fair bit, and it is the pod with much higher CPU and memory usage than the others. That is very interesting; why would that one pod have a much higher usage than our other three pods? Furthermore, our cluster-wide traffic is below 2000 req/s, so even if all of it hit a single nginx I would still expect it to be able to respond to health probes.

@aledbf
Member

aledbf commented Jan 18, 2019

@Globegitter maybe you can help us test this by running some scripts :)

  1. Get information about the running pods:
     kubectl get pod -o wide -n ingress-nginx
  2. SSH to a node in the cluster.
  3. Build a list with the IP addresses and run this script:

POD_IPS="XX.XXX.XXX XX.XXX.XXX" # using the IPs from kubectl
while true; do
  sleep 1
  for x in $POD_IPS; do
    echo "[$(date)] http://$x:10254/healthz http status:" $(curl -I -o /dev/null -s -w "%{http_code}::%{time_namelookup}::%{time_connect}::%{time_starttransfer}::%{time_total}\n" http://$x:10254/healthz);
  done
done

This will print something like


[Fri Jan 18 15:47:33 UTC 2019] http://100.96.9.248:10254/healthz http status: 200::0.000029::0.000112::0.001546::0.001582
[Fri Jan 18 15:47:33 UTC 2019] http://100.96.3.200:10254/healthz http status: 200::0.000023::0.000817::0.002746::0.002782

This can help us check what @ElvinEfendi suggested in the previous comment.

Edit: maybe I should put this in a k8s Job to help debug this issue.
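
A rough sketch of what such a Job could look like, wrapped in a kubectl apply (the namespace, the curl image and the POD_IPS placeholder are assumptions; the loop exits after about five minutes so the Job can complete):

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: healthz-probe-debug
  namespace: ingress-nginx
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: probe
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          POD_IPS="XX.XXX.XXX XX.XXX.XXX"   # fill in from kubectl get pod -o wide
          i=0
          while [ $i -lt 300 ]; do
            i=$((i+1))
            sleep 1
            for x in $POD_IPS; do
              echo "[$(date)] http://$x:10254/healthz http status:" \
                $(curl -I -o /dev/null -s -w "%{http_code}::%{time_total}" http://$x:10254/healthz)
            done
          done
EOF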

@Globegitter
Contributor

Globegitter commented Jan 18, 2019

@aledbf Ah yeah, that is a good idea for debugging. We just raised the memory limit and the pods are re-rolling. But if the timeout issue persists or pops up again, I will make sure to use the debug script and post the results here.

Edit: And a job for that would really be useful.

@Globegitter
Contributor

Globegitter commented Jan 18, 2019

@aledbf it is still happening; here is some output:

[Fri Jan 18 16:14:08 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000031::0.003766::0.011774::0.011802
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.29.19:10254/healthz http status: 000::0.000026::0.000000::0.000000::0.000084
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.000289::2.548471::2.548508
[Fri Jan 18 16:14:10 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000719::0.010645::0.010668
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000742::0.002636::0.002657
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000024::0.000305::0.001203::0.001222
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000017::0.000239::2.434425::2.434456
[Fri Jan 18 16:14:14 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000700::0.003581::0.003603
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000023::0.000735::0.002499::0.002522
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000260::0.004873::0.004893
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000015::0.000253::2.936678::2.936709
[Fri Jan 18 16:14:18 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000024::0.000730::0.005554::0.005575
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000835::0.004889::0.004912
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000029::0.000293::0.004889::0.004909
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000015::0.000245::0.000943::0.000963
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000021::0.001020::0.003824::0.003846
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000713::0.004364::0.004391
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000036::0.000363::0.001800::0.001818
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000263::0.001097::0.001117
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000684::0.002500::0.002515
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000019::0.000715::0.002270::0.002295
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.002707::0.005841::0.005860
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000243::0.000742::0.000760
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000016::0.004325::0.006078::0.006114
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000029::0.000797::0.007521::0.007544
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000019::0.000263::0.003320::0.003340
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000252::0.001245::0.001263
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.000680::0.002750::0.002767
[Fri Jan 18 16:14:23 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000723::0.002503::0.002525
[Fri Jan 18 16:14:23 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000016::0.000238::1.813099::1.813174
[Fri Jan 18 16:14:25 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000017::0.000262::0.001793::0.001815
[Fri Jan 18 16:14:25 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000021::0.000799::0.002882::0.002901
[Fri Jan 18 16:14:26 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000738::0.002451::0.002471
[Fri Jan 18 16:14:26 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000259::3.507546::3.507583
[Fri Jan 18 16:14:29 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000029::0.000321::0.000890::0.000914
[Fri Jan 18 16:14:29 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000026::0.000910::0.004054::0.004078
[Fri Jan 18 16:14:30 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000028::0.000781::0.158132::0.158176
[Fri Jan 18 16:14:31 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000326::2.694099::2.694130
[Fri Jan 18 16:14:33 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000023::0.000311::0.004423::0.004448
[Fri Jan 18 16:14:33 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000742::0.003229::0.003247
[Fri Jan 18 16:14:34 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000024::0.000782::0.004251::0.004272
[Fri Jan 18 16:14:34 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000241::2.955700::2.955730
[Fri Jan 18 16:14:37 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000019::0.000262::0.001195::0.001246
[Fri Jan 18 16:14:37 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.001047::0.004546::0.004565
[Fri Jan 18 16:14:38 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000712::0.002325::0.002363
[Fri Jan 18 16:14:38 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000268::2.350015::2.350049
[Fri Jan 18 16:14:41 UTC 2019] http://100.96.29.81:10254/healthz http status: 000::0.000023::0.000000::0.000000::0.000069
[Fri Jan 18 16:14:41 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000712::0.005419::0.005438
[Fri Jan 18 16:14:42 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.001377::0.004157::0.004178
[Fri Jan 18 16:14:42 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000015::0.000240::2.529109::2.529148
[Fri Jan 18 16:14:44 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000019::0.000258::0.000976::0.000996
[Fri Jan 18 16:14:44 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000737::0.007455::0.007473
[Fri Jan 18 16:14:45 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000763::0.003782::0.003808
[Fri Jan 18 16:14:45 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000015::0.000321::1.605230::1.605260
[Fri Jan 18 16:14:47 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000019::0.000263::0.003015::0.003037
[Fri Jan 18 16:14:47 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000039::0.000741::0.006094::0.006114
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000708::0.002475::0.002499
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000017::0.000250::0.000817::0.000833
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000040::0.000250::0.008481::0.008504
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.000694::0.004170::0.004189
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000017::0.000761::0.008205::0.008231
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000019::0.000414::0.001783::0.001802
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000014::0.000211::0.504140::0.504379
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000023::0.000717::0.004022::0.004043
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000740::0.002447::0.002472
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000018::0.000275::0.001992::0.002008
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000020::0.000231::3.389365::3.389403
[Fri Jan 18 16:14:54 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000022::0.000727::0.004877::0.004900
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000815::0.002551::0.002571
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000016::0.000264::0.000845::0.000871
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000016::0.000233::2.998702::2.998749
[Fri Jan 18 16:14:58 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000024::0.000761::0.011797::0.011820
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000724::0.002941::0.002962
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000021::0.000262::0.001866::0.001886
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000025::0.000298::2.891728::2.891762
[Fri Jan 18 16:15:02 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000716::0.002861::0.002882
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000025::0.002314::0.006595::0.006644
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000027::0.001592::0.002399::0.002424
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000537::3.195221::3.195255
[Fri Jan 18 16:15:06 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000738::0.002648::0.002672
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000022::0.000738::0.003204::0.003243
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000019::0.000262::0.002440::0.002463
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000020::0.000231::2.069995::2.070088
[Fri Jan 18 16:15:09 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000019::0.000710::0.002243::0.002261
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000766::0.020720::0.020747
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.29.19:10254/healthz http status: 000::0.000019::0.000000::0.000000::0.000069
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000019::0.000242::2.151019::2.151056
[Fri Jan 18 16:15:12 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000019::0.000688::0.002563::0.002586
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000727::0.008094::0.008117
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000016::0.000286::0.001543::0.001564
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000014::0.000244::1.980282::1.980323
[Fri Jan 18 16:15:15 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000055::0.000778::0.004642::0.004668
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000025::0.000897::0.003045::0.003076
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000323::0.006020::0.006047
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.000308::2.500127::2.500182
[Fri Jan 18 16:15:19 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000020::0.000723::0.007644::0.007663
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000028::0.000737::0.002864::0.002886
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000280::0.003027::0.003052
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000303::1.673258::1.673299
[Fri Jan 18 16:15:22 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000034::0.002531::0.004833::0.004939
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000031::0.000776::0.007756::0.007789
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000026::0.000483::0.005270::0.005308
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000281::0.274060::0.274095
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000025::0.000857::0.020886::0.020911
[Fri Jan 18 16:15:24 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000024::0.000709::0.002947::0.002970
[Fri Jan 18 16:15:24 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000018::0.000241::3.363062::3.363098
[Fri Jan 18 16:15:27 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.001813::0.009423::0.009447
[Fri Jan 18 16:15:27 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000023::0.000943::0.037287::0.037332
[Fri Jan 18 16:15:28 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000745::0.003082::0.003106

Some of the 500s mixed in are from when the pod was taken out of service by the k8s probes, but it is interesting to see that health checks sometimes take up to 3.5s.

@aledbf
Member

aledbf commented Jan 18, 2019

@Globegitter please check the log of the pod with IP 100.96.29.19.

@Globegitter
Contributor

Just checked: we do not have a logging collector active for the nginx logs yet, so I only see part of the logs of the previous pod that errored. I do see a few entries like:

2019/01/18 16:34:00 [error] 139#139: *12429 upstream prematurely closed connection while reading response header from upstream, client: xxxxxx, server: xxxxx, request: "GET /about HTTP/2.0", upstream: "http://100.96.29.102:8001/about", host: "xxxxx", referrer: "xxxxxxx"

and at the end:

I0118 16:34:08.456836       7 nginx.go:385] NGINX process has stopped
I0118 16:34:08.456938       7 main.go:175] Handled quit, awaiting Pod deletion
I0118 16:34:18.457336       7 main.go:178] Exiting with 0

Otherwise it is just 900+ lines of normal JSON request logs.

I can take a look next week at exporting the logs, or tailing them in real time. Would you expect to see something in the error log, or just the normal response logging?

@aledbf
Member

aledbf commented Jan 21, 2019

This issue is hard to reproduce. For this reason I created #3684 to see if we can narrow the scope of the issue: basically, I added additional logging to the health check inside the ingress controller to detect where we receive an error.

Please, help us using the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev

Note: for security reasons, the probes do not show the exact cause of the failure unless you increase the kubelet log level to a value higher than five (this is why I added the additional logging).

Note 2: for the same reason, this PR will never be merged; it exists only to track down the cause.

@Globegitter
Contributor

@aledbf Thanks, I will start investigating now. On Friday we were able to work around the issue by increasing the number of replicas, but it has started happening again. I will test this image.

@Globegitter
Contributor

So, we have some output:

E0121 15:17:44.742314       7 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
...
...
I0121 15:17:48.427437       7 nginx.go:385] NGINX process has stopped
I0121 15:17:48.428097       7 main.go:175] Handled quit, awaiting Pod deletion
E0121 15:17:54.742273       7 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
I0121 15:17:58.428663       7 main.go:178] Exiting with 0

for another pod:

E0121 15:15:06.599845       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
...
...
I0121 15:15:09.895122       6 nginx.go:385] NGINX process has stopped
I0121 15:15:09.895267       6 main.go:175] Handled quit, awaiting Pod deletion
E0121 15:15:16.600000       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
I0121 15:15:19.896690       6 main.go:178] Exiting with 0

But again, that is just from the logs I could tail; at least it gives us something more. I am also currently tailing all running pods with a grep for healthcheck, so if this happens again we should see the relevant logs there.

@aledbf
Member

aledbf commented Jan 21, 2019

Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080

This is "normal" because the health check starts before nginx. You should see only one of this error in the log.

E0121 15:17:44.742314
....
E0121 15:17:54.742273

This is not OK: the two errors are ten seconds apart, which means the ingress controller startup (the nginx binary) takes more than ten seconds for the initial sync. That is not normal.
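
One way to confirm how long the initial sync takes is to reuse the probe loop from earlier, timing from a node how long it takes for /healthz to first return 200 after a fresh pod starts (a rough sketch; POD_IP is a placeholder):

POD_IP=XX.XXX.XXX   # IP of a freshly started controller pod, from kubectl get pod -o wide
start=$(date +%s)
until [ "$(curl -o /dev/null -s -w '%{http_code}' http://$POD_IP:10254/healthz)" = "200" ]; do
  sleep 1
done
echo "healthz first returned 200 after $(( $(date +%s) - start ))s"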

@Globegitter
Contributor

Globegitter commented Jan 21, 2019

Yep, all logs show the same:

E0121 15:37:26.600460       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:36.601935       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:46.600979       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:56.600502       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:38:06.600290       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller public-nginx-ingress-controller-6d4c798cf6-tt7z9 E0121 15:38:16.600088       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-kll8s nginx-ingress-controller E0121 15:38:35.487322       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
E0121 15:38:45.536244       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
E0121 15:38:55.487948       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-kll8s nginx-ingress-controller E0121 15:39:05.487299       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused

Here again, I just looked over the logs of all the running pods and just grepped for healthcheck.

Edit: Ah, interesting. I will try to keep an eye specifically on startup; some of these definitely happen on instances that have been running well for a while, but we have not looked at startup "issues" yet.

@aledbf
Member

aledbf commented Jan 21, 2019

@Globegitter please check and post the generated nginx.conf

kubectl exec -n ingress-nginx <pod> -it cat nginx.conf | grep 18080

Is IPv6 enabled (or only IPv6)?
What CNI provider are you using?

@aledbf
Member

aledbf commented Jan 25, 2019

To those affected by this issue:

Please help us test a fix for this (#3684) using the image
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev

The mentioned PR contains a refactoring of the nginx server used for the health check and Lua configuration, replacing the TCP port with a Unix socket.
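
For anyone testing the dev image, a sketch of how the endpoint could be probed manually from inside a controller pod; the socket path below and the availability of curl in the image are assumptions, not something confirmed in this thread:

# Sketch only: probe the health endpoint over the unix socket (path is an assumption)
kubectl exec -n ingress-nginx <pod> -- \
  curl -sS --unix-socket /tmp/nginx-status-server.sock http://localhost/healthz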

@mdgreenwald

Hi, we were affected by this issue. We saw that 150 connections per second would cause our ingress controller to restart. When we checked, it was failing health checks just as described in this issue. We would sometimes see timeouts as high as 10.x seconds.

We tested again with:
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev

This dev image fully resolved our issue; we are no longer seeing restarts at high connection rates.

@sslavic
Contributor

sslavic commented Jul 6, 2020

Do I get it right that #4531 (shipped with 0.26.0) practically reverted the changes made for this issue in #3684 (shipped with 0.23.0)?

@aledbf
Member

aledbf commented Jul 6, 2020

@sslavic yes

@sslavic
Contributor

sslavic commented Jul 6, 2020

This could improve the situation in #5832 once released.
