
Regular crash of nginx ingress controller pods after upgrade to 0.20.0 image #3457

Closed
EcaterinaGr opened this issue Nov 22, 2018 · 28 comments · Fixed by #3684

Comments

@EcaterinaGr

EcaterinaGr commented Nov 22, 2018

Is this a request for help? (If yes, you should use our troubleshooting guide and community support channels, see https://kubernetes.io/docs/tasks/debug-application-cluster/troubleshooting/.):

What keywords did you search in NGINX Ingress controller issues before filing this one? (If you have found any duplicates, you should instead reply there.):


Is this a BUG REPORT or FEATURE REQUEST? (choose one):

NGINX Ingress controller version:
0.20.0

Kubernetes version (use kubectl version):
1.9.6

Environment:
Production

  • Cloud provider or hardware configuration:
    Azure
  • OS (e.g. from /etc/os-release):
    Ubuntu 16.04
  • Kernel (e.g. uname -a):
Linux k8s-master-81594228-1 4.13.0-1012-azure #15-Ubuntu SMP Thu Mar 8 10:47:27 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
    acs engine, ansible, terraform
  • Others:

What happened:
After upgrading the NGINX ingress controller from 0.15.0 to 0.20.0, the nginx ingress controller pods regularly crash after several timeout messages on the liveness probe. The nginx ingress controller pods run on separate VMs from all other pods. We need version 0.20.0 because we want to set use-forwarded-headers: "false" in the nginx ConfigMap to avoid the security vulnerability (a user forging the headers to bypass the nginx whitelist); a sketch of that change is shown below.
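
For reference, a rough sketch of the ConfigMap change mentioned above, assuming the controller ConfigMap is named nginx-configuration in the ingress-nginx namespace (both names are assumptions, adjust to your setup):

# Sketch only: set use-forwarded-headers in the controller ConfigMap
kubectl patch configmap nginx-configuration -n ingress-nginx \
  --type merge -p '{"data":{"use-forwarded-headers":"false"}}'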

What you expected to happen:
Stable behavior of nginx ingress controller pods as in version 0.15.0.

How to reproduce it (as minimally and precisely as possible):
Update the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.15.0 to
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0 on an existing nginx ingress controller deployment, for example as shown below.
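
For example, assuming a deployment and container both named nginx-ingress-controller in the ingress-nginx namespace (names are assumptions), the image can be swapped with:

# Sketch only: swap the controller image on an existing deployment
kubectl set image deployment/nginx-ingress-controller \
  nginx-ingress-controller=quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.20.0 \
  -n ingress-nginx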

Anything else we need to know:
Logs from the events:
2018-11-21 17:24:25 +0100 CET 2018-11-21 17:23:05 +0100 CET 6 nginx-ingress-controller-7d47db4569-9bxtz.1569303cf3aebbba Pod spec.containers{nginx-ingress-controller} Warning Unhealthy kubelet, k8s-dmz-81594228-0 Liveness probe failed: Get http://xx.xx.xx.xx:10254/healthz:: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
2018-11-21 17:24:26 +0100 CET 2018-11-21 17:24:26 +0100 CET 1 nginx-ingress-controller-7d47db4569-9bxtz.1569304fae92655c Pod spec.containers{nginx-ingress-controller} Normal Killing kubelet, k8s-dmz-81594228-0 Killing container with id docker://nginx-ingress-controller:Container failed liveness probe.. Container will be killed and recreated.

We have tried increasing the timeoutSeconds on the liveness probe to 4s and also adding - --enable-dynamic-configuration=false to the nginx deployment (a sketch follows below).
With this configuration the number of timeouts decreased, but once the apps on the platform put a certain load on the controller, the timeouts become regular again.
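
A sketch of the liveness-probe change, assuming the controller is the first container in a deployment named nginx-ingress-controller in the ingress-nginx namespace (names and container index are assumptions); the --enable-dynamic-configuration=false flag was added to the container args of the same deployment:

# Sketch only: raise the liveness probe timeout to 4s
kubectl patch deployment nginx-ingress-controller -n ingress-nginx --type json \
  -p '[{"op": "replace", "path": "/spec/template/spec/containers/0/livenessProbe/timeoutSeconds", "value": 4}]'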

Logs from the nginx pods in debug mode with a 3s timeout:
{"log":"E1121 13:30:34.808413 5 controller.go:232] Error getting ConfigMap "kube-system/udp-services": no object matching key "kube-system/udp-services" in local store\n","stream":"stderr","time":"2018-11-21T13:30:34.818557076Z"}
{"log":"I1121 13:30:37.500168 5 main.go:158] Received SIGTERM, shutting down\n","stream":"stderr","time":"2018-11-21T13:30:37.501123038Z"}
{"log":"I1121 13:30:37.500229 5 nginx.go:340] Shutting down controller queues\n","stream":"stderr","time":"2018-11-21T13:30:37.501167238Z"}
{"log":"I1121 13:30:37.500276 5 nginx.go:348] Stopping NGINX process\n","stream":"stderr","time":"2018-11-21T13:30:37.501203538Z"}

@aledbf
Member

aledbf commented Nov 22, 2018

@EcaterinaGr please post the start of the log. Maybe you are being affected by Azure/AKS#435; you should follow that issue.

@EcaterinaGr
Author

EcaterinaGr commented Nov 22, 2018

@aledbf, are you asking for the start of the events log or of the pod log itself?
Actually, before the SIGTERM we only see the normal request logs on the controllers and no other errors. The SIGTERM is sent by the kubelet (as we can see from the events), but we cannot identify the reason.
We are not using AKS; we use acs-engine, and nginx worked well on our platform with the previous version 0.15.0.

Grabbed more logs from the docker container itself before it was killed:
2018/11/21 19:30:15 [alert] 2187#2187: *84403 open socket #47 left in connection 19
2018/11/21 19:30:15 [alert] 2187#2187: aborting
W1121 19:30:22.655014 6 controller.go:773] service utm-demo/todoapi-svc-demo does not have any active endpoints
W1121 19:30:22.655191 6 controller.go:773] service aviation-data-prod/rawstorage-service does not have any active endpoints
W1121 19:30:22.655466 6 controller.go:773] service atfm-dev/innove-gateway-service does not have any active endpoints
W1121 19:30:22.657725 6 controller.go:773] service utm-dev/utm-backend-api-svc-test does not have any active endpoints
W1121 19:30:22.658510 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.658665 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.658768 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
W1121 19:30:22.659129 6 controller.go:1020] ssl certificate "aviation-data-prod/aviation-data-prod-tls" does not exist in local store
E1121 19:30:22.660605 6 controller.go:221] unexpected error reading configmap kube-system/tcp-services: configmap kube-system/tcp-services was not found
E1121 19:30:22.660670 6 controller.go:221] unexpected error reading configmap kube-system/udp-services: configmap kube-system/udp-services was not found
I1121 19:30:25.914652 6 main.go:127] Received SIGTERM, shutting down
I1121 19:30:25.914678 6 nginx.go:362] shutting down controller queues
I1121 19:30:25.914709 6 status.go:119] updating status of Ingress rules (remove)
I1121 19:30:25.954696 6 nginx.go:370] stopping NGINX process...
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1400
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1400
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1416
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:1416
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:8990
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:8990
2018/11/21 19:30:25 [warn] 2401#2401: duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:9006
nginx: [warn] duplicate network "0.0.0.0/0", value: "0", old value: "1" in /etc/nginx/nginx.conf:9006
2018/11/21 19:30:25 [notice] 2401#2401: signal process started

@Demonsthere

Hi,
I wanted to note that I had the same constant crashing with 0.20 on GCP. I updated to 0.21 just now and it is better, but the crashes still occur. This time I saw in the logs:

I1123 10:25:15.605676       9 backend_ssl.go:189] Updating local copy of SSL certificate "default/prow-tls" with missing intermediate CA certs
I1123 10:29:47.579872       9 main.go:163] Received SIGTERM, shutting down
I1123 10:29:47.579901       9 nginx.go:343] Shutting down controller queues
I1123 10:29:47.579917       9 status.go:200] updating status of Ingress rules (remove)

@fntlnz
Contributor

fntlnz commented Nov 29, 2018

Related to this: #2833

@EcaterinaGr
Author

Hello,

Any updates regarding the issue?

@Flowkap

Flowkap commented Dec 21, 2018

As soon as we add certificates (created the same way as on our dev clusters), nginx starts to crash.
As long as one instance is alive it actually works, but it does not take long for all instances to finally crash :/
We do not observe this on Kubernetes 1.11.3, but we do on 1.12.3.

@katanyoussef

Is there a solution for this bug? Thanks.

@Globegitter
Contributor

Globegitter commented Jan 18, 2019

We have been experiencing similar issues where we see:

  Warning  Unhealthy  8m5s (x12 over 20m)    kubelet, ip-xxx-eu-central-1.compute.internal  Readiness probe failed: Get http://100.96.5.100:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
  Warning  Unhealthy  8m1s (x11 over 20m)    kubelet, ip-xxx.eu-central-1.compute.internal  Liveness probe failed: Get http://100.96.5.100:10254/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

much more frequently for some of our nginx pods than we did in the past, and it is leading to public-facing errors; sometimes it manifests as certificate errors (we do TLS termination with nginx). I don't know exactly when it started, but we first noticed it on 0.20.0, and now on 0.22.0 we are still seeing these issues.

Our probe timeout is set to 1s, and we also do not see any errors in the nginx controller itself. As of now we have no idea why this is happening.

We are running on AWS with a kops cluster on k8s 1.11.6.

@ElvinEfendi
Member

@Globegitter what does memory and CPU utilization look like during that time? It could be network related too.

If it is feasible, you can also try downgrading to the earlier ingress version that worked well for you, to confirm that the problem is caused by ingress-nginx.

@Globegitter
Contributor

Globegitter commented Jan 18, 2019

@ElvinEfendi You mean specifically for these nginx pods? Or node/cluster wide? We still have to investigate further, and we have been serving quite a bit more traffic than we did a month ago. This issue just seemed interestingly fitting, but it could very well be a coincidence.

Unfortunately we are using a 0.20.0-specific feature, so downgrading is going to be a bit more difficult, but depending on how this develops we will definitely do some investigation on that next week.

@ElvinEfendi
Member

You mean specifically of these nginx pods

I guess both. I'm suggesting that to make sure the timeout is not happening because the pod is too busy doing other things and therefore cannot respond to health checks in time.

@Globegitter
Contributor

@ElvinEfendi yeah, that is a good pointer: in the last two hours or so we have only had one pod crashing a fair bit, and it is the pod with much higher CPU and memory usage than the others. That is very interesting; why would that one pod have a much higher usage than our other three pods? Furthermore, our cluster-wide traffic is below 2000 req/s, so even if all of it hit a single nginx I would still expect it to be able to respond to health probes.

@aledbf
Member

aledbf commented Jan 18, 2019

@Globegitter maybe you can help us test this by running some scripts :)

  1. Get information about the running pods:
     kubectl get pod -o wide -n ingress-nginx
  2. SSH to a node in the cluster.
  3. Build a list with the IP addresses and run this script:

POD_IPS="XX.XXX.XXX XX.XXX.XXX" # using the IPs from kubectl
while true; do
  sleep 1
  for x in $POD_IPS; do
    echo "[$(date)] http://$x:10254/healthz http status:" $(curl -I -o /dev/null -s -w "%{http_code}::%{time_namelookup}::%{time_connect}::%{time_starttransfer}::%{time_total}\n" http://$x:10254/healthz);
  done
done

This will print something like


[Fri Jan 18 15:47:33 UTC 2019] http://100.96.9.248:10254/healthz http status: 200::0.000029::0.000112::0.001546::0.001582
[Fri Jan 18 15:47:33 UTC 2019] http://100.96.3.200:10254/healthz http status: 200::0.000023::0.000817::0.002746::0.002782

This can help us check what @ElvinEfendi suggested in the previous comment.

Edit: maybe I should put this in a k8s Job to help debug this issue.
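
A rough sketch of what such a Job could look like, wrapped in a kubectl apply (the namespace, the curl image and the POD_IPS placeholder are assumptions; the loop exits after about five minutes so the Job can complete):

cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: healthz-probe-debug
  namespace: ingress-nginx
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: probe
        image: curlimages/curl:latest
        command: ["/bin/sh", "-c"]
        args:
        - |
          POD_IPS="XX.XXX.XXX XX.XXX.XXX"   # fill in from kubectl get pod -o wide
          i=0
          while [ $i -lt 300 ]; do
            i=$((i+1))
            sleep 1
            for x in $POD_IPS; do
              echo "[$(date)] http://$x:10254/healthz http status:" \
                $(curl -I -o /dev/null -s -w "%{http_code}::%{time_total}" http://$x:10254/healthz)
            done
          done
EOF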

@Globegitter
Contributor

Globegitter commented Jan 18, 2019

@aledbf Ah yeah, that is a good idea for debugging. We just raised the memory limit and the pods are re-rolling. But if the timeout issue persists or pops up again, I will make sure to use the debug script and post the results here.

Edit: And a job for that would really be useful.

@Globegitter
Contributor

Globegitter commented Jan 18, 2019

@aledbf it is still happening; here is some output:

[Fri Jan 18 16:14:08 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000031::0.003766::0.011774::0.011802
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.29.19:10254/healthz http status: 000::0.000026::0.000000::0.000000::0.000084
[Fri Jan 18 16:14:08 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.000289::2.548471::2.548508
[Fri Jan 18 16:14:10 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000719::0.010645::0.010668
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000742::0.002636::0.002657
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000024::0.000305::0.001203::0.001222
[Fri Jan 18 16:14:12 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000017::0.000239::2.434425::2.434456
[Fri Jan 18 16:14:14 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000700::0.003581::0.003603
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000023::0.000735::0.002499::0.002522
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000260::0.004873::0.004893
[Fri Jan 18 16:14:15 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000015::0.000253::2.936678::2.936709
[Fri Jan 18 16:14:18 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000024::0.000730::0.005554::0.005575
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000835::0.004889::0.004912
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000029::0.000293::0.004889::0.004909
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000015::0.000245::0.000943::0.000963
[Fri Jan 18 16:14:19 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000021::0.001020::0.003824::0.003846
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000713::0.004364::0.004391
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000036::0.000363::0.001800::0.001818
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000263::0.001097::0.001117
[Fri Jan 18 16:14:20 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000684::0.002500::0.002515
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000019::0.000715::0.002270::0.002295
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.002707::0.005841::0.005860
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000243::0.000742::0.000760
[Fri Jan 18 16:14:21 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000016::0.004325::0.006078::0.006114
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000029::0.000797::0.007521::0.007544
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000019::0.000263::0.003320::0.003340
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000016::0.000252::0.001245::0.001263
[Fri Jan 18 16:14:22 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.000680::0.002750::0.002767
[Fri Jan 18 16:14:23 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000723::0.002503::0.002525
[Fri Jan 18 16:14:23 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000016::0.000238::1.813099::1.813174
[Fri Jan 18 16:14:25 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000017::0.000262::0.001793::0.001815
[Fri Jan 18 16:14:25 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000021::0.000799::0.002882::0.002901
[Fri Jan 18 16:14:26 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000738::0.002451::0.002471
[Fri Jan 18 16:14:26 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000259::3.507546::3.507583
[Fri Jan 18 16:14:29 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000029::0.000321::0.000890::0.000914
[Fri Jan 18 16:14:29 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000026::0.000910::0.004054::0.004078
[Fri Jan 18 16:14:30 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000028::0.000781::0.158132::0.158176
[Fri Jan 18 16:14:31 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000326::2.694099::2.694130
[Fri Jan 18 16:14:33 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000023::0.000311::0.004423::0.004448
[Fri Jan 18 16:14:33 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000742::0.003229::0.003247
[Fri Jan 18 16:14:34 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000024::0.000782::0.004251::0.004272
[Fri Jan 18 16:14:34 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000017::0.000241::2.955700::2.955730
[Fri Jan 18 16:14:37 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000019::0.000262::0.001195::0.001246
[Fri Jan 18 16:14:37 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.001047::0.004546::0.004565
[Fri Jan 18 16:14:38 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000712::0.002325::0.002363
[Fri Jan 18 16:14:38 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000268::2.350015::2.350049
[Fri Jan 18 16:14:41 UTC 2019] http://100.96.29.81:10254/healthz http status: 000::0.000023::0.000000::0.000000::0.000069
[Fri Jan 18 16:14:41 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000712::0.005419::0.005438
[Fri Jan 18 16:14:42 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.001377::0.004157::0.004178
[Fri Jan 18 16:14:42 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000015::0.000240::2.529109::2.529148
[Fri Jan 18 16:14:44 UTC 2019] http://100.96.29.81:10254/healthz http status: 500::0.000019::0.000258::0.000976::0.000996
[Fri Jan 18 16:14:44 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000017::0.000737::0.007455::0.007473
[Fri Jan 18 16:14:45 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000021::0.000763::0.003782::0.003808
[Fri Jan 18 16:14:45 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000015::0.000321::1.605230::1.605260
[Fri Jan 18 16:14:47 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000019::0.000263::0.003015::0.003037
[Fri Jan 18 16:14:47 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000039::0.000741::0.006094::0.006114
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000708::0.002475::0.002499
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000017::0.000250::0.000817::0.000833
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000040::0.000250::0.008481::0.008504
[Fri Jan 18 16:14:48 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000014::0.000694::0.004170::0.004189
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000017::0.000761::0.008205::0.008231
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000019::0.000414::0.001783::0.001802
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000014::0.000211::0.504140::0.504379
[Fri Jan 18 16:14:49 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000023::0.000717::0.004022::0.004043
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000740::0.002447::0.002472
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000018::0.000275::0.001992::0.002008
[Fri Jan 18 16:14:50 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000020::0.000231::3.389365::3.389403
[Fri Jan 18 16:14:54 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000022::0.000727::0.004877::0.004900
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000815::0.002551::0.002571
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000016::0.000264::0.000845::0.000871
[Fri Jan 18 16:14:55 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000016::0.000233::2.998702::2.998749
[Fri Jan 18 16:14:58 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000024::0.000761::0.011797::0.011820
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000018::0.000724::0.002941::0.002962
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000021::0.000262::0.001866::0.001886
[Fri Jan 18 16:14:59 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000025::0.000298::2.891728::2.891762
[Fri Jan 18 16:15:02 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000716::0.002861::0.002882
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000025::0.002314::0.006595::0.006644
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000027::0.001592::0.002399::0.002424
[Fri Jan 18 16:15:03 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000537::3.195221::3.195255
[Fri Jan 18 16:15:06 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000018::0.000738::0.002648::0.002672
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000022::0.000738::0.003204::0.003243
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000019::0.000262::0.002440::0.002463
[Fri Jan 18 16:15:07 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000020::0.000231::2.069995::2.070088
[Fri Jan 18 16:15:09 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000019::0.000710::0.002243::0.002261
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000766::0.020720::0.020747
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.29.19:10254/healthz http status: 000::0.000019::0.000000::0.000000::0.000069
[Fri Jan 18 16:15:10 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000019::0.000242::2.151019::2.151056
[Fri Jan 18 16:15:12 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000019::0.000688::0.002563::0.002586
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000727::0.008094::0.008117
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.29.19:10254/healthz http status: 500::0.000016::0.000286::0.001543::0.001564
[Fri Jan 18 16:15:13 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000014::0.000244::1.980282::1.980323
[Fri Jan 18 16:15:15 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000055::0.000778::0.004642::0.004668
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000025::0.000897::0.003045::0.003076
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000323::0.006020::0.006047
[Fri Jan 18 16:15:16 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.000308::2.500127::2.500182
[Fri Jan 18 16:15:19 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000020::0.000723::0.007644::0.007663
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000028::0.000737::0.002864::0.002886
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000022::0.000280::0.003027::0.003052
[Fri Jan 18 16:15:20 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000303::1.673258::1.673299
[Fri Jan 18 16:15:22 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000034::0.002531::0.004833::0.004939
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000031::0.000776::0.007756::0.007789
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000026::0.000483::0.005270::0.005308
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000028::0.000281::0.274060::0.274095
[Fri Jan 18 16:15:23 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000025::0.000857::0.020886::0.020911
[Fri Jan 18 16:15:24 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000024::0.000709::0.002947::0.002970
[Fri Jan 18 16:15:24 UTC 2019] http://100.96.29.19:10254/healthz http status: 200::0.000018::0.000241::3.363062::3.363098
[Fri Jan 18 16:15:27 UTC 2019] http://100.96.29.81:10254/healthz http status: 200::0.000022::0.001813::0.009423::0.009447
[Fri Jan 18 16:15:27 UTC 2019] http://100.96.5.156:10254/healthz http status: 200::0.000023::0.000943::0.037287::0.037332
[Fri Jan 18 16:15:28 UTC 2019] http://100.96.3.45:10254/healthz http status: 200::0.000020::0.000745::0.003082::0.003106

Some of the 500s mixed in are from when the pod was taken out of service by the k8s probes, but it is interesting to see that health checks sometimes take up to 3.5s.

@aledbf
Member

aledbf commented Jan 18, 2019

@Globegitter please check the log of the pod with IP 100.96.29.19.

@Globegitter
Contributor

Just checked: we do not have a logging collector active for the nginx logs yet, so I only see part of the logs of the previous pod that errored. I do see a few entries like:

2019/01/18 16:34:00 [error] 139#139: *12429 upstream prematurely closed connection while reading response header from upstream, client: xxxxxx, server: xxxxx, request: "GET /about HTTP/2.0", upstream: "http://100.96.29.102:8001/about", host: "xxxxx", referrer: "xxxxxxx"

and at the end:

I0118 16:34:08.456836       7 nginx.go:385] NGINX process has stopped
I0118 16:34:08.456938       7 main.go:175] Handled quit, awaiting Pod deletion
I0118 16:34:18.457336       7 main.go:178] Exiting with 0

Otherwise it is just 900+ lines of normal JSON request logs.

I can take a look next week at exporting the logs, or tailing them in real time. Would you expect to see something in the error log, or just the normal response logging?

@aledbf
Member

aledbf commented Jan 21, 2019

This issue is hard to reproduce. For this reason I created #3684 to see if we can narrow the scope of the issue: basically, I added additional logging to the health check inside the ingress controller to detect where we receive an error.

Please, help us using the image quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev

Note: for security reasons, the probes do not show the exact cause of the failure unless you increase the kubelet log level to a value higher than five (this is why I added the additional logging).

Note 2: for the same reason, this PR will never be merged; it exists only to track down the cause.

@Globegitter
Contributor

@aledbf Thanks, I will start investigating now. On Friday we were able to work around the issue by increasing the number of replicas, but it has started happening again. I will test this image.

@Globegitter
Contributor

So, we have some output:

E0121 15:17:44.742314       7 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
...
...
I0121 15:17:48.427437       7 nginx.go:385] NGINX process has stopped
I0121 15:17:48.428097       7 main.go:175] Handled quit, awaiting Pod deletion
E0121 15:17:54.742273       7 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
I0121 15:17:58.428663       7 main.go:178] Exiting with 0

for another pod:

E0121 15:15:06.599845       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
...
...
I0121 15:15:09.895122       6 nginx.go:385] NGINX process has stopped
I0121 15:15:09.895267       6 main.go:175] Handled quit, awaiting Pod deletion
E0121 15:15:16.600000       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
I0121 15:15:19.896690       6 main.go:178] Exiting with 0

But again, that is just from the logs I could tail; at least it gives us something more. I am also currently tailing all running pods with a grep for healthcheck, so if this happens again we should see the relevant logs there.

@aledbf
Member

aledbf commented Jan 21, 2019

Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080

This is "normal" because the health check starts before nginx. You should see only one of this error in the log.

E0121 15:17:44.742314
....
E0121 15:17:54.742273

This is not OK: the two errors are ten seconds apart, which means the ingress controller startup (the nginx binary) takes more than ten seconds for the initial sync. That is not normal.
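
One way to confirm how long the initial sync takes is to reuse the probe loop from earlier, timing from a node how long it takes for /healthz to first return 200 after a fresh pod starts (a rough sketch; POD_IP is a placeholder):

POD_IP=XX.XXX.XXX   # IP of a freshly started controller pod, from kubectl get pod -o wide
start=$(date +%s)
until [ "$(curl -o /dev/null -s -w '%{http_code}' http://$POD_IP:10254/healthz)" = "200" ]; do
  sleep 1
done
echo "healthz first returned 200 after $(( $(date +%s) - start ))s"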

@Globegitter
Contributor

Globegitter commented Jan 21, 2019

Yep, all logs show the same:

E0121 15:37:26.600460       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:36.601935       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:46.600979       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:37:56.600502       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller E0121 15:38:06.600290       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-ggn46 nginx-ingress-controller public-nginx-ingress-controller-6d4c798cf6-tt7z9 E0121 15:38:16.600088       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-kll8s nginx-ingress-controller E0121 15:38:35.487322       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
E0121 15:38:45.536244       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
E0121 15:38:55.487948       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused
public-nginx-ingress-controller-6d4c798cf6-kll8s nginx-ingress-controller E0121 15:39:05.487299       6 checker.go:45] healthcheck error: Get http://127.0.0.1:18080/healthz: dial tcp 127.0.0.1:18080: connect: connection refused

Here again, I just looked over the logs of all the running pods and just grepped for healthcheck.

Edit: Ah, interesting. I will try to keep an eye specifically on startup; some of these definitely happen on instances that have been running well for a while, but we have not looked at startup "issues" yet.

@aledbf
Member

aledbf commented Jan 21, 2019

@Globegitter please check and post the generated nginx.conf

kubectl exec -n ingress-nginx <pod> -it cat nginx.conf | grep 18080

Is IPv6 enabled (or only IPv6)?
What CNI provider are you using?

@aledbf
Member

aledbf commented Jan 25, 2019

To those affected by this issue:

Please help us test a fix for this (#3684) using the image
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev

The mentioned PR contains a refactoring of the nginx server used for the health check and Lua configuration, replacing the TCP port with a Unix socket.
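
For anyone testing the dev image, a sketch of how the endpoint could be probed manually from inside a controller pod; the socket path below and the availability of curl in the image are assumptions, not something confirmed in this thread:

# Sketch only: probe the health endpoint over the unix socket (path is an assumption)
kubectl exec -n ingress-nginx <pod> -- \
  curl -sS --unix-socket /tmp/nginx-status-server.sock http://localhost/healthz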

@mdgreenwald

Hi, we were affected by this issue. We saw that 150 connections per second would cause our ingress controller to restart. When we checked, it was failing health checks just as described in this issue. We would sometimes see timeouts as high as 10.x seconds.

We tested again with:
quay.io/kubernetes-ingress-controller/nginx-ingress-controller:dev

This dev image fully resolved our issue; we are no longer seeing restarts at high connection rates.

@sslavic
Contributor

sslavic commented Jul 6, 2020

Do I get it right that #4531 (shipped with 0.26.0) practically reverted the changes made for this issue in #3684 (shipped with 0.23.0)?

@aledbf
Member

aledbf commented Jul 6, 2020

@sslavic yes

@sslavic
Contributor

sslavic commented Jul 6, 2020

This could improve the situation in #5832 once released.
