Ingress controller Panic while reconciling ingresses #11661

Open
rsafonseca opened this issue Jul 20, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-priority triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@rsafonseca
Contributor

rsafonseca commented Jul 20, 2024

What happened:

A single controller pod crashed a few times in a row, with the following stack trace (running version 1.9.5)


I0720 12:07:50.344445       7 store.go:440] "Found valid IngressClass" ingress="<REDACTED NAMESPACE>/<REDACTED INGRESS NAME>" ingressclass="nginx"
E0720 12:07:50.344653       7 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 193 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1904b80?, 0x2ba0fa0})
	k8s.io/apimachinery@v0.27.6/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc002baecc8?})
	k8s.io/apimachinery@v0.27.6/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1904b80?, 0x2ba0fa0?})
	runtime/panic.go:914 +0x21f
k8s.io/ingress-nginx/internal/ingress/controller.mergeAlternativeBackends(0xc002da6c00, 0xc002e52a80?, 0xc003232e80?)
	k8s.io/ingress-nginx/internal/ingress/controller/controller.go:1654 +0x6e2
k8s.io/ingress-nginx/internal/ingress/controller.(*NGINXController).getBackendServers(0xc0000510a0, {0xc000500400?, 0x49, 0x80})
	k8s.io/ingress-nginx/internal/ingress/controller/controller.go:898 +0x14f0
k8s.io/ingress-nginx/internal/ingress/controller.(*NGINXController).getConfiguration(0xc0000510a0, {0xc000500400, 0x49, 0x20?})
	k8s.io/ingress-nginx/internal/ingress/controller/controller.go:609 +0x45
k8s.io/ingress-nginx/internal/ingress/controller.(*NGINXController).syncIngress(0xc0000510a0, {0x184c6e0, 0x1})
	k8s.io/ingress-nginx/internal/ingress/controller/controller.go:177 +0x89
k8s.io/ingress-nginx/internal/task.(*Queue).worker(0xc000601830)
	k8s.io/ingress-nginx/internal/task/queue.go:130 +0x542
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	k8s.io/apimachinery@v0.27.6/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00009c420?, {0x1e1f000, 0xc002b89710}, 0x1, 0xc000061680)
	k8s.io/apimachinery@v0.27.6/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00045bfd0?, 0x3b9aca00, 0x0, 0x0?, 0x100000000000000?)
	k8s.io/apimachinery@v0.27.6/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	k8s.io/apimachinery@v0.27.6/pkg/util/wait/backoff.go:161
k8s.io/ingress-nginx/internal/task.(*Queue).Run(0x0?, 0x0?, 0x1c8d8f8?)
	k8s.io/ingress-nginx/internal/task/queue.go:59 +0x3a
created by k8s.io/ingress-nginx/internal/ingress/controller.(*NGINXController).Start in goroutine 89
	k8s.io/ingress-nginx/internal/ingress/controller/nginx.go:315 +0x3a5
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x16900a2]

Since we're running 1.9.5, the panic appears to happen on this line (controller.go:1654), where the obvious culprit is priUps being nil: altUps has a nil check a few lines above, but priUps does not.
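
For illustration only (this is not the controller's actual code), here is a minimal, self-contained Go sketch of the suspected failure mode: looking up a missing key in a map of pointers returns nil, and accessing a field through that nil pointer panics exactly like the trace above. The Backend type and key names are hypothetical stand-ins for the controller's own types.

```go
package main

import "fmt"

// Backend is a hypothetical stand-in for the controller's backend/upstream type.
type Backend struct {
	Name string
}

func main() {
	upstreams := map[string]*Backend{
		"default-demo-svc-80": {Name: "default-demo-svc-80"},
	}

	// The alternative backend lookup is guarded by a nil check in the real code...
	altUps := upstreams["default-demo-svc-80"]
	if altUps == nil {
		return
	}

	// ...but the primary backend lookup is not: a missing key returns a nil
	// *Backend, and the field access below panics with
	// "invalid memory address or nil pointer dereference".
	priUps := upstreams["missing-backend"]
	fmt.Println(altUps.Name == priUps.Name)
}
```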

What you expected to happen:

The ingress controller not crashing.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.9.5

Kubernetes version (use kubectl version): 1.27.11

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu
  • Kernel (e.g. uname -a): 6.5.0-1022-aws
  • Install tools:
    • Please mention how/where the cluster was created (kubeadm/kops/minikube/kind, etc.)
  • Basic cluster related info:
    • kubectl version
    • kubectl get nodes -o wide

Anything else we need to know:

No changes were happening on the ingress that triggered this: no pod rotation or any config change; it happened during a normal sync. Oddly, it happened multiple times, but only on a single controller pod out of the 3 running.

@rsafonseca rsafonseca added the kind/bug Categorizes issue or PR as related to a bug. label Jul 20, 2024
@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Jul 20, 2024
@longwuyuan
Contributor

/remove-kind bug

Can you try to reproduce this on a kind cluster or a minikube cluster? Thanks.

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 20, 2024
@longwuyuan
Contributor

/kind support

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jul 20, 2024
@rsafonseca
Contributor Author

No, I can't reproduce it. We run ingress in dozens of clusters and haven't seen this before, and I'm unsure what caused the issue.
I'm not asking for support here, I'm reporting a bug: this is an NPE that resulted in crashes, so I don't think that label swap was correct, @longwuyuan.

@longwuyuan
Contributor

Hi @rsafonseca,
In meetings, readers look for triage information that is proof of the bug; that is how resources are allocated, hence the change.

If you want, you can change the label.

Hoping we get some actionable data and, ideally, a reproduction procedure.

@rikatz
Contributor

rikatz commented Jul 21, 2024

Hmm, looking at the code this seems like a very weird but valid issue.

https://github.com/kubernetes/ingress-nginx/blob/controller-v1.9.5/internal/ingress/controller/controller.go#L1654

I can see it does a comparison between entries of two maps, and maybe one of them is nil?

@rsafonseca can you provide a bit more information on what ingress objects you have?

@rsafonseca
Contributor Author

rsafonseca commented Jul 21, 2024

@longwuyuan that makes sense for behavioral bugs; for an NPE that causes a panic, it's pretty straightforward that it's a bug, as this should never happen, let alone lead to a crash.

@rikatz It can only be one of the maps, since the other one has a nil check a few lines above. I didn't have time to follow the code (yet), as this only came up this weekend, and I was hoping someone with context on that map could hint at why it might be getting into a nil state (maybe a silently failed kube-api call or something like that).

I have literally hundreds of ingresses in this cluster; it would take forever to make (and redact) a full dump, and it's not likely that the issue is related to the ingresses' content, since at least for the ingress indicated above there were no changes (including on endpoints) and it has existed for nearly a year. I'll try to check tomorrow whether this happened only on a single ingress or on random ingresses (which I suspect). It affected only a single controller pod, which is odd, so I suppose it might have been caused by some transient network issue on the host (e.g. failed kube-api calls), but for now this is mere conjecture.

At worst, if the root cause isn't easily found, it might be worth adding an extra nil check for the offending lookup to avoid a crash.
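
To make the suggestion above concrete, here is a rough, self-contained sketch of that kind of guard; the Backend type, function name, and key names are hypothetical stand-ins, not the controller's actual code or the eventual patch.

```go
package main

import "log"

// Backend is a hypothetical stand-in for the controller's backend type.
type Backend struct {
	Name                string
	NoServer            bool
	AlternativeBackends []string
}

// mergeAlternative sketches the suggested guard: if the primary upstream for a
// location is missing from the map, log and skip instead of dereferencing nil.
func mergeAlternative(upstreams map[string]*Backend, locBackend string, altUps *Backend) {
	priUps := upstreams[locBackend]
	if priUps == nil {
		log.Printf("cannot find primary upstream %q, skipping alternative backend merge", locBackend)
		return
	}
	if priUps.Name == altUps.Name {
		return
	}
	altUps.NoServer = true
	priUps.AlternativeBackends = append(priUps.AlternativeBackends, altUps.Name)
}

func main() {
	upstreams := map[string]*Backend{} // primary upstream deliberately missing
	alt := &Backend{Name: "default-canary-svc-80"}
	mergeAlternative(upstreams, "missing-backend", alt) // logs and returns, no panic
}
```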

@rikatz
Contributor

rikatz commented Jul 21, 2024

Yeah, if you can send the PR to check this map, I think it would be great.

@longwuyuan
Contributor

/kind bug
/remove-kind support
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed kind/support Categorizes issue or PR as a support question. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 22, 2024
@longwuyuan
Contributor

/remove-triage needs-information

@k8s-ci-robot k8s-ci-robot removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Jul 22, 2024

This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any question or a request to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 22, 2024