Ingress controller Panic while reconciling ingresses #11661

Open
rsafonseca opened this issue Jul 20, 2024 · 10 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. needs-priority triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@rsafonseca
Contributor

rsafonseca commented Jul 20, 2024

What happened:

A single controller pod crashed a few times in a row, with the following stack trace (running version 1.9.5)


I0720 12:07:50.344445       7 store.go:440] "Found valid IngressClass" ingress="<REDACTED NAMESPACE>/<REDACTED INGRESS NAME>" ingressclass="nginx"
E0720 12:07:50.344653       7 runtime.go:79] Observed a panic: "invalid memory address or nil pointer dereference" (runtime error: invalid memory address or nil pointer dereference)
goroutine 193 [running]:
k8s.io/apimachinery/pkg/util/runtime.logPanic({0x1904b80?, 0x2ba0fa0})
	k8s.io/apimachinery@v0.27.6/pkg/util/runtime/runtime.go:75 +0x85
k8s.io/apimachinery/pkg/util/runtime.HandleCrash({0x0, 0x0, 0xc002baecc8?})
	k8s.io/apimachinery@v0.27.6/pkg/util/runtime/runtime.go:49 +0x6b
panic({0x1904b80?, 0x2ba0fa0?})
	runtime/panic.go:914 +0x21f
k8s.io/ingress-nginx/internal/ingress/controller.mergeAlternativeBackends(0xc002da6c00, 0xc002e52a80?, 0xc003232e80?)
	k8s.io/ingress-nginx/internal/ingress/controller/controller.go:1654 +0x6e2
k8s.io/ingress-nginx/internal/ingress/controller.(*NGINXController).getBackendServers(0xc0000510a0, {0xc000500400?, 0x49, 0x80})
	k8s.io/ingress-nginx/internal/ingress/controller/controller.go:898 +0x14f0
k8s.io/ingress-nginx/internal/ingress/controller.(*NGINXController).getConfiguration(0xc0000510a0, {0xc000500400, 0x49, 0x20?})
	k8s.io/ingress-nginx/internal/ingress/controller/controller.go:609 +0x45
k8s.io/ingress-nginx/internal/ingress/controller.(*NGINXController).syncIngress(0xc0000510a0, {0x184c6e0, 0x1})
	k8s.io/ingress-nginx/internal/ingress/controller/controller.go:177 +0x89
k8s.io/ingress-nginx/internal/task.(*Queue).worker(0xc000601830)
	k8s.io/ingress-nginx/internal/task/queue.go:130 +0x542
k8s.io/apimachinery/pkg/util/wait.BackoffUntil.func1(0x30?)
	k8s.io/apimachinery@v0.27.6/pkg/util/wait/backoff.go:226 +0x33
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(0xc00009c420?, {0x1e1f000, 0xc002b89710}, 0x1, 0xc000061680)
	k8s.io/apimachinery@v0.27.6/pkg/util/wait/backoff.go:227 +0xaf
k8s.io/apimachinery/pkg/util/wait.JitterUntil(0xc00045bfd0?, 0x3b9aca00, 0x0, 0x0?, 0x100000000000000?)
	k8s.io/apimachinery@v0.27.6/pkg/util/wait/backoff.go:204 +0x7f
k8s.io/apimachinery/pkg/util/wait.Until(...)
	k8s.io/apimachinery@v0.27.6/pkg/util/wait/backoff.go:161
k8s.io/ingress-nginx/internal/task.(*Queue).Run(0x0?, 0x0?, 0x1c8d8f8?)
	k8s.io/ingress-nginx/internal/task/queue.go:59 +0x3a
created by k8s.io/ingress-nginx/internal/ingress/controller.(*NGINXController).Start in goroutine 89
	k8s.io/ingress-nginx/internal/ingress/controller/nginx.go:315 +0x3a5
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x16900a2]

Since we're running 1.9.5, the panic appears to happen on this line (controller.go:1654), where the obvious culprit is priUps being nil: altUps has a nil check a few lines above, but priUps does not.
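
For illustration only (this is not the controller's actual code), here is a minimal, self-contained Go sketch of the suspected failure mode: looking up a missing key in a map of pointers returns nil, and accessing a field through that nil pointer panics exactly like the trace above. The Backend type and key names are hypothetical stand-ins for the controller's own types.

```go
package main

import "fmt"

// Backend is a hypothetical stand-in for the controller's backend/upstream type.
type Backend struct {
	Name string
}

func main() {
	upstreams := map[string]*Backend{
		"default-demo-svc-80": {Name: "default-demo-svc-80"},
	}

	// The alternative backend lookup is guarded by a nil check in the real code...
	altUps := upstreams["default-demo-svc-80"]
	if altUps == nil {
		return
	}

	// ...but the primary backend lookup is not: a missing key returns a nil
	// *Backend, and the field access below panics with
	// "invalid memory address or nil pointer dereference".
	priUps := upstreams["missing-backend"]
	fmt.Println(altUps.Name == priUps.Name)
}
```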

What you expected to happen:

The ingress controller not crashing.

NGINX Ingress controller version (exec into the pod and run nginx-ingress-controller --version.): 1.9.5

Kubernetes version (use kubectl version): 1.27.11

Environment:

  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release): Ubuntu
  • Kernel (e.g. uname -a): 6.5.0-1022-aws
  • Install tools:
    • Please mention how/where the cluster was created (kubeadm/kops/minikube/kind, etc.)
  • Basic cluster related info:
    • kubectl version
    • kubectl get nodes -o wide

Anything else we need to know:

No changes were happening on the ingress that triggered this: no pod rotation or any config change; it happened during a normal sync. Oddly, it happened multiple times, but only on a single controller pod out of the 3 running.

@rsafonseca rsafonseca added the kind/bug Categorizes issue or PR as related to a bug. label Jul 20, 2024
@k8s-ci-robot k8s-ci-robot added needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority labels Jul 20, 2024
@longwuyuan
Contributor

/remove-kind bug

Can you try to reproduce this on a kind cluster or a minikube cluster? Thanks.

/triage needs-information

@k8s-ci-robot k8s-ci-robot added triage/needs-information Indicates an issue needs more information in order to work on it. needs-kind Indicates a PR lacks a `kind/foo` label and requires one. and removed kind/bug Categorizes issue or PR as related to a bug. labels Jul 20, 2024
@longwuyuan
Contributor

/kind support

@k8s-ci-robot k8s-ci-robot added kind/support Categorizes issue or PR as a support question. and removed needs-kind Indicates a PR lacks a `kind/foo` label and requires one. labels Jul 20, 2024
@rsafonseca
Contributor Author

No, I can't reproduce it. We run ingress in dozens of clusters and haven't seen this before, and I'm unsure what caused the issue.
I'm not asking for support here, I'm reporting a bug: this is an NPE that resulted in crashes, so I don't think that label swap was correct, @longwuyuan.

@longwuyuan
Contributor

Hi @rsafonseca,
In meetings, readers look for triage information that is proof of the bug; that is how resources are allocated, hence the change.

If you want, you can change the label.

Hoping we get some actionable data and, ideally, a reproduction procedure.

@rikatz
Contributor

rikatz commented Jul 21, 2024

Hmm, looking at the code this seems like a very weird but valid issue.

https://github.com/kubernetes/ingress-nginx/blob/controller-v1.9.5/internal/ingress/controller/controller.go#L1654

I can see it does a comparison between entries of two maps, and maybe one of them is nil?

@rsafonseca can you provide a bit more information on what ingress objects you have?

@rsafonseca
Contributor Author

rsafonseca commented Jul 21, 2024

@longwuyuan that makes sense for behavioral bugs; for an NPE that causes a panic, it's pretty straightforward that it's a bug, as this should never happen, let alone lead to a crash.

@rikatz It can only be one of the maps, since the other one has a nil check a few lines above. I didn't have time to follow the code (yet), as this only came up this weekend, and I was hoping someone with context on that map could hint at why it might be getting into a nil state (maybe a silently failed kube-api call or something like that).

I have literally hundreds of ingresses in this cluster; it would take forever to make (and redact) a full dump, and it's not likely that the issue is related to the ingresses' content, since at least for the ingress indicated above there were no changes (including on endpoints) and it has existed for nearly a year. I'll try to check tomorrow whether this happened only on a single ingress or on random ingresses (which I suspect). It affected only a single controller pod, which is odd, so I suppose it might have been caused by some transient network issue on the host (e.g. failed kube-api calls), but for now this is mere conjecture.

At worst, if the root cause isn't easily found, it might be worth adding an extra nil check for the offending lookup to avoid a crash.
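
To make the suggestion above concrete, here is a rough, self-contained sketch of that kind of guard; the Backend type, function name, and key names are hypothetical stand-ins, not the controller's actual code or the eventual patch.

```go
package main

import "log"

// Backend is a hypothetical stand-in for the controller's backend type.
type Backend struct {
	Name                string
	NoServer            bool
	AlternativeBackends []string
}

// mergeAlternative sketches the suggested guard: if the primary upstream for a
// location is missing from the map, log and skip instead of dereferencing nil.
func mergeAlternative(upstreams map[string]*Backend, locBackend string, altUps *Backend) {
	priUps := upstreams[locBackend]
	if priUps == nil {
		log.Printf("cannot find primary upstream %q, skipping alternative backend merge", locBackend)
		return
	}
	if priUps.Name == altUps.Name {
		return
	}
	altUps.NoServer = true
	priUps.AlternativeBackends = append(priUps.AlternativeBackends, altUps.Name)
}

func main() {
	upstreams := map[string]*Backend{} // primary upstream deliberately missing
	alt := &Backend{Name: "default-canary-svc-80"}
	mergeAlternative(upstreams, "missing-backend", alt) // logs and returns, no panic
}
```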

@rikatz
Contributor

rikatz commented Jul 21, 2024

Yeah, if you can send the PR to check this map, I think it would be great.

@longwuyuan
Contributor

/kind bug
/remove-kind support
/triage accepted

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed kind/support Categorizes issue or PR as a support question. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 22, 2024
@longwuyuan
Contributor

/remove-triage needs-information

@k8s-ci-robot k8s-ci-robot removed the triage/needs-information Indicates an issue needs more information in order to work on it. label Jul 22, 2024

This is stale, but we won't close it automatically; just bear in mind that the maintainers may be busy with other tasks and will get to your issue as soon as possible. If you have any question or a request to prioritize this, please reach out on #ingress-nginx-dev on Kubernetes Slack.

@github-actions github-actions bot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Aug 22, 2024