Recover from reconciler panics #4332

Open
guydc opened this issue Sep 25, 2024 · 1 comment
Labels
area/provider provider/kubernetes Issues related to the Kubernetes provider stale


guydc commented Sep 25, 2024

Description:
Currently, a panic in the reconciliation flow of Envoy Gateway will lead to EG crashing: #4291, #2661, #1830, #2882.

Controller frameworks such as controller-runtime and apimachinery provide the means to recover from panics.

In the context of Envoy Gateway, a reconciliation crash would have several undesired side effects:

  • last-known-good XDS caches would be deleted and not recovered after a restart
  • the infra manager would be interrupted during infra reconciliation, possibly leaving an inconsistent infra state in which only some changes are applied

If a crash occurs during an upgrade, there is a risk that Envoy proxies would be replaced (e.g. due to a new proxy version being used) while the control plane provides no configuration, leading to a complete outage for users.

Envoy Gateway should consider recovering from panics by default, or allowing users to opt in to panic recovery. If implemented, metrics should be provided so that operators are made aware that XDS translation is broken.
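One way the opt-in and the metric could fit together is sketched below, again with the standard library only. `Runner`, `RecoverPanic`, and `PanicCount` are hypothetical names for this example, and the atomic counter stands in for whatever metrics mechanism would actually be used.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Runner wraps a translation step with optional panic recovery.
// All names here are hypothetical, not Envoy Gateway types.
type Runner struct {
	RecoverPanic bool         // opt-in switch; when false, panics propagate
	panics       atomic.Int64 // recovered panics; would back a metric
}

// Run executes translate. With RecoverPanic set, a panic is counted and
// returned as an error instead of crashing the process.
func (r *Runner) Run(translate func() error) (err error) {
	if r.RecoverPanic {
		defer func() {
			if p := recover(); p != nil {
				r.panics.Add(1)
				err = fmt.Errorf("translation panicked: %v", p)
			}
		}()
	}
	return translate()
}

// PanicCount reports how many panics have been recovered so far.
func (r *Runner) PanicCount() int64 { return r.panics.Load() }

func main() {
	r := &Runner{RecoverPanic: true}
	_ = r.Run(func() error { panic("bad route") })
	_ = r.Run(func() error { return nil })
	fmt.Println("recovered panics:", r.PanicCount()) // prints "recovered panics: 1"
}
```

A counter that only increases is easy to alert on: any non-zero rate means translation is failing even though the process is still up.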

@guydc guydc added provider/kubernetes Issues related to the Kubernetes provider area/provider labels Sep 25, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Oct 25, 2024