E2E Zero DownTime Test when upgrading Envoy Proxy Versions #2610

Closed
arkodg opened this issue Feb 14, 2024 · 1 comment · Fixed by #2839

Comments

@arkodg
Contributor

arkodg commented Feb 14, 2024

I executed a naive test:

  • Environment: kind, metallb, EG quickstart.yaml
  • envoy proxy replicas: 2
  • upgrade: 0.6.0 => 0.0.0-latest using helm upgrade
  • load simulation during upgrade: hey -c 100 -q 10 -z 300s -host www.example.com http://172.18.255.200/

The upgrade caused some client-facing failures during the test:

Error distribution:
  [8]	Get "http://172.18.255.200/": EOF
  [32]	Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
  [1]	Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
  [1]	Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer
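
To put these counts in perspective, here is a rough sketch that tallies the error distribution above by failure reason and computes an upper bound on the number of requests the run could have sent. It assumes hey's `-q` is a per-worker rate limit, so `-c 100 -q 10 -z 300s` allows at most 100 × 10 × 300 requests. The actual request total is not reported in this issue, so no failure rate is claimed here. The helper name `tally_errors` is made up for illustration.

```python
import re
from collections import Counter

# Error lines copied from the hey output in this issue.
ERROR_LINES = """\
[8] Get "http://172.18.255.200/": EOF
[32] Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
[1] Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
[1] Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer
"""


def tally_errors(text):
    """Group hey error lines by the trailing failure reason."""
    counts = Counter()
    for line in text.splitlines():
        match = re.match(r"\s*\[(\d+)\]\s+(.*)", line)
        if not match:
            continue
        count, message = int(match.group(1)), match.group(2)
        # The reason is whatever follows the last ": " in the message.
        reason = message.rsplit(": ", 1)[-1]
        counts[reason] += count
    return counts


errors = tally_errors(ERROR_LINES)
total_errors = sum(errors.values())

# Upper bound on requests sent, ASSUMING -q is a per-worker rate limit.
max_requests = 100 * 10 * 300

for reason, count in errors.most_common():
    print(f"{count:>3}  {reason}")
print(f"total errors: {total_errors}, request upper bound: {max_requests}")
```

Grouped this way, connection refusals account for 32 of the 42 failures, with 8 EOFs and 2 connection resets.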

It's probably possible to tune some of the parameters mentioned in my previous comment to achieve a hitless upgrade under certain test conditions (RPS, connection reuse, HTTP version, ...). But I'm not sure that we can claim to have a hitless upgrade in general based on such a test.

So, I propose that for the GA scope, we focus on an upgrade test that ensures requests converge to successful execution after the upgrade. A limited hitless upgrade test can be a stretch goal.
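
One possible reading of "request convergence", sketched as an assumption and not as the test that was eventually implemented: failures during the upgrade are tolerated, but every request observed after the upgrade completes (plus an optional settle period) must succeed. The function name `converged`, the `(timestamp, ok)` sample shape, and the `settle_seconds` parameter are all hypothetical.

```python
def converged(samples, upgrade_done_at, settle_seconds=0.0):
    """Return True if every request after the settle period succeeded.

    samples: iterable of (timestamp_seconds, ok) tuples, one per request.
    upgrade_done_at: time at which the upgrade was reported complete.
    settle_seconds: grace period after the upgrade in which failures are tolerated.
    """
    cutoff = upgrade_done_at + settle_seconds
    after = [ok for ts, ok in samples if ts >= cutoff]
    # Require at least one post-upgrade sample so an empty window cannot pass.
    return bool(after) and all(after)


# Hypothetical timeline: two failures while the upgrade is in progress.
samples = [(0.0, True), (1.0, False), (2.0, False), (3.0, True), (4.0, True)]

settled = converged(samples, upgrade_done_at=2.0, settle_seconds=1.0)
too_early = converged(samples, upgrade_done_at=1.0)
```

A hitless test would be the stricter variant of the same check: no failed sample anywhere in the timeline, including during the upgrade itself.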

In the future, we can explore:

  • Implementing a graceful envoy shutdown feature and providing guidance on configuring envoy for hitless in-place upgrades
  • Supporting canary deployments

WDYT?

Originally posted by @guydc in #1712 (comment)

@guydc
Contributor

guydc commented Mar 5, 2024

@arkodg - I think this test can focus on validating the functionality implemented in #2633. That is, we would only test a restart of a multi-replica proxy deployment, without also upgrading EG/Envoy in the process.

We have another task for validating the entire upgrade process holistically in #1710. In #1710 we may run into failures due to other reasons (e.g. control plane unavailability for new envoy instances).

Are you ok with the proposed scoping?
