Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle EnvoyProxy Image version upgrades #1712

Closed
Tracked by #1710
arkodg opened this issue Jul 26, 2023 · 14 comments
Closed
Tracked by #1710

Handle EnvoyProxy Image version upgrades #1712

arkodg opened this issue Jul 26, 2023 · 14 comments

Comments

@arkodg
Copy link
Contributor

arkodg commented Jul 26, 2023

No description provided.

@arkodg arkodg mentioned this issue Jul 26, 2023
4 tasks
@arkodg arkodg added this to the 0.6.0-rc1 milestone Jul 26, 2023
@arkodg arkodg added help wanted Extra attention is needed area/installation labels Jul 26, 2023
@cnvergence
Copy link
Member

cnvergence commented Aug 1, 2023

I am interested in picking this up :)

@arkodg
Copy link
Contributor Author

arkodg commented Aug 1, 2023

thanks @cnvergence !
Thinking out loud, an outcome of this issue could be a E2E where all client requests are successful while 2 replicas of Envoy Proxy are undergoing a rolling restart.

@arkodg arkodg removed the help wanted Extra attention is needed label Aug 1, 2023
@github-actions
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Aug 31, 2023
@cnvergence
Copy link
Member

@arkodg coming back to this after a while, could you please point me to where should I start?
I did check the upgrade and it seems like it is handled, but I may be wrong.

As for the E2E, I assume I should add a new scenario to the e2e test suite :)

@github-actions github-actions bot removed the stale label Sep 1, 2023
@github-actions
Copy link

github-actions bot commented Oct 1, 2023

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Oct 1, 2023
@arkodg arkodg removed the stale label Oct 2, 2023
@arkodg
Copy link
Contributor Author

arkodg commented Oct 7, 2023

hey I know @chauhanshubham was looking into a similar test for control plane upgrades which would invariably also upgrade envoy proxy, should we just collapse those two e2e tests into one where we perform an upgrade with a last known EG minor version, and ensure that

  1. there is no config churn in the data plane during an upgrade
  2. there is no traffic drop in the data plane while this upgrade this happens (assuming we always have 2 replicas of control plane and data plane running)

@arkodg arkodg modified the milestones: 0.6.0-rc1, Backlog Oct 20, 2023
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Nov 20, 2023
@Xunzhuo Xunzhuo modified the milestones: Backlog, v1.0.0-rc1 Dec 7, 2023
@github-actions github-actions bot removed the stale label Dec 7, 2023
Copy link

github-actions bot commented Jan 6, 2024

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Jan 6, 2024
@guydc
Copy link
Contributor

guydc commented Feb 14, 2024

I'm concerned that a hitless in-place upgrade of envoy is not trivial.

A graceful termination of envoy may require:

  • Failing LB/Kubelet probes to stop new connection from being established to terminating pods
  • Triggering envoy to drain listeners
  • Delaying pod termination until all connections are terminated
  • Additional factors to consider:
    • IaaS LBs behave differently in response to targets becoming unhealthy (e.g. AWS will reset all existing connections, while GCP/Azure will stop establishing new connections but retain existing connections) and have varying levels of configurability for HCs.
    • The number of envoy pods per node and LB's ExternalTrafficPolicy impact the correctness of LB HCs.

It's also important to avoid race conditions where a new instance of envoy is receiving traffic before it was configured (e.g. due to order of component restart, failures in new control plane version, etc.).

Some prior art:

@github-actions github-actions bot removed the stale label Feb 14, 2024
@guydc
Copy link
Contributor

guydc commented Feb 14, 2024

@arkodg

I executed a naive test:

  • Environment: kind, metallb, EG quickstart.yaml
  • envoy proxy replicas: 2
  • upgrade: 0.6.0 => 0.0.0-latest using helm upgrade
  • load simulation during upgrade: hey -c 100 -q 10 -z 300s -host www.example.com http://172.18.255.200/

The upgrade caused some client-facing failures during the test:

Error distribution:
  [8]	Get "http://172.18.255.200/": EOF
  [32]	Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
  [1]	Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
  [1]	Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer

It's probably possible to tune some of the parameters mentioned in my previous comment to achieve a hitless upgrade under certain test conditions (RPS, connection reuse, HTTP version, ...). But, I'm not sure that we can claim to have a hitless upgrade in general, based on such test.

So, I propose that for the GA scope, we focus on an upgrade test that ensures request convergence to successful execution after the upgrade. A limited hitless upgrade test can be a stretch-goal.

In the future, we can explore:

  • Implementing a graceful envoy shutdown feature and providing guidance on configuring envoy for hitless in-place upgrades
  • Supporting canary deployments

WDYT?

@arkodg
Copy link
Contributor Author

arkodg commented Feb 14, 2024

hey @guydc I was hoping we could have some test for hitless upgrade in v1.0, with caveats, that can hopefully we removed over time post GA
do agree, we can split this up, and make it a stretch goal for v1.0

@arkodg
Copy link
Contributor Author

arkodg commented Feb 27, 2024

this should be fixed with #2633, keeping this open so that it can be validated with a e2e

@arkodg arkodg modified the milestones: v1.0.0-rc1, v1.0.0 Feb 27, 2024
@arkodg arkodg modified the milestones: v1.0.0, v1.1.0-rc1 Mar 28, 2024
Copy link

This issue has been automatically marked as stale because it has not had activity in the last 30 days.

@github-actions github-actions bot added the stale label Apr 27, 2024
@arkodg
Copy link
Contributor Author

arkodg commented May 8, 2024

fixed with #2862

@arkodg arkodg closed this as completed May 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Development

No branches or pull requests

4 participants