@arkodg - I think that this test can really focus on validating the functionality implemented in #2633. Meaning, we will just test a restart of a multi-replica proxy deployment, without also upgrading EG/Envoy in the process.
We have another task for validating the entire upgrade process holistically in #1710. In #1710 we may run into failures due to other reasons (e.g. control plane unavailability for new Envoy instances).
I executed a naive test:
helm upgrade
hey -c 100 -q 10 -z 300s -host www.example.com http://172.18.255.200/
The upgrade caused some client-facing failures during the test:
Error distribution:
[8] Get "http://172.18.255.200/": EOF
[32] Get "http://172.18.255.200/": dial tcp 172.18.255.200:80: connect: connection refused
[1] Get "http://172.18.255.200/": read tcp 172.18.0.1:55220->172.18.255.200:80: read: connection reset by peer
[1] Get "http://172.18.255.200/": read tcp 172.18.0.1:55260->172.18.255.200:80: read: connection reset by peer
It's probably possible to tune some of the parameters mentioned in my previous comment to achieve a hitless upgrade under certain test conditions (RPS, connection reuse, HTTP version, ...). But I'm not sure that we can claim to have a hitless upgrade in general, based on such a test.
So, I propose that for the GA scope, we focus on an upgrade test that ensures request convergence to successful execution after the upgrade. A limited hitless upgrade test can be a stretch goal.
In the future, we can explore:
Implementing a graceful Envoy shutdown feature and providing guidance on configuring Envoy for hitless in-place upgrades
WDYT?
Originally posted by @guydc in #1712 (comment)
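To make "request convergence to successful execution" concrete, here is a minimal sketch of how such a check might look. It is not taken from the EG test suite: the function names (`http_probe`, `wait_for_convergence`), the streak length, the attempt budget, and the polling interval are all assumptions chosen for illustration. The idea is to tolerate failures while the proxy restarts and to pass only once a run of consecutive requests succeeds.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_probe(url: str, host: str, timeout: float = 2.0) -> bool:
    """Return True if a GET to `url` with the given Host header returns HTTP 200.

    Connection refused, resets, EOFs, and non-2xx responses all count as failures.
    """
    request = urllib.request.Request(url, headers={"Host": host})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # URLError/HTTPError, refused or reset connections, and timeouts are
        # OSError subclasses; malformed or truncated responses surface as
        # HTTPException.
        return False


def wait_for_convergence(
    probe: Callable[[], bool],
    required_successes: int = 20,   # assumed streak length, tune per test
    max_attempts: int = 300,        # assumed attempt budget
    interval: float = 1.0,          # assumed delay between probes, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds `required_successes` times in a row.

    Any failure resets the streak. Returns False if the attempt budget is
    exhausted before a full streak is observed.
    """
    streak = 0
    for _ in range(max_attempts):
        if probe():
            streak += 1
            if streak >= required_successes:
                return True
        else:
            streak = 0
        sleep(interval)
    return False
```

Against the setup from the comment above, a test could trigger the proxy restart and then call `wait_for_convergence(lambda: http_probe("http://172.18.255.200/", "www.example.com"))`, asserting that it returns `True`. Injecting `probe` and `sleep` keeps the loop testable without a live cluster. This checks recovery after the upgrade only; it says nothing about whether the upgrade was hitless.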