Describe the bug
I believe enabling Dynamic Stable Scale together with ALB TargetGroup Weight Verification can result in the stable ReplicaSet being scaled down before the TargetGroup weight is verified. Traffic can then be directed to a TargetGroup with 0 targets, resulting in errors being returned.
Relatedly, enabling Dynamic Stable Scale without ALB TargetGroup Weight Verification most likely has a similar issue, since no verification is performed in the first place.
This is a scenario I've observed several times in our production environment:
1. Traffic weight is updated from 50% canary to 100% canary.
2. There is a delay updating the underlying TargetGroups (the scenario that TargetGroup weight verification is intended to mitigate).
3. The Rollouts controller reconciles the stable ReplicaSet and scales it down to 0 replicas.
4. This results in a state where the stable TargetGroup has 0 targets but is still receiving 50% of traffic, causing errors to be returned.
5. Some time later, the underlying TargetGroup weights are updated and the desired weight is verified.
I had a brief look at the code to see if I could validate this. Here https://github.com/argoproj/argo-rollouts/blob/master/rollout/canary.go#L116 I can see `c.rollout.Status.Canary.Weights` being passed as the "current" traffic weights, which the stable replica count is based on. If the current weights are canary=100, stable=0, this would result in a stable replica count of 0. I observed in the controller logs that this field can be updated without the weight necessarily being verified: `Patched: {\"status\":{\"canary\":{\"weights\":{\"canary\":{\"weight\":100},\"stable\":{\"weight\":0},\"verified\":false}}}}`.
This seems to validate my observations and theory. From my naive perspective, assuming verification is enabled, if we need to make decisions based on traffic weight, we should only do so after it's verified.
To Reproduce
A minimal setup could be:
Expected behavior
The number of stable replicas should take into account the verified traffic weight, rather than using the unverified weight.
Version
1.6.6
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.