Dynamic Stable Scale scales down the stable ReplicaSet before verifying ALB TargetGroup weight #3926

Open
mharmer-canva opened this issue Oct 31, 2024 · 0 comments
Labels
bug Something isn't working

mharmer-canva commented Oct 31, 2024

Describe the bug

I believe enabling Dynamic Stable Scale together with ALB TargetGroup Weight Verification can result in the stable ReplicaSet being scaled down before the TargetGroup weight is verified. Traffic can then be directed to a TargetGroup with 0 targets, and errors are returned.

Relatedly, enabling Dynamic Stable Scale without ALB TargetGroup Weight Verification most likely has a similar issue, since no verification is performed in the first place.

This is a scenario I've observed several times in our production environment:

  • Traffic weight is updated from 50% canary to 100% canary
  • There is a delay updating the underlying TargetGroups (the scenario that TargetGroup weight verification is intended to mitigate)
  • Rollouts controller reconciles the stable ReplicaSet and scales it down to 0 replicas
  • This results in a state where the TargetGroup has 0 targets but is still receiving 50% of traffic, causing errors to be returned
  • Some time later, the underlying TargetGroup weights are updated and the desired weight is verified
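
To make the failure window in the scenario above concrete, here is a minimal Go sketch. It is only a model: the `targetGroup` type, the `failingShare` function, and the replica counts are hypothetical and are not taken from Argo Rollouts or the AWS API. It estimates the share of requests that land on a TargetGroup with no targets while the ALB is still applying the old 50/50 weights.

```go
package main

import "fmt"

// targetGroup is a hypothetical model of one ALB TargetGroup: the weight the
// ALB is actually applying right now and the number of registered targets.
type targetGroup struct {
	name          string
	appliedWeight int // weight currently enforced by the ALB (0-100)
	targets       int // registered targets, i.e. ready pods behind the group
}

// failingShare returns the percentage of requests routed to TargetGroups
// that have no targets, assuming requests are split by applied weight.
func failingShare(groups []targetGroup) float64 {
	total, failing := 0, 0
	for _, g := range groups {
		total += g.appliedWeight
		if g.targets == 0 {
			failing += g.appliedWeight
		}
	}
	if total == 0 {
		return 0
	}
	return 100 * float64(failing) / float64(total)
}

func main() {
	// State after the stable ReplicaSet is scaled to 0: the desired weight is
	// canary=100, but the ALB has not applied it yet and still splits 50/50.
	groups := []targetGroup{
		{name: "stable", appliedWeight: 50, targets: 0},
		{name: "canary", appliedWeight: 50, targets: 10}, // 10 is an example count
	}
	fmt.Printf("%.0f%% of requests hit an empty TargetGroup\n", failingShare(groups))
}
```

With these example numbers, half of all requests fail until the ALB catches up with the desired weights.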

I had a brief look at the code to see if I could validate this. Here https://github.com/argoproj/argo-rollouts/blob/master/rollout/canary.go#L116 I can see c.rollout.Status.Canary.Weights being passed as the "current" traffic weights, which the stable replica count is based on. If the current weights are canary=100, stable=0, this would result in a stable replica count of 0. I observed in controller logs that this field can be updated without the weight necessarily being verified:
"Patched: {\"status\":{\"canary\":{\"weights\":{\"canary\":{\"weight\":100},\"stable\":{\"weight\":0},\"verified\":false}}}}"

This seems to validate my observations and theory. From my naive perspective, assuming verification is enabled, if we need to make decisions based on traffic weight, we should only do so after it's verified.
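
For anyone who wants to check their own controller logs for the same pattern, the following Go sketch decodes the patch payload from the log line above and flags the case where the stable weight is already 0 while `verified` is still false. The `statusPatch` struct and `unverifiedFullShift` function are hypothetical: they mirror only the fields visible in that log line and are not the Argo Rollouts API types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// weightDestination and statusPatch mirror only the fields seen in the
// logged patch. They are illustrative, not the Argo Rollouts types.
type weightDestination struct {
	Weight int `json:"weight"`
}

type statusPatch struct {
	Status struct {
		Canary struct {
			Weights *struct {
				Canary   weightDestination `json:"canary"`
				Stable   weightDestination `json:"stable"`
				Verified *bool             `json:"verified"`
			} `json:"weights"`
		} `json:"canary"`
	} `json:"status"`
}

// unverifiedFullShift reports whether a patch sets the stable weight to 0
// while the weights are not (yet) marked as verified.
func unverifiedFullShift(raw string) (bool, error) {
	var p statusPatch
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		return false, err
	}
	w := p.Status.Canary.Weights
	if w == nil {
		return false, nil // patch does not touch the weights at all
	}
	verified := w.Verified != nil && *w.Verified
	return w.Stable.Weight == 0 && !verified, nil
}

func main() {
	// Payload from the log line, with the log's quote escaping removed.
	raw := `{"status":{"canary":{"weights":{"canary":{"weight":100},"stable":{"weight":0},"verified":false}}}}`
	risky, err := unverifiedFullShift(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println("unverified full shift:", risky)
}
```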

To Reproduce

A minimal setup could be:

  1. Set up Argo Rollouts and a Rollout resource with traffic routing enabled, the AWS ALB integration, ping/pong configured, Dynamic Stable Scale enabled, and TargetGroup weight verification enabled
  2. Configure setCanaryScale + setWeight steps to scale the canary RS to 100% and shift 100% of traffic
  3. Somehow delay or prevent weight verification
  4. Observe that stable replica count is set based on the unverified weight

Expected behavior

The number of stable replicas should take into account the verified traffic weight, rather than using the unverified weight.
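
One possible shape for this behavior is sketched below in Go. It is an illustration under stated assumptions, not the controller's implementation: the `weights` type, `effectiveWeights`, and the proportional `stableReplicas` formula are all hypothetical simplifications. The idea is to keep scaling against the last verified weights until the new weights are verified.

```go
package main

import (
	"fmt"
	"math"
)

// weights is a hypothetical, simplified view of the traffic weights status.
type weights struct {
	canary   int
	stable   int
	verified bool
}

// stableReplicas assumes the stable ReplicaSet is sized in proportion to the
// stable traffic weight, rounded up. The real calculation may differ.
func stableReplicas(specReplicas int, w weights) int {
	return int(math.Ceil(float64(specReplicas) * float64(w.stable) / 100))
}

// effectiveWeights returns the weights that are safe to scale against:
// the current weights once verified, otherwise the last verified ones.
func effectiveWeights(current, lastVerified weights) weights {
	if current.verified {
		return current
	}
	return lastVerified
}

func main() {
	specReplicas := 10 // example replica count
	lastVerified := weights{canary: 50, stable: 50, verified: true}
	current := weights{canary: 100, stable: 0, verified: false}

	// Scaling on the unverified weight drops stable to 0 immediately.
	fmt.Println("unverified basis:", stableReplicas(specReplicas, current))

	// Gating on verification keeps stable sized for the traffic it may still receive.
	fmt.Println("verified basis:", stableReplicas(specReplicas, effectiveWeights(current, lastVerified)))
}
```

In this sketch the stable ReplicaSet stays at 5 replicas while the 50/50 split may still be in effect, and only drops to 0 once canary=100 is verified.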

Version

1.6.6


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.
