Interaction of HPA + progressDeadlineSeconds flag can cause canary failure during peak times #944

@marcelovcpereira

Description

Hi guys, we are currently running into the following issue. What is the suggested solution for this case?

Describe the issue

Flagger doesn't progress the canary while the number of running pods is below the expected count -> OK.
But during peak times (and also during a regular canary traffic shift), a deployment with an HPA enabled can have an aggressive upscale policy in place, which may constantly change the number of expected pods as traffic increases.
This makes Flagger wait for a long time, so progressDeadlineSeconds can be exceeded and the deployment fails.
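
To make the failure mode concrete, here is a toy Python model of it. This is not Flagger's actual code: the function name `first_deadline_breach` and the `(timestamp, desired, ready)` sample format are made up for illustration, and the check is a deliberately naive "not ready for longer than the deadline" rule.

```python
from typing import Iterable, Optional, Tuple

# (timestamp in seconds, desired replicas, ready replicas) -- hypothetical sample format
Sample = Tuple[int, int, int]


def first_deadline_breach(samples: Iterable[Sample], deadline_seconds: int) -> Optional[int]:
    """Return the timestamp at which a naive deadline check fails, or None.

    The clock starts when ready < desired and is only cleared once
    ready >= desired again. It does not care *why* desired changed.
    """
    not_ready_since: Optional[int] = None
    for ts, desired, ready in samples:
        if ready >= desired:
            not_ready_since = None  # fully ready, clear the clock
            continue
        if not_ready_since is None:
            not_ready_since = ts  # first under-replicated observation
        if ts - not_ready_since >= deadline_seconds:
            return ts  # deadline exceeded, a rollback would be triggered here
    return None


# Peak traffic: the HPA raises the desired count every 30s, so the
# deployment is always one pod short even though pods do start fine.
peak = [(t, 2 + t // 30, 1 + t // 30) for t in range(0, 301, 30)]

# Stable traffic: the desired count stays at 2 and pods become ready after 30s.
stable = [(0, 2, 0), (30, 2, 2), (60, 2, 2), (300, 2, 2)]
```

With a 120 second deadline, the `peak` series breaches the deadline although every new pod becomes ready within 30 seconds, while the `stable` series never does.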

To Reproduce

Start a canary deployment and keep increasing the HPA's number of expected pods. Flagger keeps waiting for the pod count to stabilise and eventually triggers the deadline failure.

Expected behavior

Once the pods with the new version are ready, Flagger should not count subsequent upscales against progressDeadlineSeconds. It should just wait until the additional pods are ready, without triggering a rollback (as the delay was not caused by the new version).
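
One possible reading of this expected behavior, sketched in Python. The class `ScaleAwareDeadline` and its `exceeded` method are hypothetical names, not part of Flagger, and the rule "restart the clock on every upscale once the new version has been fully ready at least once" is an assumption about how the request could be implemented.

```python
from typing import Optional


class ScaleAwareDeadline:
    """Deadline check that ignores waiting time caused by upscales.

    Before the new version has ever been fully ready, the clock runs
    normally. After that, each increase of the desired replica count
    restarts the clock, so only a stall without an upscale can expire it.
    """

    def __init__(self, deadline_seconds: int) -> None:
        self.deadline_seconds = deadline_seconds
        self.waiting_since: Optional[int] = None
        self.last_desired: Optional[int] = None
        self.was_ready = False

    def exceeded(self, ts: int, desired: int, ready: int) -> bool:
        scaled_up = self.last_desired is not None and desired > self.last_desired
        self.last_desired = desired

        if ready >= desired:
            self.was_ready = True  # new version proved it can start
            self.waiting_since = None
            return False

        # Start the clock, or restart it if an upscale caused the shortfall
        # after the new version was already fully ready once.
        if self.waiting_since is None or (self.was_ready and scaled_up):
            self.waiting_since = ts
        return ts - self.waiting_since >= self.deadline_seconds
```

With this rule a real startup problem still fails fast: if the pods never become ready in the first place, upscales do not extend the deadline. The trade-off is that a pod failing to start during a long series of upscales is detected only after the scaling stops.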

Additional context

  • Increasing the deadline can lead to slower reactions in case of real problems during pod startup
  • Flagger version: 1.6.4
  • Kubernetes version: 1.18
  • Service Mesh provider: Linkerd
  • Ingress provider: -

Thanks

Labels: help wanted (extra attention is needed), kind/bug (something isn't working)
