Interaction of HPA + progressDeadlineSeconds flag can cause canary failure during peak times

Hi guys, we are currently running into the following issue. What is the suggested solution for this case?


### Describe the issue

Flagger doesn't progress the canary while the number of running pods is below the expected number -> OK.
But during peak times (and also during a regular canary traffic shift), a deployment with an HPA enabled can have an aggressive scale-up policy in place, which may keep changing the expected number of pods as traffic increases.
This makes Flagger wait for a long time, so `progressDeadlineSeconds` can be exceeded and the deployment fails.
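
To illustrate the effect, here is a minimal simulation sketch. It is not Flagger code: `Tick`, `simulate` and `first_deadline_breach` are hypothetical names, and the model assumes a simple timer that starts when ready pods fall below the desired count and only resets once they catch up again.

```python
from dataclasses import dataclass


@dataclass
class Tick:
    second: int
    desired: int  # replicas requested by the HPA (assumed model)
    ready: int    # replicas reporting ready


def simulate(duration, scale_every, startup, initial=2, step=10):
    """Add one pod every `scale_every` seconds; each new pod needs `startup` seconds."""
    ready_at = [0] * initial  # the initial pods are ready from the start
    ticks = []
    for now in range(0, duration + 1, step):
        if now > 0 and now % scale_every == 0:
            ready_at.append(now + startup)
        ready = sum(1 for r in ready_at if r <= now)
        ticks.append(Tick(now, len(ready_at), ready))
    return ticks


def first_deadline_breach(ticks, deadline_seconds):
    """Return the second at which the naive 'not ready since' timer hits the deadline."""
    not_ready_since = None
    for t in ticks:
        if t.ready >= t.desired:
            not_ready_since = None  # caught up, timer resets
            continue
        if not_ready_since is None:
            not_ready_since = t.second
        if t.second - not_ready_since >= deadline_seconds:
            return t.second
    return None


# Scale-ups arrive faster than pods start: ready pods never catch up.
print(first_deadline_breach(simulate(300, scale_every=30, startup=40), 60))
# Scale-ups are spaced out: every new pod becomes ready before the deadline.
print(first_deadline_breach(simulate(300, scale_every=100, startup=40), 60))
```

In this model every individual pod starts well within the deadline, yet the rollout still fails because the target keeps moving.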

### To Reproduce

Start a canary deployment and keep increasing the number of pods expected by the HPA. Flagger will keep waiting for the pod count to stabilise, and eventually the deadline failure is triggered.

### Expected behavior

Once the pods with the new version are ready, Flagger should not count subsequent scale-ups against `progressDeadlineSeconds`. It should just wait until the new pods are ready without triggering a rollback (as the delay was not caused by the new version).
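
The behaviour described above could look roughly like the following sketch. This is only an illustration of the idea, not a proposal for the actual Flagger implementation: `ReadinessGate`, `Verdict` and `observe` are hypothetical names, and the replica counts are assumed to be passed in by the caller.

```python
from enum import Enum


class Verdict(Enum):
    READY = "ready"
    WAIT = "wait"
    FAIL = "fail"


class ReadinessGate:
    """Apply the deadline only until the new version has been fully ready once."""

    def __init__(self, deadline_seconds):
        self.deadline_seconds = deadline_seconds
        self.not_ready_since = None
        self.was_ready_once = False

    def observe(self, now, desired, ready):
        if ready >= desired:
            self.was_ready_once = True
            self.not_ready_since = None
            return Verdict.READY
        if self.was_ready_once:
            # The new version already proved it can start; a later shortfall
            # is treated as a scale-up in progress, so wait instead of failing.
            return Verdict.WAIT
        if self.not_ready_since is None:
            self.not_ready_since = now
        if now - self.not_ready_since >= self.deadline_seconds:
            return Verdict.FAIL
        return Verdict.WAIT


gate = ReadinessGate(deadline_seconds=60)
print(gate.observe(now=0, desired=2, ready=0))    # initial rollout, timer starts
print(gate.observe(now=30, desired=2, ready=2))   # new version fully ready
print(gate.observe(now=500, desired=6, ready=4))  # HPA scale-up long after
```

One trade-off of this sketch: a pod that starts failing after the first full readiness would no longer trip the deadline, so such failures would have to be caught some other way.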

### Additional context

- Increasing the deadline can lead to slower reactions in case of real problems during pod startup
- Flagger version: 1.6.4
- Kubernetes version: 1.18
- Service Mesh provider: Linkerd
- Ingress provider: -

Thanks


Interaction of HPA + progressDeadlineSeconds flag can cause canary failure during peak times #944