fix: Mitigate the bug where items are re-added constantly to the workqueue. #1193 #1243

Merged: 5 commits, Jun 9, 2021

MarkSRobinson
Contributor

There is a deep bug where items are constantly re-added to the rollouts workqueue. This is a problem because items are subject to an exponential back-off, so each add doubles the back-off delay. The back-off maxes out at 16.6 minutes.
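
To make the doubling concrete, here is a rough sketch of the back-off schedule. It assumes a 5 ms base delay and a 1000 s cap. Those are the client-go default controller rate limiter values and match the 16.6 minute figure, but this PR does not state them, so treat both constants and the helper names as illustrative.

```go
package main

import (
	"fmt"
	"time"
)

// Assumed values (not stated in this PR): the client-go default per-item
// exponential limiter uses a 5ms base delay and a 1000s maximum.
const (
	assumedBase = 5 * time.Millisecond
	assumedMax  = 1000 * time.Second
)

// backoffDelay returns the delay applied after `readds` earlier rate-limited
// adds of the same item: base * 2^readds, capped at max.
func backoffDelay(base, max time.Duration, readds int) time.Duration {
	d := base
	for i := 0; i < readds && d < max; i++ {
		d *= 2
	}
	if d > max {
		return max
	}
	return d
}

// addsUntilCap returns how many earlier adds it takes for the delay to
// reach the cap.
func addsUntilCap(base, max time.Duration) int {
	n := 0
	for backoffDelay(base, max, n) < max {
		n++
	}
	return n
}

func main() {
	for _, n := range []int{0, 1, 5, 10, 15, 17, 18} {
		fmt.Printf("after %2d re-adds: %v\n", n, backoffDelay(assumedBase, assumedMax, n))
	}
	fmt.Println("re-adds needed to hit the cap:", addsUntilCap(assumedBase, assumedMax))
}
```

Under these assumptions the delay reaches the cap after 18 re-adds of the same key, so a rollout that keeps getting re-added gets there quickly.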

This fix prevents Argo Rollouts from hanging for up to 16 minutes at a time when this case happens, by reducing the maximum back-off to 10 seconds.
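
As a rough illustration of what lowering the cap does, here is a stdlib-only stand-in for a per-item exponential back-off limiter. It is not the code from this PR: `cappedLimiter`, `worstDelay`, the 5 ms base delay, and the 60 re-adds are all hypothetical. In client-go the equivalent cap is the second argument to `workqueue.NewItemExponentialFailureRateLimiter(baseDelay, maxDelay)`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Assumed values for illustration; only the 10s cap comes from this PR.
const (
	assumedBase = 5 * time.Millisecond
	oldCap      = 1000 * time.Second // roughly the 16.6 minutes mentioned above
	newCap      = 10 * time.Second
)

// cappedLimiter is a hypothetical per-item exponential back-off limiter.
type cappedLimiter struct {
	mu     sync.Mutex
	readds map[string]int
	base   time.Duration
	max    time.Duration
}

func newCappedLimiter(base, max time.Duration) *cappedLimiter {
	return &cappedLimiter{readds: map[string]int{}, base: base, max: max}
}

// When records one more rate-limited add for key and returns how long the
// item should wait: base * 2^(previous adds), never more than max.
func (l *cappedLimiter) When(key string) time.Duration {
	l.mu.Lock()
	defer l.mu.Unlock()
	n := l.readds[key]
	l.readds[key] = n + 1
	d := l.base
	for i := 0; i < n && d < l.max; i++ {
		d *= 2
	}
	if d > l.max {
		d = l.max
	}
	return d
}

// Forget resets the back-off for key, e.g. after a successful reconcile.
func (l *cappedLimiter) Forget(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.readds, key)
}

// worstDelay re-adds the same key `adds` times and returns the last delay.
func worstDelay(l *cappedLimiter, key string, adds int) time.Duration {
	var d time.Duration
	for i := 0; i < adds; i++ {
		d = l.When(key)
	}
	return d
}

func main() {
	const adds = 60 // hypothetical number of re-adds of one rollout key
	fmt.Println("old cap:", worstDelay(newCappedLimiter(assumedBase, oldCap), "default/my-rollout", adds)) // 16m40s
	fmt.Println("new cap:", worstDelay(newCappedLimiter(assumedBase, newCap), "default/my-rollout", adds)) // 10s
}
```

With the lower cap the item can still be pushed to the maximum delay by repeated adds, which is why this is a mitigation rather than a fix for the underlying re-add bug. The worst case just shrinks from minutes to seconds.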

Checklist:

  • Either (a) I've created an enhancement proposal and discussed it with the community, (b) this is a bug fix, or (c) this is a chore.
  • The title of the PR is (a) conventional, (b) states what changed, and (c) suffixes the related issues number. E.g. "fix(controller): Updates such and such. Fixes #1234".
  • I've signed my commits with DCO
  • I have written unit and/or e2e tests for my change. PRs without these are unlikely to be merged.
  • My builds are green. Try syncing with master if they are not.
  • My organization is added to USERS.md.

@MarkSRobinson changed the title from "[fix] Mitigate the bug where items are re-added constantly to the workqueue." to "[fix] Mitigate the bug where items are re-added constantly to the workqueue. #1193" on Jun 2, 2021
@MarkSRobinson changed the title from "[fix] Mitigate the bug where items are re-added constantly to the workqueue. #1193" to "fix: Mitigate the bug where items are re-added constantly to the workqueue. #1193" on Jun 2, 2021
…queue.

This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout.

Signed-off-by: Mark Robinson <mrobinson@plaid.com>
@codecov

codecov bot commented Jun 2, 2021

Codecov Report

Merging #1243 (559d6fb) into master (d9d1237) will increase coverage by 0.02%.
The diff coverage is 83.33%.

@@            Coverage Diff             @@
##           master    #1243      +/-   ##
==========================================
+ Coverage   81.40%   81.42%   +0.02%     
==========================================
  Files         106      106              
  Lines        9527     9531       +4     
==========================================
+ Hits         7755     7761       +6     
+ Misses       1251     1250       -1     
+ Partials      521      520       -1     
| Impacted Files | Coverage Δ |
|---|---|
| rollout/controller.go | 78.35% <66.66%> (-0.10%) ⬇️ |
| ...ctl-argo-rollouts/viewcontroller/viewcontroller.go | 71.84% <100.00%> (ø) |
| rollout/trafficrouting/istio/controller.go | 45.73% <100.00%> (+2.32%) ⬆️ |
| utils/controller/controller.go | 82.92% <100.00%> (+0.13%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Member

@jessesuen jessesuen left a comment

This is a great find! Could you describe the scenario where this happens? I'm just surprised we have not come across this.

Also, can you fix linting errors? After that it looks good to merge.

@MarkSRobinson
Contributor Author

I'm not entirely sure, since it is hard to reproduce, but the main factors are a high pod count (>20) and analysis runs that are frequent and don't terminate, i.e. analysis checks running every 10s. It also correlates with long deployment times (>10m).

Signed-off-by: Mark Robinson <mrobinson@plaid.com>
@sonarcloud

sonarcloud bot commented Jun 9, 2021

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

No coverage information
No duplication information

@jessesuen jessesuen merged commit 79739fd into argoproj:master Jun 9, 2021
caoyang001 pushed a commit to caoyang001/argo-rollouts that referenced this pull request Jun 12, 2021
…queue. argoproj#1193 (argoproj#1243)

This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout.

Signed-off-by: Mark Robinson <mrobinson@plaid.com>
Signed-off-by: caoyang001 <caoyang001@foxmail.com>
huikang pushed a commit to huikang/argo-rollouts that referenced this pull request Sep 16, 2021
…queue. argoproj#1193 (argoproj#1243)

This will prevent argo from hanging for up to 16 minutes at a time while processing a rollout.

Signed-off-by: Mark Robinson <mrobinson@plaid.com>
Signed-off-by: caoyang001 <caoyang001@foxmail.com>