Skip to content

Conversation

@forrestIsRunning
Copy link

Replace fixed delay retry with Kubernetes workqueue rate limiter

  • Add TypedItemFastSlowRateLimiter for adaptive retry delays
  • Implement fast/slow retry strategy (100ms for first 5, 500ms after)
  • Add automatic state tracking and cleanup
  • Update maybeRetry() to use rate limiter instead of fixed sleep
  • Clean up rate limiter state on successful processing

Resolves TODO: Use RateLimitingInterface and NewItemFastSlowRateLimiter

Fixes #

Proposed Changes

  • Replace fixed 500ms delay with workqueue.TypedItemFastSlowRateLimiter for adaptive retries
  • Add rateLimiter field to Forwarder to track retry state per stat
  • Update maybeRetry() to use rateLimiter.When() and rateLimiter.NumRequeues()
  • Clean up rate limiter state on success and max retries exceeded
  • Add configurable constants: fastRetryDelay (100ms), slowRetryDelay (500ms), maxFastRetryAttempts (5)

Why these changes?
Fixed delay doesn't adapt to failure scenarios. Fast retries help transient errors recover quickly, while slow retries reduce load for persistent issues. Uses Kubernetes standard workqueue rate limiting.

Release Note
elease-note
feat: implement rate limiting retry mechanism for stat forwarder

Replace fixed delay retry with Kubernetes workqueue rate limiter. Uses fast retries (100ms) for first 5 attempts and slower retries (500ms) after, improving error recovery while reducing system load.

Replace fixed delay retry with Kubernetes workqueue rate limiter

- Add TypedItemFastSlowRateLimiter for adaptive retry delays
- Implement fast/slow retry strategy (100ms for first 5, 500ms after)
- Add automatic state tracking and cleanup
- Update maybeRetry() to use rate limiter instead of fixed sleep
- Clean up rate limiter state on successful processing

Resolves TODO: Use RateLimitingInterface and NewItemFastSlowRateLimiter
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jan 10, 2026

CLA Missing ID

@knative-prow
Copy link

knative-prow bot commented Jan 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: forrestIsRunning
Once this PR has been reviewed and has the lgtm label, please assign dprotaso for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow
Copy link

knative-prow bot commented Jan 10, 2026

Welcome @forrestIsRunning! It looks like this is your first PR to knative/serving 🎉

@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 10, 2026
@knative-prow
Copy link

knative-prow bot commented Jan 10, 2026

Hi @forrestIsRunning. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot requested review from dprotaso and skonto January 10, 2026 02:28
@dprotaso
Copy link
Member

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jan 21, 2026
@codecov
Copy link

codecov bot commented Jan 21, 2026

Codecov Report

❌ Patch coverage is 77.77778% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.24%. Comparing base (09e59ec) to head (8e630e1).
⚠️ Report is 27 commits behind head on main.

Files with missing lines Patch % Lines
pkg/autoscaler/statforwarder/forwarder.go 77.77% 3 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #16330      +/-   ##
==========================================
+ Coverage   80.11%   80.24%   +0.12%     
==========================================
  Files         215      216       +1     
  Lines       13391    13443      +52     
==========================================
+ Hits        10728    10787      +59     
+ Misses       2304     2293      -11     
- Partials      359      363       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@dprotaso dprotaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add rateLimiter field to Forwarder to track retry state per stat

The original retry mechanism has an individual delay per each stat.

Using the workqueue with the key being the revision means that all retried stats of the same revision will effect each other - specifically 5 delayed stats from the same revision will mean the 6th retied stat will be have to wait 500ms.

I think we can drop the workqueue as part of the implementation.

I wonder if it's just simpler to use the s.retry count and adjust the sleep based on that value (without the queue).

Let me know what you think and whether I interpreted your change incorrectly

@dprotaso
Copy link
Member

Also note to contribute to the project you will need to sign the CLA (EasyCLA) which is a LinuxFoundation thing

@dprotaso
Copy link
Member

/hold

@knative-prow knative-prow bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants