Description
Is your feature request related to a problem?/Why is this needed
This enhancement is to improve the requeue behavior for syncing VolumeSnapshotContent resources.

VolumeSnapshotContent resources are reconciled via the `contentQueue`, which is rate limited with an exponential backoff. For long-running snapshots this backoff is useful, since it reduces the amount of polling required to determine whether a snapshot is `readyToUse=true`. However, the exponential nature of this backoff means the `contentQueue` rate limiter can quickly reach its maximum: the current base delay is 1 second and the current maximum is 300 seconds, so only 9 requeue events are required to reach the maximum. This limit can quickly be reached today if a VolumeSnapshotContent is updated, because updates (especially re-entrant updates) trigger a resync and requeue, which quickly bumps up the rate limiter's retry count and results in long requeue wait times.
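For illustration, here is a minimal sketch of how that backoff grows, using client-go's `workqueue` exponential rate limiter with the default values above (the item key is made up):

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Mirrors the defaults described above: 1s base delay, 300s cap.
	rl := workqueue.NewItemExponentialFailureRateLimiter(time.Second, 300*time.Second)

	item := "content-abc" // illustrative item key
	for i := 1; i <= 10; i++ {
		// Each When() call records one requeue and doubles the delay.
		fmt.Printf("requeue %2d -> wait %v\n", i, rl.When(item))
	}
	// Prints 1s, 2s, 4s, ..., 256s, then 5m0s once the cap is hit.
}
```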
Describe the solution you'd like in detail
There are two things that should be fixed here:

- Prevent updates from bumping the requeue rate limiter exponent: ideally, an additional call to `contentQueue.AddRateLimited()` should not increase the rate limiter exponent if an item is already scheduled to be requeued. It should either maintain the same requeue schedule, or be adjusted to requeue further into the future but with the same backoff exponent (see the sketch after this list).
- Reduce the number of re-entrant updates. This can reduce the number of requeues (which can lead to the problem above). Some updates are necessary for tracking the lifecycle of a VolumeSnapshotContent, but it appears that the `snapshot.storage.kubernetes.io/volumesnapshot-being-created` annotation could be removed earlier, prior to the snapshot actually being marked as `readyToUse`.
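A minimal sketch of the first idea, written as a wrapper around client-go's `workqueue.RateLimiter` (the `dedupingRateLimiter` type and its behavior are hypothetical, not existing external-snapshotter code):

```go
package limiter

import (
	"sync"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// dedupingRateLimiter sketches the proposed behavior: while an item is
// already scheduled for requeue, When() reuses the current delay instead
// of advancing the exponential backoff.
type dedupingRateLimiter struct {
	inner   workqueue.RateLimiter
	mu      sync.Mutex
	pending map[interface{}]time.Duration
}

func NewDedupingRateLimiter(inner workqueue.RateLimiter) workqueue.RateLimiter {
	return &dedupingRateLimiter{inner: inner, pending: map[interface{}]time.Duration{}}
}

func (r *dedupingRateLimiter) When(item interface{}) time.Duration {
	r.mu.Lock()
	defer r.mu.Unlock()
	// Re-entrant add while a requeue is pending: keep the same schedule
	// rather than bumping the exponent.
	if d, ok := r.pending[item]; ok {
		return d
	}
	d := r.inner.When(item)
	r.pending[item] = d
	return d
}

// Forget clears the pending marker and the inner limiter's failure count.
// A production version would also need to clear the marker when the item
// is actually dequeued, so later genuine failures still advance the backoff.
func (r *dedupingRateLimiter) Forget(item interface{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.pending, item)
	r.inner.Forget(item)
}

func (r *dedupingRateLimiter) NumRequeues(item interface{}) int {
	return r.inner.NumRequeues(item)
}
```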
Describe alternatives you've considered
A quick-fix alternative is simply to decrease the maximum exponential backoff of the `contentQueue` to a lower default (e.g. 30 or 60 seconds). This could be used by a CO to reduce the likelihood of high-latency VolumeSnapshotContent reconciliation.
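A sketch of this alternative, assuming the queue is constructed with client-go's `workqueue` helpers (the queue name and the 60-second cap are illustrative; the actual wiring in external-snapshotter may differ):

```go
package main

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newContentQueue builds a rate-limited queue with a 60s backoff cap
// instead of the current 300s maximum, so an item that has hit the cap
// is still retried within a minute.
func newContentQueue() workqueue.RateLimitingInterface {
	rl := workqueue.NewItemExponentialFailureRateLimiter(time.Second, 60*time.Second)
	return workqueue.NewNamedRateLimitingQueue(rl, "snapshot-content")
}
```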
Additional context