Improve csi-snapshotter VolumeSnapshotContent requeue fairness #1282

Open
@pwschuurman


Is your feature request related to a problem?/Why is this needed

This enhancement would improve the requeue behavior when syncing VolumeSnapshotContent resources.

VolumeSnapshotContent resources are reconciled via the contentQueue, which requeues items with exponential backoff. For long-running snapshots this is useful, because it reduces the amount of polling needed to determine whether a snapshot is readyToUse=true. However, the exponential nature of the backoff means the contentQueue rate limiter can quickly reach its maximum. The current default initial delay is 1 second and the current maximum is 300 seconds, so it takes only 9 requeue events to reach the maximum. This limit is easily reached today if a VolumeSnapshotContent is updated: updates (especially re-entrant updates) trigger a resync and requeue, which quickly bumps up the rate limiter's retry count and results in long requeue wait times.

Describe the solution you'd like in detail

There are two things that should be fixed here:

  1. Prevent updates from bumping the requeue rate limiter exponent: Ideally, an additional call to contentQueue.AddRateLimited() should not increase the rate limiter exponent if an item is already scheduled to be requeued. It should either keep the same requeue schedule, or be adjusted to requeue further into the future, but with the same backoff exponent.
  2. Reduce the number of re-entrant updates: This can reduce the number of requeues (which can lead to the problem above). Some updates are necessary for tracking the lifecycle of a VolumeSnapshotContent. However, it appears that the snapshot.storage.kubernetes.io/volumesnapshot-being-created annotation can be removed early, before the snapshot is actually marked as readyToUse.

Describe alternatives you've considered

A quick-fix alternative is simply to lower the default maximum exponential backoff of contentQueue (e.g., to 30 or 60 seconds). A CO could use this to reduce the likelihood of high-latency VolumeSnapshotContent reconciliation.
