-
Notifications
You must be signed in to change notification settings - Fork 4.3k
Description
Which component are you using?:
/area cluster-autoscaler
/area core-autoscaler
/wg device-management
Is your feature request designed to solve a problem? If so describe the problem this feature should solve.:
There are 2 ClusterSnapshotStore implementations with very different performance characteristics:
BasicSnapshotStoreis a very simple, reference implementation that clones the whole state duringFork(). It's easy to understand and can be used e.g. in tests, but the complexity of operations is not optimized for the typical usage patterns during a Cluster Autoscaler loop. Not really intended for production use because of this.DeltaSnapshotStoreis a much more complex implementation, that branches and keeps deltas separately for everyFork(). The complexity of operations is optimized for typical Cluster Autoscaler usage patterns. This is the de-facto production implementation.
In order for DRA autoscaling to work, a ClusterSnapshotStore implementation has to integrate with dynamicresources.Snapshot. This means correctly handling the DRA snapshot during Fork()/Commit()/Revert().
For DRA autoscaling MVP, only BasicSnapshotStore was integrated with dynamicresources.Snapshot. It's pretty trivial in this case - we just need dynamicresources.Snapshot.Clone(). This was enough to test the MVP, but for production use we need to integrate DeltaSnapshotStore as well.
Describe the solution you'd like.:
- Add the ability to chain multiple
dynamicresources.Snapshotobjects, where each object represents a delta from the previous one. - The chain can be queried/modified in the same way as a single
dynamicresources.Snapshot. Queries fall back the chain and are expensive, modifications are applied to the top object in the chain and are cheap. - The chain can be consolidated into a single
dynamicresources.Snapshot. DeltaSnapshotStorekeeps a chain ofdynamicresources.Snapshotobjects, and modifies the chain in the same way as the NodeInfo storage chain duringFork()/Commit()/Revert().- The chain is probably easiest to implement by turning
dynamicresources.Snapshotinto an interface and introducing another "DeltaChain" implementation that uses the current one internally.
Describe any alternative solutions you've considered.:
IMO we should avoid having two completely separate Basic/Delta implementations for dynamicresources.Snapshot like we do for ClusterSnapshotStore. This pattern leads to duplicating large portions of non-trivial code and extending ClusterSnapshotStore is more painful than it should because of it. The Delta/Chain implementation should use the Basic one internally instead.
Additional context.:
This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in kubernetes/kubernetes#118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.