☂️ CA-MCM Overhaul

**How to categorize this issue?**
/area control-plane
/kind epic
/priority 3

**What would you like to be added**:
_TO BE FILLED_

**Why is this needed**:
There are several reasons to re-examine CA and MCM, both in isolation and in how they interact.
- Over time, the MCM code base has become quite complex and difficult to maintain.
- CA
   - We maintain a fork of CA in order to include the MCM provider. Syncing with upstream and releasing for every new k8s version is a recurring effort.
   - Over time our fork has diverged further from upstream, to the point where we have started to alter the provider-agnostic core of CA. https://github.com/gardener/autoscaler/issues/99 highlights this need, and https://github.com/gardener/autoscaler/issues/30 commented out the `fixNodeGroupSize` logic in core CA.
   - With 90+ CLI options, tuning CA for any consumer is tricky.
   - The design philosophy of CA centres around node groups, each of which has a 1:1 correspondence with a specific machine type and zone. This is limiting, as consumers want more flexibility w.r.t. machine types across zones, because resource quotas per machine type in any specific zone are always a challenge.
   - For spot instances, providers recommend greater flexibility w.r.t. machine types, zones and regions, to improve the chances of obtaining spot instances and to reduce the likelihood of spot evictions. This can technically be realised using node groups, but it easily leads to a combinatorial explosion of node groups and complicated expander rules.
   - The scheduler module used in CA differs in configuration from `kube-scheduler`, which can lead to differing pod scheduling outcomes. One such issue, https://github.com/kubernetes/autoscaler/issues/6227, was raised recently.
- CA and MCM overlap in the functionality they offer. This results in race conditions caused by concurrent actions taken by CA and MCM, and handling these further complicates the code base. For instance, https://github.com/gardener/autoscaler/issues/181 was raised highlighting one such issue.
- There is also an [ask](https://github.com/kubernetes/autoscaler/issues/5394) to turn CA into a library, but because of the massive effort involved this is unlikely to be taken up.
- Because there are several binaries (CA running in a separate pod, MCM with 2 containers in a single pod), we have the following problems:
   - End-to-end traceability is often a challenge, and a lot of time is spent analysing issues.
   - Setting up a debugging session across binaries is a challenge. An issue, https://github.com/gardener/autoscaler/issues/185, was recently raised for this.
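
To make the node group explosion mentioned above concrete, here is a minimal sketch in Go. It assumes only what is stated above, namely that a node group maps 1:1 to a machine type and zone. The `nodeGroup` type, the `enumerateNodeGroups` helper and the machine type and zone names are hypothetical and are not taken from the CA or MCM code base.

```go
package main

import "fmt"

// nodeGroup is a hypothetical stand-in for a CA node group:
// exactly one machine type in exactly one zone.
type nodeGroup struct {
	MachineType string
	Zone        string
}

// enumerateNodeGroups returns one node group per (machine type, zone) pair,
// which is what a 1:1 mapping forces a consumer to configure.
func enumerateNodeGroups(machineTypes, zones []string) []nodeGroup {
	groups := make([]nodeGroup, 0, len(machineTypes)*len(zones))
	for _, mt := range machineTypes {
		for _, z := range zones {
			groups = append(groups, nodeGroup{MachineType: mt, Zone: z})
		}
	}
	return groups
}

func main() {
	// Placeholder names, chosen only for illustration.
	machineTypes := []string{"type-a", "type-b", "type-c", "type-d"}
	zones := []string{"zone-1", "zone-2", "zone-3"}

	groups := enumerateNodeGroups(machineTypes, zones)
	fmt.Printf("%d machine types x %d zones = %d node groups\n",
		len(machineTypes), len(zones), len(groups))
}
```

The count grows multiplicatively: every additional machine type adds one node group per zone, and spreading across regions multiplies the total again. Each of these groups would also need to be covered by expander rules.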

> The quantum of change and the new direction (if any) proposed as part of this epic will also have an impact on the ongoing discussions on how to enhance worker pool configurations: https://github.com/gardener/gardener/issues/8142 and an internal (draft) proposal on `Enhancing Gardener's Worker Pool Configuration`.

☂️ CA-MCM Overhaul #895

unmarshall opened on Jan 19, 2024
