CA DRA: review DRA-related error policy #7784

**Which component are you using?**:

/area cluster-autoscaler
/area core-autoscaler
/wg device-management

**Is your feature request designed to solve a problem? If so describe the problem this feature should solve.**:

Cluster Autoscaler tends to error out and break the whole loop on any unexpected error, and the [DRA MVP PR](https://github.com/kubernetes/autoscaler/pull/7530) mostly follows this approach for simplicity. This is not a good direction in general: we've had a number of issues in GKE CA where a bug affecting only a small subset of pods/nodes broke CA completely, because the resulting error aborted the whole loop.

**Describe the solution you'd like.**:

We should holistically rethink whether CA can proceed with the loop when it encounters DRA-related errors (and ideally non-DRA-related errors as well, but that's a separate issue).

**Additional context.**:

This is a part of Dynamic Resource Allocation (DRA) support in Cluster Autoscaler. An MVP of the support was implemented in #7530 (with the whole implementation tracked in https://github.com/kubernetes/kubernetes/issues/118612). There are a number of post-MVP follow-ups to be addressed before DRA autoscaling is ready for production use - this is one of them.
