Description
How to categorize this issue?
/area performance
/kind bug
/priority 2
What happened:
The autoscaler's fixNodeGroupSize logic interferes with the meltdown logic, where we replace only maxReplacement machines per machinedeployment: it ends up removing the other Unknown machines as well.
What you expected to happen:
Even when the autoscaler takes the decision to DecreaseTargetSize, it should not be able to remove Unknown machines, because the node object is actually present for them.
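To illustrate the expectation, here is a minimal Go sketch of a node-group guard that only lets DecreaseTargetSize shrink by machines that never registered a node object; the type, field, and behaviour here are hypothetical and this is not the actual mcm cloud provider implementation in cluster-autoscaler:

```go
package main

import "fmt"

// machineDeploymentNodeGroup is a hypothetical stand-in for the autoscaler's
// view of a MachineDeployment-backed node group.
type machineDeploymentNodeGroup struct {
	targetSize          int // desired machinedeployment replicas
	machinesWithoutNode int // machines that never registered a node, e.g. Pending
}

// DecreaseTargetSize sketches the expectation from this issue: the target may
// only shrink by capacity that never materialised as a node, so machines with
// existing node objects (Running or Unknown) stay untouched.
func (ng *machineDeploymentNodeGroup) DecreaseTargetSize(delta int) error {
	if delta >= 0 {
		return fmt.Errorf("delta must be negative, got %d", delta)
	}
	if -delta > ng.machinesWithoutNode {
		return fmt.Errorf(
			"cannot shrink by %d: only %d machine(s) lack a node object",
			-delta, ng.machinesWithoutNode)
	}
	ng.targetSize += delta
	ng.machinesWithoutNode += delta // assume the node-less machines are the ones removed
	return nil
}

func main() {
	// Scenario from this issue: 2 replicas, one machine Unknown (zone
	// unreachable, node object still present) and one replacement in Pending.
	ng := &machineDeploymentNodeGroup{targetSize: 2, machinesWithoutNode: 1}
	fmt.Println(ng.DecreaseTargetSize(-1)) // ok: covered by the Pending machine
	fmt.Println(ng.DecreaseTargetSize(-1)) // rejected: only the Unknown machine is left
}
```

In the real flow the machineSet controller picks which machine to delete once the replicas are reduced, which is why the root cause below matters.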
How to reproduce it (as minimally and precisely as possible):
- Create a machinedeployment with 2 replicas (it is assumed the autoscaler is enabled for the cluster)
- Block all traffic to/from the zone the machinedeployment is meant for
- With the default maxReplacement of 1, one node will stay in Pending state
- After around 20 minutes, the Unknown machine gets deleted when the autoscaler fixes the node group size by reducing the machinedeployment replicas to 1
Anything else we need to know?:
This is happening because of the way the machineSet controller prioritizes machines for deletion based on their status:
machine-controller-manager/pkg/controller/controller_utils.go, lines 769 to 776 at d7e3c5d
We need to look into any other implications of prioritizing Pending machines over Unknown machines as part of the solution.
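For reference, a minimal sketch (not the code at d7e3c5d) of how such status-based deletion prioritization behaves, and where swapping the ranks of Pending and Unknown would change which machine gets picked when the machinedeployment is scaled from 2 to 1; the phase names mirror MCM's machine phases, but the ordering and helpers below are illustrative only:

```go
package main

import (
	"fmt"
	"sort"
)

type MachinePhase string

const (
	MachineUnknown MachinePhase = "Unknown"
	MachinePending MachinePhase = "Pending"
	MachineRunning MachinePhase = "Running"
)

type Machine struct {
	Name  string
	Phase MachinePhase
}

// deletionPriority returns a rank; lower ranks are deleted first.
// Swapping the ranks of Unknown and Pending (the change discussed in this
// issue) would make the controller delete the Pending machine instead of
// the Unknown one whose node object still exists.
func deletionPriority(p MachinePhase) int {
	switch p {
	case MachineUnknown:
		return 0 // currently deleted first
	case MachinePending:
		return 1
	default: // Running and everything else
		return 2
	}
}

// machinesToDelete picks the diff cheapest machines according to the ranking.
func machinesToDelete(machines []Machine, diff int) []Machine {
	sorted := append([]Machine(nil), machines...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return deletionPriority(sorted[i].Phase) < deletionPriority(sorted[j].Phase)
	})
	if diff > len(sorted) {
		diff = len(sorted)
	}
	return sorted[:diff]
}

func main() {
	machines := []Machine{
		{Name: "machine-a", Phase: MachineUnknown}, // node object exists, zone unreachable
		{Name: "machine-b", Phase: MachinePending}, // replacement, held back by maxReplacement
	}
	// Scaling the machinedeployment from 2 to 1 selects the Unknown machine today.
	fmt.Println(machinesToDelete(machines, 1))
}
```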
Environment:
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration:
- Others: CA version 1.23.1