
Computation of expendable pods does not consider preemption policy #6227

Closed
unmarshall opened this issue Oct 26, 2023 · 13 comments · Fixed by #6577
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


unmarshall commented Oct 26, 2023

Which component are you using?:
cluster-autoscaler

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.7", GitCommit:"84e1fc493a47446df2e155e70fca768d2653a398", GitTreeState:"clean", BuildDate:"2023-07-19T12:16:45Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

What did you expect to happen?:

If there is an unschedulable pod, CA attempts to determine whether this pod should really be considered unschedulable (as indicated by the kube-scheduler) or whether it can be scheduled onto an existing node. This check is done in every run of static_autoscaler.go by the podlistprocessor. To do this it runs a scheduler simulation, and one of the filter plugins that gets called is scheduler/framework/plugins/noderesources/fit.go. This is where the requests of the Pod are matched against the NodeInfo constructed by CA.
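
For intuition, the fit check boils down to comparing the pending pod's requests against the node's allocatable resources minus what the pods already present in CA's snapshot request. A rough sketch of that idea (illustrative only, not the actual fit.go or CA code; the helper names are made up):

package sketch

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// sumRequests adds up the container requests for a single resource across pods.
func sumRequests(pods []*corev1.Pod, name corev1.ResourceName) *resource.Quantity {
    total := resource.NewQuantity(0, resource.DecimalSI)
    for _, p := range pods {
        for _, c := range p.Spec.Containers {
            if q, ok := c.Resources.Requests[name]; ok {
                total.Add(q)
            }
        }
    }
    return total
}

// podFits reports whether the pending pod's CPU and memory requests fit into
// what is left on the node after accounting for the pods CA already placed in
// its snapshot (the real filter also handles init containers, overhead, etc.).
func podFits(pending *corev1.Pod, node *corev1.Node, podsOnNode []*corev1.Pod) bool {
    for _, res := range []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory} {
        free := node.Status.Allocatable[res].DeepCopy()
        free.Sub(*sumRequests(podsOnNode, res))
        if free.Cmp(*sumRequests([]*corev1.Pod{pending}, res)) < 0 {
            return false
        }
    }
    return true
}

If a deployed pod is missing from podsOnNode, the simulated free capacity is overstated by exactly that pod's requests.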

I have changed the names of the pod and node for easier reference below.

In our case kube-scheduler failed to schedule the pod with the following captured as part of the Pod status:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-10-25T10:25:21Z"
    message: '0/13 nodes are available: 2 Insufficient memory, 2 node(s) had untolerated
      taint {taint-name: arm64}, 9 node(s) didn''t match Pod''s node affinity/selector.
      preemption: not eligible due to preemptionPolicy=Never..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending

Whereas CA suggested that it is possible to schedule this pod onto an existing node.

I1026 03:55:42.730901       1 filter_out_schedulable.go:63] Filtering out schedulables
I1026 03:55:42.730914       1 hinting_simulator.go:63] Looking for place for default/test-pod
I1026 03:55:42.731025       1 hinting_simulator.go:77] Pod default/test-pod can be moved to test-node
I1026 03:55:42.731043       1 filter_out_schedulable.go:120] 1 pods marked as unschedulable can be scheduled.
I1026 03:55:42.731057       1 filter_out_schedulable.go:75] Schedulable pods present
I1026 03:55:42.731079       1 static_autoscaler.go:512] No unschedulable pods

The node test-node in question already had 18 pods deployed onto it, with the following allocated resources:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests            Limits
  --------           --------            ------
  cpu                3566m (45%)         17520m (221%)
  memory             129289570250 (96%)  130410Mi (102%)

The test-pod was requesting the following resources:

  requests:
     cpu: "2"
     memory: 18000Mi

This would never fit test-node, and kube-scheduler was correct in determining that. But CA was still suggesting that this pod could be scheduled onto test-node.
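
For a rough sense of the numbers (assuming the percentages above are relative to the node's allocatable memory): allocatable memory is about 130410Mi / 1.02 ≈ 127850Mi, of which roughly 123300Mi (96%) is already requested, leaving only about 4550Mi free, far short of the 18000Mi the pod asks for.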

When we looked closer at the pods populated in the ClusterSnapshot constructed by CA, we noticed that it listed only 17 of the 18 pods, and thus its computation of the total requested resources on the node was not accurate.

We found the deployed pod that was not listed by CA and therefore not considered in the computation of total requests on the node. This pod (let's call it exp-pod-1) had the following in its spec:

 preemptionPolicy: Never
 priority: -15

The cluster-autoscaler was started (among other flags) with:

--expendable-pods-priority-cutoff=-10

When we looked at expendable.go we noticed that, while computing the total requested resources on a node, CA filters out pods which satisfy:

pod.Spec.Priority != nil && int(*pod.Spec.Priority) < expendablePodsPriorityCutoff

This is incomplete, as it should also consider the preemption policy of the unschedulable pods, which the kube-scheduler does take into account. Without that there will always be a difference between the decisions taken by CA and kube-scheduler, and in a case like this it results in no scale-up when one should have happened.
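
A minimal sketch of what taking the preemption policy into account could look like, assuming the decision is made per pending pod (illustrative names only, not the actual expendable.go code):

package sketch

import corev1 "k8s.io/api/core/v1"

// canPreempt reports whether the pending pod is allowed to preempt other pods.
func canPreempt(pending *corev1.Pod) bool {
    return pending.Spec.PreemptionPolicy == nil ||
        *pending.Spec.PreemptionPolicy != corev1.PreemptNever
}

// nonExpendablePods returns the pods whose requests must still be counted
// against the node when simulating whether the pending pod could fit there.
func nonExpendablePods(podsOnNode []*corev1.Pod, pending *corev1.Pod, cutoff int) []*corev1.Pod {
    var kept []*corev1.Pod
    for _, p := range podsOnNode {
        lowPriority := p.Spec.Priority != nil && int(*p.Spec.Priority) < cutoff
        // Only ignore a low-priority pod if the pending pod could actually
        // preempt it; a pod with preemptionPolicy: Never cannot evict anything.
        if lowPriority && canPreempt(pending) {
            continue
        }
        kept = append(kept, p)
    }
    return kept
}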

What happened instead?:
We expected a scale-up to happen, as test-pod was marked unschedulable by the kube-scheduler and the node that CA suggested for this pod did not have sufficient memory. Instead, no scale-up was triggered.

How to reproduce it (as minimally and precisely as possible):

  1. Start cluster-autoscaler with --expendable-pods-priority-cutoff=-10.
  2. Create a pod with:

     preemptionPolicy: Never
     priority: -15

     Make sure that this pod gets deployed onto a node.
  3. Create another pod with requests higher than what is available on this node. This pod should have preemptionPolicy: Never.

unmarshall added the kind/bug label on Oct 26, 2023
@unmarshall (Author)

A similar issue was reported earlier and was closed without any comments/activity.

@himanshu-kun

@vadasambar could you help here?

@vadasambar (Member)

👀

@vadasambar (Member)

Sounds like initializing the clusterSnapshot with both expendable and non-expendable pods might fix the issue. We might have to assess the impact of the change, though.
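
As a very rough illustration of that idea (the Snapshot interface below is a stand-in, not CA's actual ClusterSnapshot API):

package sketch

import corev1 "k8s.io/api/core/v1"

// Snapshot is a hypothetical stand-in for CA's cluster snapshot abstraction.
type Snapshot interface {
    AddNodeWithPods(node *corev1.Node, pods []*corev1.Pod) error
}

// initSnapshot adds the node together with all of its pods, including the ones
// below the expendable-priority cutoff, so that simulated fit checks see the
// node's real utilization; whether a pod may actually be preempted would then
// have to be decided later, per pending pod.
func initSnapshot(s Snapshot, node *corev1.Node, podsOnNode []*corev1.Pod) error {
    return s.AddNodeWithPods(node, podsOnNode)
}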

@vadasambar (Member)

/assign vadasambar


@rishabh-11

@vadasambar any progress on this issue? Is it possible to give a timeline for this?


vadasambar commented Jan 4, 2024

@vadasambar any progress on this issue? Is it possible to give a timeline for this?

Sorry @rishabh-11, I've been bogged down by other things. I might find some time for this next week.

@unmarshall (Author)

/label cluster-autoscaler

@k8s-ci-robot (Contributor)

@unmarshall: The label(s) /label cluster-autoscaler cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label cluster-autoscaler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vadasambar (Member)

On a second look, I am not sure what CA can do here, or if this is a problem at all. You don't see the same behavior as kube-scheduler because of --expendable-pods-priority-cutoff. If you remove the flag or reduce the cutoff value, you should start seeing the behavior you expect. If we start considering exp-pod-1 in the simulation, it is no longer an expendable pod; you can make that happen by lowering the value of the cutoff flag.

Maybe a way to get around this would be to specify an annotation that tells CA a particular pod should not be considered expendable even if its priority is lower than the cutoff. It seems like --expendable-pods-priority-cutoff is too general and applies to all pods, whereas there are cases like this one where you do want certain pods to be treated as non-expendable even though their priority is below the cutoff. That being said, I do wonder whether this problem can't be solved just by reducing the cutoff value or increasing the pod priority from -15 to a value higher than -10.

Let me know what you think @unmarshall
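
For illustration, the annotation idea above might look roughly like the sketch below (the annotation key is made up; CA does not define such an annotation today):

package sketch

import corev1 "k8s.io/api/core/v1"

// nonExpendableAnnotation is a hypothetical opt-out annotation, not one that
// cluster-autoscaler actually recognizes.
const nonExpendableAnnotation = "cluster-autoscaler.kubernetes.io/never-expendable"

// isExpendable treats a pod as expendable only if its priority is below the
// cutoff and it has not opted out via the annotation.
func isExpendable(pod *corev1.Pod, cutoff int) bool {
    if pod.Annotations[nonExpendableAnnotation] == "true" {
        return false
    }
    return pod.Spec.Priority != nil && int(*pod.Spec.Priority) < cutoff
}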

@unmarshall (Author)

I think the real issue is not so much with --expendable-pods-priority-cutoff but with CA not looking at the preemptionPolicy defined in the priorityClass associated with a pod that is currently Unschedulable. Such a pod cannot preempt any pod currently deployed on a node, irrespective of the deployed pods' priorities. So while computing the available resources on existing nodes there is no point in removing pods that have a lower priority than --expendable-pods-priority-cutoff. If CA incorrectly removes such pods, it inflates the available capacity of an existing node and therefore does not scale up the node group, and the pending pod will remain unscheduled forever because kube-scheduler cannot evict any pod from any node on its behalf.

Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#non-preempting-priority-class

@BigDarkClown (Contributor)

I agree with @unmarshall: the real issue is that the pod priority on its own does not mean anything if the preemption policy is set to Never. The suggested solution would be to change the check to something like

pod.Spec.Priority != nil && int(*pod.Spec.Priority) < expendablePodsPriorityCutoff && preemptionPolicy(pod) != "Never"
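
Spelled out as compilable Go (with pointer handling and the corev1.PreemptNever constant filled in), that suggestion might look like the sketch below; this is one reading of it, not necessarily the change that eventually landed in #6577:

package sketch

import corev1 "k8s.io/api/core/v1"

// isExpendablePod mirrors the suggested condition: a pod only counts as
// expendable if its priority is below the cutoff and its preemption policy
// is not Never.
func isExpendablePod(pod *corev1.Pod, expendablePodsPriorityCutoff int) bool {
    lowPriority := pod.Spec.Priority != nil && int(*pod.Spec.Priority) < expendablePodsPriorityCutoff
    neverPreempts := pod.Spec.PreemptionPolicy != nil && *pod.Spec.PreemptionPolicy == corev1.PreemptNever
    return lowPriority && !neverPreempts
}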
