
Computation of expendable pods does not consider preemption policy #6227

Closed
unmarshall opened this issue Oct 26, 2023 · 13 comments · Fixed by #6577
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments


unmarshall commented Oct 26, 2023

Which component are you using?:
cluster-autoscaler

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
Server Version: version.Info{Major:"1", Minor:"26", GitVersion:"v1.26.7", GitCommit:"84e1fc493a47446df2e155e70fca768d2653a398", GitTreeState:"clean", BuildDate:"2023-07-19T12:16:45Z", GoVersion:"go1.20.6", Compiler:"gc", Platform:"linux/amd64"}

What did you expect to happen?:

If there is an unschedulable pod, CA attempts to determine whether this pod should really be considered unschedulable (as indicated by the kube-scheduler) or whether it can be scheduled onto an existing node. This check is done in every run of static_autoscaler.go by the podlistprocessor. To do this it runs a scheduler simulation, and one of the filter plugins that gets called is scheduler/framework/plugins/noderesources/fit.go. This is where the requests of the Pod are matched against the NodeInfo constructed by CA.
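
For intuition, the fit check boils down to comparing the pending pod's requests against the node's allocatable resources minus what the pods already present in CA's snapshot request. A rough sketch of that idea (illustrative only, not the actual fit.go or CA code; the helper names are made up):

package sketch

import (
    corev1 "k8s.io/api/core/v1"
    "k8s.io/apimachinery/pkg/api/resource"
)

// sumRequests adds up the container requests for a single resource across pods.
func sumRequests(pods []*corev1.Pod, name corev1.ResourceName) *resource.Quantity {
    total := resource.NewQuantity(0, resource.DecimalSI)
    for _, p := range pods {
        for _, c := range p.Spec.Containers {
            if q, ok := c.Resources.Requests[name]; ok {
                total.Add(q)
            }
        }
    }
    return total
}

// podFits reports whether the pending pod's CPU and memory requests fit into
// what is left on the node after accounting for the pods CA already placed in
// its snapshot (the real filter also handles init containers, overhead, etc.).
func podFits(pending *corev1.Pod, node *corev1.Node, podsOnNode []*corev1.Pod) bool {
    for _, res := range []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory} {
        free := node.Status.Allocatable[res].DeepCopy()
        free.Sub(*sumRequests(podsOnNode, res))
        if free.Cmp(*sumRequests([]*corev1.Pod{pending}, res)) < 0 {
            return false
        }
    }
    return true
}

If a deployed pod is missing from podsOnNode, the simulated free capacity is overstated by exactly that pod's requests.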

I have changed the names of the pod and node for easier reference below.

In our case kube-scheduler failed to schedule the pod with the following captured as part of the Pod status:

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-10-25T10:25:21Z"
    message: '0/13 nodes are available: 2 Insufficient memory, 2 node(s) had untolerated
      taint {taint-name: arm64}, 9 node(s) didn''t match Pod''s node affinity/selector.
      preemption: not eligible due to preemptionPolicy=Never..'
    reason: Unschedulable
    status: "False"
    type: PodScheduled
  phase: Pending

Whereas CA suggested that it is possible to schedule this pod onto an existing node.

I1026 03:55:42.730901       1 filter_out_schedulable.go:63] Filtering out schedulables
I1026 03:55:42.730914       1 hinting_simulator.go:63] Looking for place for default/test-pod
I1026 03:55:42.731025       1 hinting_simulator.go:77] Pod default/test-pod can be moved to test-node
I1026 03:55:42.731043       1 filter_out_schedulable.go:120] 1 pods marked as unschedulable can be scheduled.
I1026 03:55:42.731057       1 filter_out_schedulable.go:75] Schedulable pods present
I1026 03:55:42.731079       1 static_autoscaler.go:512] No unschedulable pods

The node test-node in question already had 18 pods deployed onto it, with the following allocated resources:

Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests            Limits
  --------           --------            ------
  cpu                3566m (45%)         17520m (221%)
  memory             129289570250 (96%)  130410Mi (102%)

The test-pod was requesting the following resources:

  requests:
     cpu: "2"
     memory: 18000Mi

This would never fit test-node, and kube-scheduler was correct in determining that. But CA was still suggesting that this pod could be scheduled onto test-node.
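
For a rough sense of the numbers (assuming the percentages above are relative to the node's allocatable memory): allocatable memory is about 130410Mi / 1.02 ≈ 127850Mi, of which roughly 123300Mi (96%) is already requested, leaving only about 4550Mi free, far short of the 18000Mi the pod asks for.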

When we looked closer at the pods populated in the ClusterSnapshot constructed by CA, we noticed that it listed only 17 of the 18 pods, and thus its computation of the total requested resources on the node was not accurate.

We found the deployed pod that was not listed by CA and therefore not considered in the computation of total requests on the node. This pod (let's call it exp-pod-1) had the following in its spec:

 preemptionPolicy: Never
 priority: -15

The cluster-autoscaler was started (among other flags) with:

--expendable-pods-priority-cutoff=-10

When we looked at expendable.go we noticed that, while computing the total requested resources on a node, CA filters out pods which satisfy:

pod.Spec.Priority != nil && int(*pod.Spec.Priority) < expendablePodsPriorityCutoff

This is incomplete, as it should also consider the preemption policy of the unschedulable pods, which the kube-scheduler does take into account. Without that there will always be a difference between the decisions taken by CA and kube-scheduler, and in a case like this it results in no scale-up when one should have happened.
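
A minimal sketch of what taking the preemption policy into account could look like, assuming the decision is made per pending pod (illustrative names only, not the actual expendable.go code):

package sketch

import corev1 "k8s.io/api/core/v1"

// canPreempt reports whether the pending pod is allowed to preempt other pods.
func canPreempt(pending *corev1.Pod) bool {
    return pending.Spec.PreemptionPolicy == nil ||
        *pending.Spec.PreemptionPolicy != corev1.PreemptNever
}

// nonExpendablePods returns the pods whose requests must still be counted
// against the node when simulating whether the pending pod could fit there.
func nonExpendablePods(podsOnNode []*corev1.Pod, pending *corev1.Pod, cutoff int) []*corev1.Pod {
    var kept []*corev1.Pod
    for _, p := range podsOnNode {
        lowPriority := p.Spec.Priority != nil && int(*p.Spec.Priority) < cutoff
        // Only ignore a low-priority pod if the pending pod could actually
        // preempt it; a pod with preemptionPolicy: Never cannot evict anything.
        if lowPriority && canPreempt(pending) {
            continue
        }
        kept = append(kept, p)
    }
    return kept
}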

What happened instead?:
We expected a scale-up to happen, as test-pod was marked unschedulable by the kube-scheduler and the node that CA suggested for this pod did not have sufficient memory. Instead, no scale-up was triggered.

How to reproduce it (as minimally and precisely as possible):

  1. Start cluster-autoscaler with --expendable-pods-priority-cutoff=-10.
  2. Create a pod with:

     preemptionPolicy: Never
     priority: -15

     Make sure that this pod gets deployed onto a node.
  3. Create another pod with requests higher than what is available on this node. This pod should have preemptionPolicy: Never.

unmarshall added the kind/bug label on Oct 26, 2023
@unmarshall (Author)

A similar issue was reported earlier and was closed without any comments/activity.

@himanshu-kun

@vadasambar could you help here?

@vadasambar (Member)

👀

@vadasambar (Member)

Sounds like initializing the clusterSnapshot with both expendable and non-expendable pods might fix the issue. We might have to assess the impact of the change, though.
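
As a very rough illustration of that idea (the Snapshot interface below is a stand-in, not CA's actual ClusterSnapshot API):

package sketch

import corev1 "k8s.io/api/core/v1"

// Snapshot is a hypothetical stand-in for CA's cluster snapshot abstraction.
type Snapshot interface {
    AddNodeWithPods(node *corev1.Node, pods []*corev1.Pod) error
}

// initSnapshot adds the node together with all of its pods, including the ones
// below the expendable-priority cutoff, so that simulated fit checks see the
// node's real utilization; whether a pod may actually be preempted would then
// have to be decided later, per pending pod.
func initSnapshot(s Snapshot, node *corev1.Node, podsOnNode []*corev1.Pod) error {
    return s.AddNodeWithPods(node, podsOnNode)
}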

@vadasambar (Member)

/assign vadasambar


@rishabh-11

@vadasambar any progress on this issue? Is it possible to give a timeline for this?


vadasambar commented Jan 4, 2024

@vadasambar any progress on this issue? Is it possible to give a timeline for this?

Sorry @rishabh-11, I've been bogged down by other things. I might find some time for this next week.

@unmarshall (Author)

/label cluster-autoscaler

@k8s-ci-robot (Contributor)

@unmarshall: The label(s) /label cluster-autoscaler cannot be applied. These labels are supported: api-review, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, team/katacoda, refactor. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label cluster-autoscaler

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@vadasambar (Member)

On a second look, I am not sure what CA can do here, or if this is a problem at all. You don't see the same behavior as kube-scheduler because of --expendable-pods-priority-cutoff. If you remove the flag or reduce the cutoff value, you should start seeing the behavior you expect. If we start considering exp-pod-1 in the simulation, it is no longer an expendable pod; you can make that happen by lowering the value of the cutoff flag.

Maybe a way to get around this would be to specify an annotation that tells CA a particular pod should not be considered expendable even if its priority is lower than the cutoff. It seems like --expendable-pods-priority-cutoff is too general and applies to all pods, whereas there are cases like this one where you do want certain pods to be treated as non-expendable even though their priority is below the cutoff. That being said, I do wonder whether this problem can't be solved just by reducing the cutoff value or increasing the pod priority from -15 to a value higher than -10.

Let me know what you think @unmarshall
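
For illustration, the annotation idea above might look roughly like the sketch below (the annotation key is made up; CA does not define such an annotation today):

package sketch

import corev1 "k8s.io/api/core/v1"

// nonExpendableAnnotation is a hypothetical opt-out annotation, not one that
// cluster-autoscaler actually recognizes.
const nonExpendableAnnotation = "cluster-autoscaler.kubernetes.io/never-expendable"

// isExpendable treats a pod as expendable only if its priority is below the
// cutoff and it has not opted out via the annotation.
func isExpendable(pod *corev1.Pod, cutoff int) bool {
    if pod.Annotations[nonExpendableAnnotation] == "true" {
        return false
    }
    return pod.Spec.Priority != nil && int(*pod.Spec.Priority) < cutoff
}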

@unmarshall (Author)

I think the real issue is not so much with --expendable-pods-priority-cutoff but with CA not looking at the preemptionPolicy defined in the priorityClass associated with a pod that is currently Unschedulable. Such a pod cannot preempt any pod currently deployed on a node, irrespective of the deployed pods' priorities. So while computing the available resources on existing nodes there is no point in removing pods that have a lower priority than --expendable-pods-priority-cutoff. If CA incorrectly removes such pods, it inflates the available capacity of an existing node and therefore does not scale up the node group, and the pending pod will remain unscheduled forever because kube-scheduler cannot evict any pod from any node on its behalf.

Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/pod-priority-preemption/#non-preempting-priority-class

@BigDarkClown (Contributor)

I agree with @unmarshall: the real issue is that the pod priority on its own does not mean anything if the preemption policy is set to Never. The suggested solution would be to change the check to something like

pod.Spec.Priority != nil && int(*pod.Spec.Priority) < expendablePodsPriorityCutoff && preemptionPolicy(pod) != "Never"
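
Spelled out as compilable Go (with pointer handling and the corev1.PreemptNever constant filled in), that suggestion might look like the sketch below; this is one reading of it, not necessarily the change that eventually landed in #6577:

package sketch

import corev1 "k8s.io/api/core/v1"

// isExpendablePod mirrors the suggested condition: a pod only counts as
// expendable if its priority is below the cutoff and its preemption policy
// is not Never.
func isExpendablePod(pod *corev1.Pod, expendablePodsPriorityCutoff int) bool {
    lowPriority := pod.Spec.Priority != nil && int(*pod.Spec.Priority) < expendablePodsPriorityCutoff
    neverPreempts := pod.Spec.PreemptionPolicy != nil && *pod.Spec.PreemptionPolicy == corev1.PreemptNever
    return lowPriority && !neverPreempts
}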
