Computation of expendable pods does not consider preemption policy #6227
Comments
A similar issue was reported earlier and was closed without any comments or activity.
@vadasambar could you help here?
👀
Sounds like initializing the clusterSnapshot with both expendable and non-expendable pods might fix the issue. We might have to assess the impact of that change, though.
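A very rough sketch of that idea, assuming a hypothetical snapshot interface (the type and method names below are illustrative, not the actual cluster-autoscaler API):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// snapshot stands in for cluster-autoscaler's ClusterSnapshot; this interface
// is an assumption made for illustration, not the real API.
type snapshot interface {
	AddNode(node *corev1.Node) error
	AddPod(pod *corev1.Pod, nodeName string) error
}

// initSnapshotWithAllPods adds every node and every scheduled pod, including
// expendable ones, so that the per-node requested resources in the snapshot
// match what the kube-scheduler actually sees.
func initSnapshotWithAllPods(s snapshot, nodes []*corev1.Node, pods []*corev1.Pod) error {
	for _, n := range nodes {
		if err := s.AddNode(n); err != nil {
			return err
		}
	}
	for _, p := range pods {
		if p.Spec.NodeName == "" {
			// Unscheduled pods are not bound to any node; they are handled
			// by the scale-up logic instead.
			continue
		}
		if err := s.AddPod(p, p.Spec.NodeName); err != nil {
			return err
		}
	}
	return nil
}
```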
/assign vadasambar
@vadasambar any progress on this issue? Is it possible to give a timeline for this?
Sorry @rishabh-11, I am bogged down by other things. I might find some time for this next week.
/label cluster-autoscaler
@unmarshall: The label(s) … In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
On a second look, I am not sure what CA can do here, or if this is a problem at all. Let me know what you think @unmarshall.
I think the real issue is not so much with …
I agree with @unmarshall: the real issue is that the pod priority on its own does not mean anything if the preemption policy is set to `Never`.
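To make that last point concrete, here is a minimal sketch, assuming a hypothetical helper (none of this is existing cluster-autoscaler code), of a check that only ignores a running low-priority pod's resources when the pending pod is actually allowed to preempt it:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// canDisplace reports whether a pending pod could free up the resources of a
// running low-priority pod. Priority alone is not sufficient: if the pending
// pod's preemptionPolicy is Never, the running pod's requests must still be
// counted against the node.
func canDisplace(pending, running *corev1.Pod, priorityCutoff int32) bool {
	if pending.Spec.PreemptionPolicy != nil &&
		*pending.Spec.PreemptionPolicy == corev1.PreemptNever {
		return false
	}
	return running.Spec.Priority != nil && *running.Spec.Priority < priorityCutoff
}
```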
Which component are you using?:
cluster-autoscaler
What k8s version are you using (`kubectl version`)?:

What did you expect to happen?:
If there is an unschedulable pod, CA attempts to determine whether this pod should be considered unschedulable (as indicated by the kube-scheduler) or whether it can be scheduled onto an existing node. This check is done in every run of `static_autoscaler.go` by the podlistprocessor. To do this it also runs a scheduler simulation, and one of the filter plugins that will be called is `scheduler/framework/plugins/noderesources/fit.go`. This is where the requests from the Pod are matched against the `NodeInfo` constructed by CA.

In our case the kube-scheduler failed to schedule the pod, with the following captured as part of the Pod status:

Whereas CA suggested that it is possible to schedule this pod onto an existing node.
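For context, the fit check in that simulation essentially compares the pending pod's requests against the node's allocatable capacity minus what is already requested by the pods CA placed into the `NodeInfo`. A simplified, single-resource sketch of that idea (this is not the actual fit.go code, which covers all resource types):

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// fitsMemory is a simplified stand-in for the NodeResourcesFit logic: the
// pending pod fits only if its memory request plus the memory requested by
// the pods already counted on the node does not exceed the node's allocatable
// memory. If an expendable pod is missing from scheduledPods, the sum is too
// low and CA wrongly concludes that the pending pod fits.
func fitsMemory(pending *corev1.Pod, allocatable corev1.ResourceList, scheduledPods []*corev1.Pod) bool {
	requested := resource.NewQuantity(0, resource.BinarySI)
	addRequests := func(p *corev1.Pod) {
		for _, c := range p.Spec.Containers {
			if mem, ok := c.Resources.Requests[corev1.ResourceMemory]; ok {
				requested.Add(mem)
			}
		}
	}
	for _, p := range scheduledPods {
		addRequests(p)
	}
	addRequests(pending)
	allocMem := allocatable[corev1.ResourceMemory]
	return requested.Cmp(allocMem) <= 0
}
```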
The node `test-node` in question already had 18 pods deployed onto it and had the following allocatable resources:

The `test-pod` was requesting the following resources:

This would never fit on `test-node`, and the kube-scheduler was correct in determining that. But CA was still suggesting that this pod could be scheduled onto `test-node`.
When we looked closer at the pods that were populated in the `ClusterSnapshot` constructed by CA, we noticed that it only listed 17 of the 18 pods, and thus its computation of the total requested resources on the node was not accurate. We found the deployed pod that was not listed/considered by CA in the computation of total requests on the node. This pod (let's call it `exp-pod-1`) had the following in its Spec:

The cluster-autoscaler was started (among other flags) with:
When we looked at expendable.go we noticed that, when computing the total requested resources on a node, CA filters out pods which satisfy the following condition:
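Paraphrased (this is not the verbatim expendable.go source), the condition is a priority-only test against the `--expendable-pods-priority-cutoff` value:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
)

// isExpendable paraphrases the check: a scheduled pod is dropped from the
// snapshot purely because its priority is below the configured cutoff. The
// preemption policy of the pending (unschedulable) pods never enters the
// decision.
func isExpendable(pod *corev1.Pod, expendablePodsPriorityCutoff int) bool {
	return pod.Spec.Priority != nil && int(*pod.Spec.Priority) < expendablePodsPriorityCutoff
}
```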
This is incomplete, as it should also consider the preemption policy of the unschedulable pods, which the kube-scheduler does take into account. Without that there will always be a difference between the decision taken by CA and the one taken by the kube-scheduler, and in this case it results in no scale-up when one should have happened.
What happened instead?:
We expected a scale-up to happen, as `test-pod` was marked as unschedulable by the kube-scheduler and the node that CA suggested for this pod did not have sufficient memory.

How to reproduce it (as minimally and precisely as possible):
1. Start cluster-autoscaler with `--expendable-pods-priority-cutoff=-10`.
2. Create a pod with a priority below this cutoff and make sure that this pod gets deployed onto a node.
3. Create another pod with requests higher than what is available on this node. This pod should have `preemptionPolicy: Never` (illustrative definitions of both pods follow below).
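As an illustration of these steps, this is roughly what the two pods could look like when defined programmatically; the names match the ones used above, but the priority, image and resource values are made up for the example:

```go
package sketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func int32Ptr(i int32) *int32 { return &i }

var preemptNever = corev1.PreemptNever

// expendablePod has a priority below the -10 cutoff, so CA drops it from its
// snapshot even though it is running on the node and consuming resources.
var expendablePod = corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "exp-pod-1", Namespace: "default"},
	Spec: corev1.PodSpec{
		Priority: int32Ptr(-100),
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "registry.k8s.io/pause:3.9",
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceMemory: resource.MustParse("2Gi"),
				},
			},
		}},
	},
}

// pendingPod requests more memory than is left on the node and is not allowed
// to preempt anything, so the kube-scheduler keeps it Pending while CA's
// snapshot (missing exp-pod-1) concludes that it fits.
var pendingPod = corev1.Pod{
	ObjectMeta: metav1.ObjectMeta{Name: "test-pod", Namespace: "default"},
	Spec: corev1.PodSpec{
		PreemptionPolicy: &preemptNever,
		Containers: []corev1.Container{{
			Name:  "app",
			Image: "registry.k8s.io/pause:3.9",
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceMemory: resource.MustParse("8Gi"),
				},
			},
		}},
	},
}
```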