
Pending pod triggers a new node instead of evicting a pod with lower priority #1410

Closed
mmingorance-dh opened this issue Nov 16, 2018 · 23 comments

Comments

@mmingorance-dh

mmingorance-dh commented Nov 16, 2018

Hi all,
Yesterday we started testing cluster-autoscaler with priorityClasses and podPriority so that we always have some extra capacity available in our cluster. However, whenever a new pod comes up and goes into the Pending state, it triggers a new node through cluster-autoscaler instead of replacing a pod from the "paused" deployment running with a lower priorityClass.

This is the configuration I added to my cluster:

kubeAPIServer:
  authorizationMode: RBAC
  authorizationRbacSuperUser: admin
  runtimeConfig:
    scheduling.k8s.io/v1alpha1: "true"
    admissionregistration.k8s.io/v1beta1: "true"
    autoscaling/v2beta1: "true"
  featureGates:
    PodPriority: "true"
  admissionControl:
    - Priority
    - NamespaceLifecycle
    - LimitRanger
    - ServiceAccount
    - PersistentVolumeLabel
    - DefaultStorageClass
    - ResourceQuota
    - DefaultTolerationSeconds

kubelet:
  featureGates:
    PodPriority: "true"

kubeControllerManager:
  horizontalPodAutoscalerDownscaleDelay: 1h0m0s
  horizontalPodAutoscalerUseRestClients: true

As far as I could see on my masters, these features seem to be enabled:
--feature-gates=PodPriority=true
--enable-admission-plugins=Priority,NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds,ValidatingAdmissionWebhook,NodeRestriction,ResourceQuota

Is there anything else that I'm missing in my config?
The overscaling deployment is the same one you can find in the cluster-autoscaler FAQ.
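
For context, here is a minimal sketch of what such an overprovisioning deployment typically looks like, loosely based on the cluster-autoscaler FAQ example; the name, replica count and resource requests are illustrative, not the reporter's actual values, and it assumes a PriorityClass named "overprovisioning" with a negative value already exists:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 10
  selector:
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      # These pods only reserve capacity; they are meant to be preempted
      # as soon as a higher-priority pod goes Pending.
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: "1"
            memory: 1Gi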

@losipiuk
Contributor

Thanks for reporting that.

Are the "paused" pods actually being created?
Please also ensure that there is a single node running enough lower-priority "paused" pods that preempting them would free enough resources to schedule the "workload" pod.
If both are true, I'd guess the problem is with the scheduler configuration. If not, we need to fix that first :)

@mmingorance-dh
Author

mmingorance-dh commented Nov 16, 2018

Thanks for helping.

The "paused" pods are getting created and they take over of 6000m CPU and 10000Mi Memory each.
I run 10 pods in the deployment, so I guess that should be enough space for any other pod of our private applications to be created.

By the way, we are using kops 1.10 and cluster-autoscaler 1.3.0

@losipiuk
Contributor

so I guess that should be enough space for any other pod of our private applications to be created.

I don't know :) It depends on how big the resource requests of the applications are.
Also, just to be sure: do the application pods have their priority set to the default (0)?

Also, could you please check whether nominatedNodeName is set in the status of the "workload" pods?
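
For example, one way to check that (the pod name below is just a placeholder):

kubectl get pod <workload-pod> -o jsonpath='{.status.nominatedNodeName}'

If preemption kicked in, this prints the node the scheduler nominated the pod for; if it prints nothing, no preemption decision was made for that pod.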

@losipiuk
Contributor

I am not sure if that is possible in your setup, but you may also try running the scheduler with log verbosity set to at least 3 and look for preemption-related log messages.
That way we could rule out the possibility that preemption is not enabled at all.
If that is the case, you should see "Pod priority feature is not enabled or preemption is disabled by scheduler configuration." in the logs.
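
On a kops-managed cluster, raising the scheduler verbosity could presumably be done via the cluster spec; a sketch, assuming kops' logLevel field is applied to the kube-scheduler flags (worth double-checking for your kops version):

kubeScheduler:
  logLevel: 3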

Btw. Which version of k8s and CA are you using?

@mmingorance-dh
Author

We are using kops 1.10 (with Kubernetes 1.10.6) and cluster-autoscaler 1.3.0.
I found out that in Kops you can also enable the following field:

kubeScheduler:
  featureGates:
    PodPriority: "true"

Maybe that's the reason why my "paused" pods are not being rescheduled. I will enable this field and let you know.

@mmingorance-dh
Author

Regarding the application pods' priority, we have defined a default priorityClass in our cluster, so all new pods get this class by default.
I also read in the Kubernetes documentation that when podPriority is enabled, all existing pods automatically get priority 0.
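
For illustration, a globally-default priority class of that kind might look roughly like this (the name and description are assumptions, not necessarily what is deployed in the reporter's cluster; the API group matches the v1alpha1 runtimeConfig enabled above):

apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: default
# Pods that do not specify a priorityClassName get this class because globalDefault is true.
value: 0
globalDefault: true
description: "Default priority class for application pods"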

@mmingorance-dh
Author

mmingorance-dh commented Nov 19, 2018

I found out how to make it work. There are a couple more parameters to set than what the cluster-autoscaler documentation describes.
This is the right configuration for this:

kubeAPIServer:
  runtimeConfig:
    scheduling.k8s.io/v1alpha1: "true"
    admissionregistration.k8s.io/v1beta1: "true"
    autoscaling/v2beta1: "true"
  admissionControl:
    - Priority
  featureGates:
    PodPriority: "true"
kubelet:
  featureGates:
    PodPriority: "true"
kubeScheduler:
  featureGates:
    PodPriority: "true"
kubeControllerManager:
  horizontalPodAutoscalerUseRestClients: true
  featureGates:
    PodPriority: "true"

That will enable podPriority and Preemption in your cluster.
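
As a quick sanity check afterwards (the pod name is a placeholder), newly created pods should carry a resolved priority and priority class:

kubectl get pod <some-pod> -o jsonpath='{.spec.priorityClassName}{" "}{.spec.priority}{"\n"}'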

Thank you for helping me!

aleksandra-malinowska added a commit that referenced this issue Nov 22, 2018
Link AWS kops setup instructions from #1410
@aleksandra-malinowska
Contributor

@mmingorance-dh thanks for posting the solution!

@aarongorka

@mmingorance-dh there are a few typos in your YAML (missing colon, duplicated keys, and inconsistent case); it should be:

kubeAPIServer:
  runtimeConfig:
    scheduling.k8s.io/v1alpha1: "true"
    admissionregistration.k8s.io/v1beta1: "true"
    autoscaling/v2beta1: "true"
  admissionControl:
    - Priority
  featureGates:
    PodPriority: "true"
kubelet:
  featureGates:
    PodPriority: "true"
kubeScheduler:
  featureGates:
    PodPriority: "true"
kubeControllerManager:
  horizontalPodAutoscalerUseRestClients: true
  featureGates:
    PodPriority: "true"

@mmingorance-dh
Author

@aarongorka thanks for catching that. I just updated my comment as well.

@mmingorance-dh
Author

Updated.
Thanks @aarongorka

@njfix6

njfix6 commented Mar 11, 2019

@mmingorance-dh Which config file do I set that config in? The kops config or the cluster-autoscaler config?

@mmingorance-dh
Author

@njfix6 in the Kops cluster config directly.

@njfix6

njfix6 commented Mar 12, 2019

Ok cool sounds good. Is there a plan to enable this by default in 1.12 or 1.13? It would be really nice.

@mmingorance-dh
Author

It's already enabled by default as a beta feature on Kubernetes 1.12

@njfix6

njfix6 commented Mar 12, 2019 via email

@mmingorance-dh
Author

You're welcome.
There is a Helm chart available to overprovision the cluster and create a default priority class here: https://github.com/helm/charts/tree/master/stable/cluster-overprovisioner

Give it a try!

@linecolumn

linecolumn commented Jun 26, 2020

@mmingorance-dh I tried to install this Helm chart on a cluster created with kops-1.18.0-beta1, but I didn't put any of the changes listed above into the kops configuration file. It is not working -- nothing happens and no pause containers are created.

What is the status of that snippet with the latest kops versions? Do we still need it?

@mmingorance-dh
Author

@linecolumn you shouldn't need that snippet anymore. That configuration is only required on Kubernetes clusters running version 1.11 or earlier; priority and preemption are enabled by default starting with Kubernetes 1.12. This means the chart should work out of the box.

@linecolumn

Something is not okay when deploying the default Helm chart on a fresh cluster.

No pods are created. Any ideas how to debug this chart?

$ helm --debug install overprovisioner stable/cluster-overprovisioner
install.go:159: [debug] Original chart version: ""

client.go:108: [debug] creating 2 resource(s)
NAME: overprovisioner
LAST DEPLOYED: Fri Jun 26 15:49:25 2020
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
USER-SUPPLIED VALUES:
{}

COMPUTED VALUES:
deployments: []
fullnameOverride: ""
image:
  pullPolicy: IfNotPresent
  repository: k8s.gcr.io/pause
  tag: 3.1
nameOverride: ""
priorityClassDefault:
  enabled: true
  name: default
  value: 0
priorityClassOverprovision:
  name: overprovisioning
  value: -1

HOOKS:
MANIFEST:
---
# Source: cluster-overprovisioner/templates/priorityclass-default.yaml
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: default
  labels:
    app.kubernetes.io/name: cluster-overprovisioner
    helm.sh/chart: cluster-overprovisioner-0.3.0
    app.kubernetes.io/instance: overprovisioner
    app.kubernetes.io/managed-by: Helm
value: 0
globalDefault: true
description: "Default priority class for all pods"
---
# Source: cluster-overprovisioner/templates/priorityclass-overprovision.yaml
apiVersion: scheduling.k8s.io/v1beta1
kind: PriorityClass
metadata:
  name: overprovisioning
  labels:
    app.kubernetes.io/name: cluster-overprovisioner
    helm.sh/chart: cluster-overprovisioner-0.3.0
    app.kubernetes.io/instance: overprovisioner
    app.kubernetes.io/managed-by: Helm
value: -1
globalDefault: false
description: "Priority class used for overprovision pods"

NOTES:
To verify that the cluster-overprovisioner pods have started, run:

  kubectl --namespace=default get pods -l "app.kubernetes.io/name=cluster-overprovisioner,app.kubernetes.io/instance=overprovisioner"


$ kubectl --namespace=default get pods -l "app.kubernetes.io/name=cluster-overprovisioner,app.kubernetes.io/instance=overprovisioner"
No resources found in default namespace.

@mmingorance-dh
Author

mmingorance-dh commented Jun 26, 2020

@linecolumn The deployments value is empty:

COMPUTED VALUES:
deployments: []

This means no deployment is being created. You can see that there actually isn't any deployment created by the chart, only the 2 priority classes.

Please create a deployment following this example: https://github.com/helm/charts/blob/master/stable/cluster-overprovisioner/ci/additional-deploys-values.yaml

@linecolumn

@mmingorance-dh thank you! When I put:

deployments:
  - name: extranode-filler
    annotations: {}
    replicaCount: 1
    nodeSelector: {}
    resources: {}
    tolerations: []
    affinity: {}
    labels: {}

the deployment and pod are up:

$ kubectl --namespace=default get pods -l "app.kubernetes.io/name=cluster-overprovisioner,app.kubernetes.io/instance=overprovisioner"
NAME                                                              READY   STATUS    RESTARTS   AGE
overprovisioner-cluster-overprovisioner-extranode-filler-5xwcf8   1/1     Running   0          17m

But I fail to understand how this can be utilized, because cluster-autoscaler keeps reporting static_autoscaler.go:389] No unschedulable pods, and thus CA is not creating new nodes.

Should I increase the number of replicas and/or set some resource requests so that it triggers new node creation?

@mmingorance-dh
Author

@linecolumn The way you can overprovision a cluster with this chart is by taking advantage of the podPriority and preemption features of Kubernetes.
If you check the chart, you'll see it creates 2 PriorityClass resources. One of them has priority 0 and becomes the default in the cluster. The other one has priority -1 and is assigned to the overprovisioner pods.

This way, those pods are running and occupying space in the cluster, and every time a new pod with a higher priority is created and goes Pending, one of the overprovisioner pods is evicted and its space is given to the pending pod with the higher priority.
The evicted overprovisioner pod then goes Pending itself and forces cluster-autoscaler to launch a new node so that it can fit in the cluster again.
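
To make the reserved capacity meaningful (which also answers the replicas/resources question above), the overprovisioner deployment needs real resource requests. Here is a sketch of chart values extending the example already posted in this thread; the replica count and sizes are illustrative assumptions, not recommendations:

deployments:
  - name: extranode-filler
    replicaCount: 3
    resources:
      requests:
        # Each pause pod holds this much capacity until a higher-priority pod preempts it.
        cpu: "1"
        memory: 1Gi
    annotations: {}
    nodeSelector: {}
    tolerations: []
    affinity: {}
    labels: {}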

Read: https://tech.deliveryhero.com/dynamically-overscaling-a-kubernetes-cluster-with-cluster-autoscaler-and-pod-priority/
