Add user docs for pod priority and preemption #5328

Merged
merged 3 commits into from
Sep 13, 2017
Add user docs for pod priority and preemption
bsalamat committed Sep 13, 2017
commit 4036aaee1f6e3f313ec95d8948bf074714788035
1 change: 1 addition & 0 deletions _data/concepts.yml
@@ -61,6 +61,7 @@ toc:
- docs/concepts/configuration/taint-and-toleration.md
- docs/concepts/configuration/secret.md
- docs/concepts/configuration/organize-cluster-access-kubeconfig.md
- docs/concepts/configuration/pod-priority-preemption.md

- title: Services, Load Balancing, and Networking
section:
209 changes: 209 additions & 0 deletions docs/concepts/configuration/pod-priority-preemption.md
@@ -0,0 +1,209 @@
---
approvers:
- davidopp
- wojtek-t
title: Pod Priority and Preemption (Alpha)
---

[Pods](/docs/user-guide/pods) in Kubernetes 1.8 and later can have priority. Priority
Member

Actually, wasn't priority itself added in 1.7?

Member

I think no -- kubernetes/kubernetes#48377 was merged Jul 19.

Member Author

David is right. We didn't have pod priority in 1.7.

indicates the importance of a pod relative to other pods. When a pod cannot be scheduled, the scheduler tries
to preempt (evict) lower-priority pods in order to make scheduling of the pending pod possible.
In a future Kubernetes release, priority will also affect out-of-resource eviction ordering on the node.

Note that preemption does not respect PodDisruptionBudget; see
[the limitations section](#poddisruptionbudget-is-not-supported) for more details.

* TOC
Member

When I clicked view - something is not generating correctly for it.

Member Author

Yes, this file is not displayed correctly by clicking github's "view" button. You should follow the link that k8sio-netlify-preview-bot leaves on the PR to see the generated page.

{:toc}

## How to use it
In order to use priority and preemption in Kubernetes 1.8, you should follow these
steps:

1. Enable the feature.
1. Add one or more PriorityClasses.
1. Create pods with `PriorityClassName` set to one of the added PriorityClasses.
Member

I would add to the end of this something like "(Of course you do not need to create the pods directly; normally you would add PriorityClassName to the pod template of whatever set object is managing your pods, for example a Deployment.)"

Member Author

Done

(Of course you do not need to create the pods directly; normally you would add
`PriorityClassName` to the pod template of the collection object managing your
pods, for example a Deployment.)

The following sections provide more information about these steps.

## Enable Priority and Preemption
Contributor

Can you please also mention that we should enable admission controller plugin here?

Contributor

Correct, there has to be an --admission-control=...,Priority command-line option.

Contributor

Yes, but we should always have ResourceQuota as last one.

Member Author

The admission controller is already enabled, but checks the feature gate. Enabling feature gate should be enough to activate the admission controller as well.

Contributor

Also need --runtime-config=scheduling.k8s.io/v1alpha1=true, right? This could have worked because we are enabling this alpha API by default in current code base. However, the convention was to have all alpha APIs disabled by default, IIRC.

Member Author

@tengqm You are probably right. I added it to the doc. I will build K8s and try it to make sure. Thanks for pointing it out.

@gyliu513 The admission controller is already added to the list of default admission controllers, if that's what you mean.

Contributor

@bsalamat can you please show me where we added it to the list of default admission controllers? I did not find the code where it was added. Am I missing anything?

Member Author

In this PR: https://github.com/kubernetes/kubernetes/pull/49322/files
I am not sure if it was the right thing to do given the downgrade issue we are observing.

Contributor

Ah, I see, thanks @bsalamat , I think that when downgrade, the downgrade script should have some logic to delete this admission control plugin.

Member Author

That's what I would expect as well, but I am not sure if the downgrade process actually removes them, otherwise #52226 shouldn't have happened.

Pod priority and preemption are disabled by default in Kubernetes 1.8 because the feature is
__alpha__. To enable it, pass the following command-line flag to both the API server and the scheduler:

```
--feature-gates=PodPriority=true
Contributor

Can you please also specify which components need this parameter? I think we should enable it for both the scheduler and the API server.

Contributor

isn't it the default that feature-gates are shared across all master components?

Member

No matter what it is, I agree it should be mentioned explicitly.

Member

+1 for specifying which components it needs to be specified on. Since the flag is used in the API server and scheduler, I assume those are the components, as @gyliu513 says.

Member Author

Done.

```
and also the following command-line flag for the API server:
```
--runtime-config=scheduling.k8s.io/v1alpha1=true
```

Once enabled, you can add [PriorityClasses](#priorityclass) and create pods with [`PriorityClassName`](#pod-priority) set.
If you try the feature and then decide to disable it, you must remove this command-line flag or
set it to false, and then restart the API server and scheduler. Once the feature is disabled, existing
pods keep their priority fields, but preemption is disabled, priority
fields are ignored, and you cannot set PriorityClassName in new pods.

**Note:** Alpha features should not be used in production systems! Alpha
features are more likely to have bugs and future changes to them are not guaranteed to
be backward compatible.
Member

not guaranteed to be backward compatible.

Do we have doc on the scope of alpha, beta and GA?

Contributor

em... maybe you are referring to this? https://kubernetes.io/docs/reference/deprecation-policy/

Member Author

@k82cn This doc tries to state the features/improvements planned for Beta, but we don't have any other doc at this point.


## PriorityClass
PriorityClass is a non-namespaced object that defines a mapping from a priority
class name (the `name` field of the PriorityClass object's metadata)
to the integer value of the priority. The higher the value, the higher the
priority. The value is specified in the required `value` field. PriorityClass
objects can have any 32-bit integer value smaller than or equal to 1 billion. Larger
Member

if bigger than 1 billion, will reject or ignore?

Member Author

Priority admission controller rejects them.

numbers are reserved for critical system pods that should not normally be preempted or
evicted. A cluster admin should create one PriorityClass object for each such
mapping that they want.

PriorityClass also has two optional fields: `globalDefault` and `description`.
`globalDefault` indicates that the value of this PriorityClass should be used for
pods without a `PriorityClassName`. Only one PriorityClass with `globalDefault`
set to true can exist in the system. If there is no PriorityClass with `globalDefault`
set, priority of pods with no `PriorityClassName` will be zero.

`description` is an arbitrary string. It is meant to tell users of the cluster
when they should use this PriorityClass.
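
The constraints on PriorityClass objects described above can be sketched as a small validation routine. This is only an illustration in Python, not the actual admission controller code: the function name, its parameters, and the error strings are hypothetical, and rejecting a second `globalDefault` class is an assumption drawn from the statement that only one such class can exist.

```python
from typing import List

# Largest value a user-created PriorityClass may have (1 billion, per the text).
MAX_USER_PRIORITY = 1_000_000_000
# Lower bound of a signed 32-bit integer.
INT32_MIN = -2**31


def validate_priority_class(value: int,
                            global_default: bool,
                            cluster_has_global_default: bool) -> List[str]:
    """Return a list of validation errors; an empty list means the class is acceptable.

    Hypothetical helper: the real checks live in the API server.
    """
    errors = []
    if not (INT32_MIN <= value <= MAX_USER_PRIORITY):
        errors.append("value must be a 32-bit integer no larger than 1 billion")
    if global_default and cluster_has_global_default:
        # Assumption: a second globalDefault class is refused.
        errors.append("only one PriorityClass may set globalDefault to true")
    return errors
```

For example, `validate_priority_class(1000000, False, False)` returns an empty list, while a value of 2 billion yields one error.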


**Note 1:** If you upgrade your existing cluster and enable this feature, the priority
of your existing pods will be considered to be zero.
Member

Even though you mentioned it earlier, I would add to the end of this something like "(As mentioned earlier, until you explicitly enable the feature, you will not be able to create PriorityClass objects or set PriorityClassName on pods, and no preemption will happen.)"

Member Author

Is this really needed? It is easy for me to copy/paste, but shorter, more concise user docs are generally better. That's why I am a bit hesitant to repeat information.

Member

I missed "and enable this feature" so I think it's fine to leave as-is.


**Note 2:** Addition of a PriorityClass with `globalDefault` set to true does not
change priority of existing pods. The value of such PriorityClass will be used only
for pods created after the PriorityClass is added.

Member

Add something to this section about what happens if you delete a PriorityClass object. Also you should mention that if you submit a pod that has a PriorityClassName that doesn't have a corresponding PriorityClass, the pod will be rejected by the admission controller.

Member Author

Good points. Added the first one here and the second one to the "Pod Priority" section.

**Note 3:** If you delete a PriorityClass, existing pods that use the name of the
deleted priority class will remain unchanged, but you will not be able to create more pods
that use the name of the deleted priority class.

#### Example PriorityClass
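```yaml
# PriorityClass is served by the alpha scheduling API group enabled above, not by core v1.
apiVersion: scheduling.k8s.io/v1alpha1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "This priority class should be used for XYZ service pods only."
```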

## Pod Priority
Once you have one or more PriorityClasses, you can create pods which specify one
of those PriorityClass names in their spec. The Priority admission controller uses the
`priorityClassName` field to populate the integer value of the priority. If the priority
class is not found, the pod will be rejected.

The following YAML is an example of a pod configuration that uses the PriorityClass
created above. The Priority admission controller checks the spec and resolves the
priority of the pod to 1000000.


```yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  priorityClassName: high-priority
```
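
The resolution step that the Priority admission controller performs can be sketched roughly as follows. This is a simplified Python model under stated assumptions, not the controller's real code: `PriorityClass`, `PodRejected`, and `resolve_priority` are hypothetical names, and the class registry is modeled as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class PriorityClass:
    name: str
    value: int
    global_default: bool = False


class PodRejected(Exception):
    """Raised when a pod names a PriorityClass that does not exist."""


def resolve_priority(priority_class_name: Optional[str],
                     classes: Dict[str, PriorityClass]) -> int:
    """Map a pod's priorityClassName to an integer priority."""
    if priority_class_name:
        pc = classes.get(priority_class_name)
        if pc is None:
            # An unknown class name means the pod is rejected.
            raise PodRejected(f"no PriorityClass named {priority_class_name!r}")
        return pc.value
    # No class name: use the globalDefault class if one exists.
    for pc in classes.values():
        if pc.global_default:
            return pc.value
    # No globalDefault class either: the priority is zero.
    return 0
```

With only the `high-priority` class from the example above registered, the nginx pod resolves to 1000000 and a pod without a class name resolves to zero.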

## Preemption
When pods are created, they go to a queue and wait to be scheduled. The scheduler picks a pod
from the queue and tries to schedule it on a node. If no node is found that satisfies
all the specified requirements (predicates) of the pod, preemption logic is triggered
for the pending pod. Let's call the pending pod P.
Preemption logic tries to find a node where removal of one or more pods with lower priority
than P would enable P to schedule on that node. If such a node is found, one or more lower priority pods will
be deleted from the node. Once the pods are gone, P may be scheduled on the node.
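
The preemption search described above can be illustrated with a toy model. The sketch below is not the scheduler's algorithm: it assumes a single resource (CPU in millicores), ignores every other predicate, and evicts the lowest-priority pods first, which is a choice made for this example only. All names (`Pod`, `Node`, `pick_victims`, `preempt`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Pod:
    name: str
    priority: int
    cpu: int  # requested CPU in millicores (the only resource in this sketch)


@dataclass
class Node:
    name: str
    capacity: int
    pods: List[Pod] = field(default_factory=list)

    def free(self) -> int:
        return self.capacity - sum(p.cpu for p in self.pods)


def pick_victims(node: Node, pending: Pod) -> Optional[List[Pod]]:
    """Return lower-priority pods whose removal lets `pending` fit, or None."""
    lower = sorted((p for p in node.pods if p.priority < pending.priority),
                   key=lambda p: p.priority)
    # Would the pending pod fit if all lower-priority pods were removed?
    if node.free() + sum(p.cpu for p in lower) < pending.cpu:
        return None
    victims: List[Pod] = []
    freed = node.free()
    for p in lower:  # assumption: evict the lowest priority first
        if freed >= pending.cpu:
            break
        victims.append(p)
        freed += p.cpu
    return victims


def preempt(nodes: List[Node], pending: Pod) -> Optional[Tuple[Node, List[Pod]]]:
    """Return the first node where preemption makes room, with its victims."""
    for node in nodes:
        victims = pick_victims(node, pending)
        if victims is not None:
            return node, victims
    return None
```

Note that only pods with strictly lower priority than the pending pod are ever candidates for removal, and a node is skipped entirely when removing all of them still would not make room.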

### Limitations of Preemption (alpha version)

#### Starvation of Preempting Pod
When pods are preempted, the victims get their
[graceful termination period](https://kubernetes.io/docs/concepts/workloads/pods/pod/#termination-of-pods).
They have that much time to finish their work and exit. If they don't, they will be
killed. This graceful termination period creates a time gap between the point at which the
scheduler preempts pods and the time when the pending pod (P) can be scheduled on the node (N).
In the meantime, the scheduler keeps scheduling other pending pods.
As victims exit or get terminated, the scheduler tries to schedule pods in the pending
queue, and one or more of them may be considered and scheduled to N before the
scheduler considers scheduling P on N. In such a case, it is likely that
when all the victims exit, pod P won't fit on node N anymore. So, the scheduler will have to
preempt other pods on node N or another node to let P schedule. This scenario may
be repeated for the second and subsequent rounds of preemption, and P may not
get scheduled for a while. This scenario can cause problems in various clusters, but
is particularly problematic in clusters with a high pod creation rate.

We will address this problem in beta version of pod preemption. The solution
we plan to implement is [provided here](https://github.com/kubernetes/community/blob/master/contributors/design-proposals/pod-preemption.md#preemption-mechanics).

#### PodDisruptionBudget is not supported
[Pod Disruption Budget (PDB)](https://kubernetes.io/docs/concepts/workloads/pods/disruptions/)
allows application owners to limit the number of pods of a replicated application that
are down simultaneously from voluntary disruptions. However, the alpha version of
preemption does not respect PDB when choosing preemption victims.
We plan to add PDB support in beta, but even in beta respecting PDB will be best
effort. The scheduler will try to find victims whose
PDB won't be violated by preemption, but if no such victims are found, preemption
will still happen and lower priority pods will be removed despite their PDBs
being violated.

#### Inter-Pod Affinity on Lower Priority Pods
The current implementation of preemption considers a node for preemption only when
the answer to this question is positive: "If all the pods with lower priority than
the pending pod are removed from the node, can the pending pod be scheduled on
the node?"
Member

I would add this sentence:
(Note that preemption does not always remove all lower-priority pods, e.g. if the pending pod can be scheduled by removing fewer than all lower-priority pods, but this test must always pass for preemption to be considered on a node.)

Member Author

Done

(Note that preemption does not always remove all lower-priority pods, e.g. if the
pending pod can be scheduled by removing fewer than all lower-priority pods, but this
test must always pass for preemption to be considered on a node.)

If the answer is no, that node will not be considered for preemption. If the pending
pod has inter-pod affinity to one or more of those lower priority pods on the node, the
inter-pod affinity rule cannot be satisfied in the absence of the lower priority
pods and scheduler will find the pending pod infeasible on the node. As a result,
it will not try to preempt any pods on that node.
Member

I would say "it will not try to schedule the pod onto the node."

Member Author

Isn't the current sentence more accurate? We are talking about preemption here and scheduling will be the next step which may or may not happen on this node.

Scheduler will try to find other nodes for preemption and could possibly find another
one, but there is no guarantee that such a node will be found.
Contributor

Better move line 153-154 to around line 108 ? This may not be a problem specific to this affinity scenario.

Member Author @bsalamat, Sep 8, 2017

I don't think these lines blend well with the text at line 108.


We may address this issue in future versions, but we don't have a clear plan yet
(i.e. we will not consider it a blocker for Beta or GA). Part
of the reason is that finding the set of lower priority pods that satisfy all
inter-pod affinity rules is computationally expensive and adds substantial
complexity to the preemption logic. Besides, even if preemption keeps the lower
priority pods to satisfy inter-pod affinity, the lower priority pods may be preempted
later by other pods, which removes the benefits of having the complex logic of
respecting inter-pod affinity to lower priority pods.

Our recommended solution for this problem is to create inter-pod affinity only towards
equal or higher priority pods.
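
The effect of this limitation can be shown with a small model of the feasibility question quoted above. This is a sketch under strong assumptions: resources are ignored, affinity is reduced to "a named pod must be on the same node", and the names `Pod` and `considered_for_preemption` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Pod:
    name: str
    priority: int
    # Name of a pod that must run on the same node (simplified inter-pod affinity).
    affinity_to: Optional[str] = None


def considered_for_preemption(node_pods: List[Pod], pending: Pod) -> bool:
    """Answer: with all lower-priority pods removed, is the pending pod feasible?"""
    # Pods that would remain if every lower-priority pod were removed.
    remaining = [p for p in node_pods if p.priority >= pending.priority]
    if pending.affinity_to is None:
        return True  # resources are ignored in this sketch
    # The affinity rule holds only if its target survives the hypothetical removal.
    return any(p.name == pending.affinity_to for p in remaining)
```

In this model, a pending pod with affinity to a lower-priority pod never passes the test on that pod's node, while affinity to an equal or higher priority pod does, which is why the recommendation above works.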

#### Cross Node Preemption
When considering a node N for preemption in order to schedule a pending pod P,
P may become feasible on N only if pods on other nodes are preempted. For example, P may
have zone anti-affinity with some currently-running, lower-priority pod Q. P may not be
scheduled on Q's node even if it preempts Q, for example if P is larger than Q so
preempting Q does not free up enough space on Q's node and P is not high-priority enough
to preempt other pods on Q's node. But P might theoretically be able to schedule on
another node M by preempting Q and some pod(s) on M (preempting Q removes the
anti-affinity violation, and preempting pod(s) on M frees up space for P to schedule
there). The current preemption algorithm does not detect and execute such preemptions;
that is, when determining whether P can schedule onto N, it only considers preempting
pods on N.

We may consider adding cross node preemption in future versions if we find an
Member

Let's be explicit: "Fixing this will NOT be Beta or GA blocker".

Member Author

Again, I am a bit hesitant to expose users to our release planning in this kind of docs. I added a sentence to say that we cannot promise anything.

algorithm with reasonable performance, but we cannot promise anything at this point
(it will not be considered a blocker for Beta or GA).