
KEP-2400: Update swap KEP for 1.23 beta #2858

Merged: 4 commits, Sep 8, 2021
Update swap KEP for 1.23 beta
Fill out remaining beta PRR questions, add test plans
ehashman committed Sep 3, 2021
commit 20a8885bc6a5c815ec9965c66a6b202de7ad3686
2 changes: 2 additions & 0 deletions keps/prod-readiness/sig-node/2400.yaml
@@ -1,3 +1,5 @@
kep-number: 2400
alpha:
approver: "@deads2k"
beta:
approver: "@deads2k"
59 changes: 57 additions & 2 deletions keps/sig-node/2400-node-swap/README.md
@@ -401,8 +401,12 @@ For alpha:
and further development efforts.
- Focus should be on supported user stories as listed above.

Once this data is available, additional test plans should be added for the next
phase of graduation.
For beta:

- Add e2e tests that exercise all available swap configurations via the CRI.
- Add e2e tests that verify pod-level control of swap utilization.
- Add e2e tests that verify swap performance with pods using a tmpfs.
- Verify new system-reserved settings for swap memory.

### Graduation Criteria

@@ -587,13 +591,29 @@ Try to be as paranoid as possible - e.g., what if some components will restart
mid-rollout?
-->

If a new node with swap memory fails to come online, it will not impact any
running components.
Comment on lines +597 to +598

Contributor:

If someone does an in-place upgrade on a node (stopping kubelet, starting a new kubelet on the same server), can that fail? How?

If it could fail then an upgrade might for example take out the nodes where the control plane ought to be running as static pods.

Member Author:

The in-place upgrade would not fail unless swap access was added while the node was still online. Normally, turning swap on and off at runtime isn't considered best practice for a production environment; I'd expect a node to be reimaged and rebooted, but I can mention it.


It is possible that if a cluster administrator adds swap memory to an already
running node, and then performs an in-place upgrade, the new kubelet could fail
to start unless the configuration was modified to tolerate swap. However, we
would expect that if a cluster admin is adding swap to the node, they will also
update the kubelet's configuration to not fail with swap present.

Generally, it is considered best practice to add a swap memory partition at
node image/boot time and not provision it dynamically after a kubelet is
already running and reporting Ready on a node.
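
As a practical aside, an administrator can confirm whether swap is already provisioned on a node before (re)starting the kubelet. This is a minimal sketch, assuming a standard Linux node (`/proc/meminfo` layout); it is not part of the KEP itself:

```shell
# Read the node's total swap from /proc/meminfo (value is in kB).
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
echo "SwapTotal: ${swap_kb} kB"

if [ "${swap_kb:-0}" -gt 0 ]; then
    # Swap is provisioned: the kubelet must be configured to tolerate it,
    # or it will fail to start with its default settings.
    echo "swap is provisioned on this node"
else
    echo "no swap provisioned on this node"
fi
```

`swapon --show` gives the same answer interactively, listing any active swap devices or files.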

###### What specific metrics should inform a rollback?

<!--
What signals should users be paying attention to when the feature is young
that might indicate a serious problem?
-->

Workload churn or performance degradations on nodes. The metrics will be
application/use-case specific, but we can provide some suggestions.
Contributor:

This seems like the spot to provide those suggestions.

Member Author:

Yes, but we don't have them yet (comment below).


###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

<!--
@@ -602,12 +622,17 @@ Longer term, we may want to require automated upgrade/rollback tests, but we
are missing a bunch of machinery and tooling and can't do that now.
-->

N/A because swap support lacks a runtime upgrade/downgrade path; kubelet must
be restarted with or without swap support.

###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

<!--
Even if applying deprecation policies, they may still surprise some users.
-->

No.

### Monitoring Requirements

<!--
@@ -622,12 +647,21 @@ checking if there are objects with field X set) may be a last resort. Avoid
logs or events for this purpose.
-->

The node's KubeletConfiguration has `failOnSwap: false` set.
Contributor:

Can I tell two nodes apart:

- one has failOnSwap: false and memorySwap set to swapBehavior: LimitedSwap, with the NodeSwap feature gate enabled
- another has failOnSwap: false and the NodeSwap feature gate disabled

via the kubernetes API? If so, I'd mention how to distinguish them.

Member Author:

I'm not sure if that is bubbled up to the API Server. The purpose of this question is for beta, when the feature gate is defaulted on, so you can't rely on it being turned on as a sign that the feature is in use. We might be able to check if swapBehavior is explicitly set, but empty string is equivalent to LimitedSwap.

Realistically, this KEP iterates on the existing unsupported configuration with failOnSwap: false. Because it was previously unsupported, I am assuming here that a production environment would not have it set if it were not using this feature.

Contributor:

Once this is beta we should assume that people have this feature gate set to a value of their choice. For alpha it was different: you need to be a little more brave to try it and most clusters run with the default, ie feature enabled.

The switch from unsupported to “mostly supported, but it's still beta” is why I'm asking about observability.


The Prometheus `node_exporter` will also export stats on swap memory
utilization.
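
For concreteness, the configuration being discussed could look like the sketch below. This is illustrative only: the values are not a recommended production setting, and note that the field shipped in the kubelet's config API is spelled `failSwapOn` (the KEP prose writes it as `failOnSwap`):

```yaml
# Sketch of a KubeletConfiguration opting a node into the beta swap behavior.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false          # do not exit if swap is present on the node
featureGates:
  NodeSwap: true           # defaulted on in beta; shown explicitly here
memorySwap:
  swapBehavior: LimitedSwap  # empty string is equivalent to LimitedSwap
```

On the observability side, `node_exporter` metrics such as `node_memory_SwapTotal_bytes` and `node_memory_SwapFree_bytes` can be used to watch swap utilization on such a node.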

###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?

<!--
Pick one more of these and delete the rest.
-->

TBD. We will determine a set of metrics as part of beta graduation. We will
Contributor:

The beta criteria (including this metric) are listed as tentative. Can you commit to the beta criteria about metrics?

Member Author:

Figuring out what these metrics are is blocking for beta graduation, but we need end-user data/a release cycle of testing to determine them, which will happen during the dev cycle. The graduation of swap is a little different from a regular Kubernetes feature in that we need to be able to look at metrics node-wide.

This is why it might be a stretch to graduate this for beta this release, but I don't want to block the other required dev work for beta just because we don't have this list yet.

need more data; there is not a single metric or set of metrics that can be used
to generally quantify node performance.

- [ ] Metrics
- Metric name:
- [Optional] Aggregation method:
@@ -647,13 +681,17 @@ high level (needs more precise definitions) those may be things like:
- 99,9% of /health requests per day finish with 200 code
-->

N/A

###### Are there any missing metrics that would be useful to have to improve observability of this feature?

<!--
Describe the metrics themselves and the reasons why they weren't added (e.g., cost,
implementation difficulties, etc.).
-->

N/A

### Dependencies

<!--
@@ -784,6 +822,8 @@ details). For now, we leave it here.

###### How does this feature react if the API server and/or etcd is unavailable?

No change. Feature is specific to individual nodes.

###### What are other known failure modes?

<!--
@@ -799,8 +839,23 @@ For each of them, fill in the following information by copying the below template
- Testing: Are there any tests for failure mode? If not, describe why.
-->


Individual nodes with swap memory enabled may experience performance
degradations under load. This could potentially cause a cascading failure on
nodes without swap: if nodes with swap fail Ready checks, workloads may be
rescheduled en masse.

Thus, cluster administrators should be careful while enabling swap. To minimize
disruption, you may want to taint nodes with swap available to protect against
this problem. Taints will ensure that workloads which tolerate swap will not
spill onto nodes without swap under load.
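
Such a taint/toleration pairing could be sketched as below. The taint key `example.com/swap` is purely illustrative; the KEP does not define a standard key:

```yaml
# Node side (applied e.g. via:
#   kubectl taint nodes <node> example.com/swap=enabled:NoSchedule).
# Pod side: only workloads carrying this toleration can land on swap nodes.
tolerations:
- key: "example.com/swap"
  operator: "Equal"
  value: "enabled"
  effect: "NoSchedule"
```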

###### What steps should be taken if SLOs are not being met to determine the problem?

It is suggested that if nodes with swap memory enabled cause performance or
stability degradations, those nodes are cordoned, drained, and replaced with
nodes that do not use swap memory.

## Implementation History

- **2015-04-24:** Discussed in [#7294](https://github.com/kubernetes/kubernetes/issues/7294).
4 changes: 2 additions & 2 deletions keps/sig-node/2400-node-swap/kep.yaml
@@ -20,12 +20,12 @@ prr-approvers:
- "@deads2k"

# The target maturity stage in the current dev cycle for this KEP.
stage: alpha
stage: beta

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.22"
latest-milestone: "v1.23"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone: