Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sig-node/1029-ephemeral-storage-quotas/*: #4699

Merged
merged 1 commit into from
Jun 13, 2024

Conversation

PannagaRao
Copy link
Contributor

Modifications to push the changes to beta 1.31

KEP-1029: adding beta graduation criteria

  • One-line PR description: Adding Implementation Details
  • Other comments:

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 5, 2024
Copy link

linux-foundation-easycla bot commented Jun 5, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: PannagaRao / name: Pannaga Rao Bhoja Ramamanohara (2e6b66b)

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Jun 5, 2024
@k8s-ci-robot
Copy link
Contributor

Welcome @PannagaRao!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Jun 5, 2024
@k8s-ci-robot
Copy link
Contributor

Hi @PannagaRao. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 5, 2024
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jun 5, 2024
@haircommander
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 6, 2024
@haircommander haircommander mentioned this pull request Jun 6, 2024
22 tasks
@haircommander
Copy link
Contributor

per our slack convo

/cc @soltysh for PRR review

@k8s-ci-robot
Copy link
Contributor

@haircommander: GitHub didn't allow me to request PR reviews from the following users: for, PRR, review.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

per our slack convo

/cc @soltysh for PRR review

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot requested a review from soltysh June 7, 2024 14:11
@PannagaRao PannagaRao marked this pull request as ready for review June 7, 2024 19:37
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 7, 2024
@k8s-ci-robot k8s-ci-robot requested a review from mrunalp June 7, 2024 19:37
@@ -576,6 +579,59 @@ required elsewhere:
future allow adding additional data without having to change code
other than that which uses the new information.

#### Implementation details of using User Namespace to Toggle Ephemeral-Storage-Quotas
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to a reviewer: the approach described here means the kubelet may enforce du quotas and xfs quotas simultaneously, if pods without a user namespace have a limit

disk usage monitoring method.

* During container creation, there must be a check to verify
if xfs-quota in Kubelet is enabled. Additionally, verify if
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what we mean by xfs-quota in Kubelet check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to check whether or not LocalStorageCapacityIsolationFSQuotaMonitoring flag is enabled. Modified the description to add the same.

@@ -576,6 +579,71 @@ required elsewhere:
future allow adding additional data without having to change code
other than that which uses the new information.

#### Implementation details of using User Namespace to Toggle Ephemeral-Storage-Quotas

* The implementation centers around a feature toggle which
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we describe in more detail why we need to enable user namespaces?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the description. Please let me know if it sounds good


* The SIG raised the possibility of a container being unable to exit
should we enforce quotas, and the quota interferes with writing the
log. This can be mitigated by either not applying a quota to the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logs are written by a container monitor process to a directory which should't be under a quota so we need to clarify here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this piece was just copy pasted from below. we can iterate here but just noting this isn't new

Copy link
Contributor Author

@PannagaRao PannagaRao Jun 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enforcement of quota on log directory is the existing behavior
https://discuss.kubernetes.io/t/why-stdout-logs-are-accounted-in-ephemeral-storage-usage/27815

controls the integration of xfs-quota with user namespaces for
managing filesystem quotas within containers. The toggle allows
to switch between a quota management system and a traditional
disk usage monitoring method.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe where the toggle is and the entity responsible for setting/unsetting it?

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have missed, but I'm missing an explanation how a user can be notified if the functionality is not working for them, ideally that will be part of ###### How can someone using this feature know that it is working for their instance? question that's missing. I'm mostly interested in how you report back to a user that the requested quota won't work, due to unsupported filesystem, or others like that.

@@ -24,21 +24,23 @@
- [Future](#future)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're re-doing the beta PRR review, I'd update https://github.com/kubernetes/enhancements/blob/master/keps/prod-readiness/sig-node/1029.yaml#L5 as well with my name.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The template has changed significantly between then and now can you update to match https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md It looks like some of the current headers will need to be changed, some examples are:

  • ### Implementation Details/Notes/Constraints [optional] -> ### Notes/Constraints/Caveats (Optional)
  • the section ## Design Details seems empty, I'd probably move the bits from under #### Notes on Implementation which has all those details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done



## Design Details

### Test Plan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably worth updating the unit test coverage, they are 2 years old.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason why we no integration tests are planned?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kubelet doesn't really define many integration tests, going for unit and then node e2e instead

@@ -767,7 +801,7 @@ and are not reported here.
| du after umount/mount | 8.1 | 10.2 | 3.9 | 3.7 |
| Remove Files | 4.3 | 4.1 | 4.2 | 4.3 |

### System CPU Time
#### System CPU Time

| *Operation* | *XFS+Quota* | *XFS* | *Ext4fs+Quota* | *Ext4fs* |
|--------------------------|-----------|-------|--------------|--------|
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before the PRR section you're missing these 2:

  • ### Upgrade / Downgrade Strategy
  • ### Version Skew Strategy

Also in the ### Monitoring Requirements section can you update the questions to match the new template? Mostly it's just changing from * **<question... to ##### <question... and you're missing the following question:

  • ###### How can someone using this feature know that it is working for their instance?

Same problem with ### Scalability section, and missing question:

  • ###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

Missing answer for ###### How does this feature react if the API server and/or etcd is unavailable?.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more nits.

XFS quotas within user namespaces will not be supported


### Performance Benchmarks
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this entire section seems best suited in the testing section, I'd just move all of that after ##### e2e tests.

@@ -16,8 +16,8 @@ editor: TBD
creation-date: 2018-09-06
last-updated: 2023-08-31
status: implementable
latest-milestone: "1.29"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This document also needs update with the latest template https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/kep.yaml it's way out of date 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -840,65 +890,78 @@ If LocalStorageCapacityIsolationFSQuotaMonitoring is turned on but LocalStorageC

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In question ###### How can this feature be enabled / disabled in a live cluster? you only mentioned kubelet with LocalStorageCapacityIsolationFSQuotaMonitoring flag. But earlier in the #### Kubelet and API Server Skew: section you're talking about both kubelet and apiserver. So I'd expect this question to explicitly point out both feature gates. The same applies to the kep.yaml, I've pointed out it's out of date, and there's a separate section describing feature gates and components.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ###### How can a rollout or rollback fail? Can it impact already running workloads? you're saying that it won't affect already running workloads. What about the ones which enabled this functionality?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? you probably want to just say No, since you're not removing any fields, features, flags, or anything like that.


###### How can someone using this feature know that it is working for their instance?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit, this question should be a bit higher in the doc, right after ###### How can an operator determine if the feature is in use by workloads? 😉

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


- Yes, the feature depneds on project quotas. Once quotas are enabled, the xfs_quota tool can be used to
set limits and report on disk usage.
Yes, the feature depneds on project quotas. Once quotas are enabled, the xfs_quota tool can be used to
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Yes, the feature depneds on project quotas. Once quotas are enabled, the xfs_quota tool can be used to
Yes, the feature depends on project quotas. Once quotas are enabled, the xfs_quota tool can be used to

beta: "1.31"

feature-gates:
- name: LocalStorageCapacityIsolationFSQuotaMonitoring
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soltysh : @PannagaRao and I chatted about this and it feels more right just to include this one as it's net new for this feature. above in the enablement section we do mention UserNamespacesSupport as it's required, but I think it's more correct to just have this feature gate here. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, let's keep it that way.

Copy link
Contributor

@soltysh soltysh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve
for PRR

@haircommander
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2024
Modifications to push the changes to beta 1.31

Signed-off-by: PannagaRamamanohara <pbhojara@redhat.com>
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2024
Copy link
Contributor

@mrunalp mrunalp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 13, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mrunalp, PannagaRao, soltysh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 13, 2024
@k8s-ci-robot k8s-ci-robot merged commit d673b64 into kubernetes:master Jun 13, 2024
4 checks passed
@k8s-ci-robot k8s-ci-robot added this to the v1.31 milestone Jun 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

5 participants