
KEP-1710: Use mount to apply SELinux label to volumes #3172

Merged
merged 6 commits into from
Feb 2, 2022

Conversation


@jsafrane jsafrane commented Jan 21, 2022

  • One-line PR description: Update existing KEP to use mount -o context=XYZ to apply SELinux label to volumes

Other comments:

  • The existing KEP was turned down during API review. Reworking the KEP to expose fewer knobs to users and do the right thing out of the box. As a tradeoff, the KEP changes existing Kubernetes behavior in a corner case.
  • Targeting alpha in 1.24.

cc @gnufied @xing-yang @jingxu97 @msau42

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 21, 2022
@k8s-ci-robot k8s-ci-robot added kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory sig/storage Categorizes an issue or PR as relevant to SIG Storage. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 21, 2022
Propose usage of "mount -o context=XYZ" to speed up (remove) recursive
re-labeling of volumes during pod startup.
@ddebroy ddebroy left a comment

The KEP states runtime changes as a non-goal and calls out falling back to the default behavior (container-runtime-driven relabelling) when SELinux labels are not specified in the pod security context. Can any references/notes be added about existing infra/systems (say, one that uses a mutating webhook) for managing/applying SELinux labels for pods in a safe, cluster-wide fashion, so that all stateful pods in a cluster can take advantage of this? A couple of other comments inline, but overall this looks great!

@jsafrane jsafrane force-pushed the rework-selinux branch 6 times, most recently from e71b9d8 to 437a681 Compare January 24, 2022 16:56
@jsafrane jsafrane changed the title WIP: 1710: Use mount to apply SELinux label to volumes 1710: Use mount to apply SELinux label to volumes Jan 24, 2022
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 24, 2022
* kubelet does not pass any special SELinux option to the container runtime (explicitly, no `:Z`).
Files on the volume already have the right SELinux context, no recursive relabeling happens there.
* When a `PodSecurityContext` specifies an incomplete SELinux label (i.e. omits `SELinuxOptions.User`, `.Role`, or `.Type`), kubelet fills in the blanks from the system defaults provided by [ContainerLabels() from go-selinux bindings](https://github.com/opencontainers/selinux/blob/621ca21a5218df44259bf5d7a6ee70d720a661d5/go-selinux/selinux_linux.go#L770).
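The fill-the-blanks behavior can be sketched as follows. This is an illustrative sketch, not kubelet's actual code: the default values are hypothetical stand-ins for whatever the node's policy (via `ContainerLabels()` from go-selinux) would return.

```go
package main

import (
	"fmt"
	"strings"
)

// fillSELinuxLabel completes an SELinux label (user:role:type:level) by
// filling empty parts from defaults. The defaults below are hypothetical;
// kubelet takes the real ones from go-selinux's ContainerLabels().
func fillSELinuxLabel(user, role, typ, level string) string {
	if user == "" {
		user = "system_u" // assumed default
	}
	if role == "" {
		role = "system_r" // assumed default
	}
	if typ == "" {
		typ = "container_t" // assumed default
	}
	return strings.Join([]string{user, role, typ, level}, ":")
}

func main() {
	// A pod that sets only SELinuxOptions.Level; kubelet fills the rest.
	fmt.Println(fillSELinuxLabel("", "", "", "s0:c123,c456"))
}
```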

Member

Not a blocker - but for a casual reader of this KEP, can we add some examples of how this API will be used?

Member Author

Added a link to user story 2, which has an example.

* Second mount of the same volume fails with `mount point not mounted or bad option` when the `-o context=` does not match the first mount.
* A bind mount inherits `-o context=XYZ` option from the original mount (if it was set there).
* `/bin/mount` process must detect that the kernel supports SELinux. [It does so](https://github.com/util-linux/util-linux/blob/441f9b9303d015f1777aec7168807d58feacca31/libmount/src/context_mount.c#L291) by checking that both `/etc/selinux/config` file exists and `/sys/fs/selinux` is a mount point with "selinuxfs" type.
* `/bin/mount` throws away the SELinux mount options otherwise.
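The two conditions of that detection can be sketched as follows, with the file-existence check and the mount table passed in as parameters so each condition is explicit. This is a simplified illustration of the util-linux logic linked above, not its actual code:

```go
package main

import (
	"fmt"
	"strings"
)

// selinuxDetected mirrors the util-linux check described above:
// /etc/selinux/config must exist AND /sys/fs/selinux must be mounted with
// filesystem type "selinuxfs". configExists stands in for the file check;
// mounts is /proc/self/mounts-style content.
func selinuxDetected(configExists bool, mounts string) bool {
	if !configExists {
		return false
	}
	for _, line := range strings.Split(mounts, "\n") {
		f := strings.Fields(line)
		// /proc/self/mounts fields: device mountpoint fstype options dump pass
		if len(f) >= 3 && f[1] == "/sys/fs/selinux" && f[2] == "selinuxfs" {
			return true
		}
	}
	return false
}

func main() {
	mounts := "selinuxfs /sys/fs/selinux selinuxfs rw,relatime 0 0"
	fmt.Println(selinuxDetected(true, mounts))  // both checks pass
	fmt.Println(selinuxDetected(false, mounts)) // no /etc/selinux/config
}
```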
Member

Is there some error detection or something else for when mount silently throws away SELinux mount options? I think that can happen if the driver pod is misconfigured or something else, IIRC (such as /sys not being exposed inside the driver pod).

Also - what happens when mount silently throws away the mount options? Do we fall back to the old behaviour or try to run the pod without any SELinux labels (via mount or otherwise)?

Member Author

There is no indication that /bin/mount drops SELinux mount options. See util-linux code. It's quite unfortunate.

Do we fall back to the old behaviour or try to run the pod without any SELinux labels (via mount or otherwise)?

That's a good question. For CSI, the driver did advertise that it supports `-o context`, so it's the driver's fault if it did not apply the context. We can't check mount options in kubelet, because not all CSI drivers mount, or they may mount in a weird way. We could check the context of the root of the volume and fail if it has an unexpected context.

Member

We could check the context of the root of the volume and fail if it has an unexpected context.

I like this option if we can manage it. Can we add this detail to the KEP?

Member Author

Added a note to detailed design.

* Kubelet's VolumeManager needs to track which SELinux label each volume should get in its global mount (to call `MountDevice()` with the right mount options).
* It must call `UnmountDevice()` even when another pod wants to re-use a mounted volume, but it has a different SELinux context.
* While tracking SELinux labels of volumes, it can emit metrics suggested below.
* Volume plugins will get SELinux context as a new parameter of `MountDevice` and `SetUp`/`SetupAt` calls (resp. as a new field in `DeviceMounterArgs` / `MounterArgs`).
Member

Are we planning to track selinux context for each setup call also?

Member Author

`SetUp()` is called when (bind-)mounting a volume into the pod directory. This mount is not shared among pods, so I think it's safe not to track it.

@@ -403,7 +389,7 @@ _This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
- [X] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: SELinuxRelabelPolicy
- Feature gate name: `SELinuxMountReadWriteOncePod`
Member

Is this the final feature gate, or just the one used while the feature works only for ReadWriteOncePod pods?

@jsafrane jsafrane Jan 25, 2022

Yes, it's the final name for this feature gate; we need another one for all volumes, called `SELinuxMount` below.

// This is typical for volumes that represent subdirectories of a bigger shared filesystem.
//
// Default is "false".
SELinuxMountSupported *bool
Contributor

This is a setting at CSI driver level. Do we need something that has a finer granularity? For example, what if a CSI driver supports both block and file share volumes and it is true for block but false for file share volumes?

Member Author

We would need a CSI change for this - the driver would need to advertise which volumes support SELinux mount options and which do not. Kubelet cannot know that. I tried to avoid CSI changes here.

Member

A single CSI driver supporting multiple volume types is not well supported today. I think a CSI driver will have to filter out the SELinux mount option in the NodeStage/Publish call.

Member Author

If we expect CSI drivers to filter SELinux options, then I'd prefer to have an explicit field in CSI NodeStage/NodePublish for that, so they know what to filter.

Do we really expect that? I am afraid that most CSI drivers won't care about SELinux at all, as it's used in a minority of Kubernetes distros.

Member

I think in the on-prem world there are more users that care about running on RHEL with SELinux. Maybe leave this as an open item to resolve before beta?


* Kubernetes checks that `SELinuxRelabelPolicy` field can be used in a pod only when at least `SELinuxOptions.Level` is set.
* *Before this KEP*: Pod A suddenly starts getting "permission denied" errors when accessing files on the volume, because the container runtime re-labeled all files on it with label Y when starting pod B. Pod B will start just fine and can access the volume.
* *As proposed in this KEP*: Pod B won't even start, because the volume is already mounted with `-o context=X`. When kubelet tries to mount the same volume with `-o context=Y`, the mount fails. Pod B will be `ContainerCreating` until Pod A is deleted and its volumes unmounted.
Contributor

Will we have events/messages indicating why Pod B stays in ContainerCreating?

Member Author

The message is a generic `mount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error`. `/bin/mount` does not tell which mount option is the bad one.

Added a note about it.

@deads2k deads2k Jan 31, 2022

The message is a generic `mount: wrong fs type, bad option, bad superblock on /dev/sdb, missing codepage or helper program, or other error`. `/bin/mount` does not tell which mount option is the bad one.

Can external logic on a node look at the mount that is failing and determine if it is being mounted twice with different -o context=foo options? That message is really hard for laymen to interpret and doesn't give much hint to the pod author.

This will come up in the beta PRR review.

Member Author

I added some wording around it - kubelet will track its mounts and with what context they were set up, so it may throw a nice error like `volume X is already used by pod Y with another SELinux context`.

When a Pod specifies an incomplete SELinux label (i.e. omits `SELinuxOptions.User`, `.Role`, or `.Type`), kubelet fills in the blanks from the system defaults provided by [ContainerLabels() from go-selinux bindings](https://github.com/opencontainers/selinux/blob/621ca21a5218df44259bf5d7a6ee70d720a661d5/go-selinux/selinux_linux.go#L770).
A special case of the previous example is when two pods with different SELinux contexts use the same volume, but different subpaths of it.
The container runtime then re-labels only these subpaths and as long as the subpaths are different, both pods can run today.
**This will not be possible with this KEP** - while the container runtime operates on subpaths, kubelet always mounts the whole volume. The first mount with `-o context=X` will succeed, the second with `-o context=Y` will fail. The second Pod will be `ContainerCreating` as described above.
Contributor

Will we have events/messages indicating why Pod B stays in ContainerCreating in this case?

Member Author

Same as above, mount will fail, most probably with a very generic message.

Member Author

Added a better message above when kubelet knows that the SELinux contexts are wrong.

@jsafrane jsafrane changed the title 1710: Use mount to apply SELinux label to volumes KEP-1710: Use mount to apply SELinux label to volumes Jan 25, 2022
Comment on lines +139 to +140
* kubelet does not pass any special SELinux option to the container runtime (explicitly, no `:Z`).
Files on the volume already have the right SELinux context, no recursive relabeling happens there.
Member

I think this split responsibility about who should relabel what can be a concern, but from a container runtime perspective this is fine.

@jsafrane jsafrane force-pushed the rework-selinux branch 2 times, most recently from 68637d6 to ba0ed80 Compare January 26, 2022 14:53
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 28, 2022
@jsafrane
Member Author

I squashed old commits, added a new commit with volume reconstruction / upgrade. Marked as implementable.

@gnufied
Member

gnufied commented Jan 31, 2022

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 31, 2022
@gnufied
Member

gnufied commented Jan 31, 2022

The issues related to kubelet restart are solvable (although not very straightforward), and hence the KEP looks good to me.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 31, 2022
* **Is the rollout accompanied by any deprecations and/or removals of features,
APIs, fields of API types, flags, etc.?**
Even if applying deprecation policies, they may still surprise some users.

**YES!** See [Behavioral changes](#behavioral-changes) and [Implementation phases](#implementation-phases) above.
Contributor

The new failure mode looks slightly better to my eyes, but it still looks difficult to debug as a pod author.

Member Author

I improved the error messages where kubelet can know what's wrong.


* **What specific metrics should inform a rollback?**

TBD: We propose a metric above, file its name here.
Contributor

heh. forgot to fill this in?

Member Author

This should not be required for Alpha. We'll figure it out later.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2022
@ddebroy ddebroy left a comment

A couple of minor nits/clarifications


### Examples
1) Kubelet knows that the in-tree AWS CSI plugin supports mounting with `-o context`. The mount option is then used (if pod context is known) and the container runtime does not relabel the volume.
Member

nit: in-tree AWS CSI plugin => in-tree EBS plugin

Member Author

Fixed

* Each volume plugin can choose to use the mount option `-o context=` (e.g. when `CSIDriver.SELinuxRelabelPolicy` is `true`) or ignore it (e.g. in-tree volume plugins for shared filesystems or when `CSIDriver.SELinuxRelabelPolicy` is `false` or `nil`).
* Each volume plugin then returns `SupportsSELinux` from the `GetAttributes()` call, depending on whether it wants the container runtime to relabel the volume (`true`) or not (`false`; the volume was already mounted with the right label, or it does not support SELinux at all).
* When a CSI driver announces `SELinuxMountSupported: true`, kubelet will check that `-o context=X` was correctly applied after `NodePublish()`.
It will report error when the context in `/proc/mounts` does not match the expected value.
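The verification step could be sketched by parsing `/proc/mounts`-style content for the volume's mount point and extracting the `context=` option. A simplified sketch, not kubelet's actual code: real `/proc/mounts` parsing must also handle octal escapes and commas inside a quoted MLS level, which this skips by using a level without categories:

```go
package main

import (
	"fmt"
	"strings"
)

// mountContext extracts the value of the context= mount option for the
// given mount point from /proc/mounts-style content. Simplified: it splits
// the option field on commas, so it would mis-handle an MLS level that
// itself contains commas (e.g. s0:c1,c2).
func mountContext(mounts, mountPoint string) (string, bool) {
	for _, line := range strings.Split(mounts, "\n") {
		f := strings.Fields(line)
		// fields: device mountpoint fstype options dump pass
		if len(f) < 4 || f[1] != mountPoint {
			continue
		}
		for _, opt := range strings.Split(f[3], ",") {
			if strings.HasPrefix(opt, "context=") {
				return strings.Trim(strings.TrimPrefix(opt, "context="), `"`), true
			}
		}
	}
	return "", false
}

func main() {
	mounts := `/dev/sdb /var/lib/kubelet/vol xfs rw,context="system_u:object_r:container_file_t:s0" 0 0`
	ctx, ok := mountContext(mounts, "/var/lib/kubelet/vol")
	fmt.Println(ok, ctx)
}
```

Kubelet would then compare the extracted context against the one it passed to `NodePublish()` and fail the `SetUp()` on mismatch.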
Member

Will an error here (due to mismatch in /proc/mount wrt context) lead to the overall NodePublish being considered failed leading to retries with backoffs until the /proc/mount entry matches expectations? Or will this be a warning event that does not block pod sandbox creation?

Member Author

My initial idea was that it would be a SetUp() error, i.e. kubelet will retry SetUp() until the driver fixes it. The pod won't start.

The question is whether the driver can fix it; if it mounted the volume wrong on the first attempt, any retry probably won't help much here.

Member Author

Reworded the bullet to mention the consequence - the pod won't ever start; the CSI driver is faulty.

@@ -403,7 +403,7 @@ _This section must be completed when targeting alpha to a release._

* **How can this feature be enabled / disabled in a live cluster?**
- [X] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: SELinuxRelabelPolicy
- Feature gate name: `SELinuxMountReadWriteOncePod`
- Components depending on the feature gate: apiserver (API validation only), kubelet
- [ ] Other
- Describe the mechanism:
Contributor

I didn't notice this last time. You're missing an answer for this section. I believe you need a drain and restart of the kubelet to change this setting. Please describe that.

Contributor

I didn't notice this last time. You're missing an answer for this section. I believe you need a drain and restart of the kubelet to change this setting. Please describe that.

I see the answer a few questions down.

@deads2k
Contributor

deads2k commented Feb 1, 2022

The monitoring and troubleshooting sections are often the hardest when moving from alpha to beta. I encourage you to consider those flows while developing the alpha.

The PRR is complete for alpha.

/approve

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 1, 2022
@jsafrane
Member Author

jsafrane commented Feb 2, 2022

Added a commit with last minute edits.

@xing-yang
Contributor

Can you squash the commits?

/lgtm
/approve

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2022
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 2, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: deads2k, jsafrane, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xing-yang
Contributor

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 2, 2022
@k8s-ci-robot k8s-ci-robot merged commit a40b53b into kubernetes:master Feb 2, 2022
@k8s-ci-robot k8s-ci-robot added this to the v1.24 milestone Feb 2, 2022