Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add skip permission change kep #1478

Merged
Merged
Changes from 6 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
166 changes: 166 additions & 0 deletions keps/sig-storage/20200120-skip-permission-change.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,166 @@
---
title: Skip Volume Ownership Change
authors:
- "@gnuified"
owning-sig: sig-storage
participating-sigs:
- sig-auth
reviewers:
- "@msau42"
- "@liggit"
- "@tallclair"
approvers:
- "@saad-ali"
editor: TBD
creation-date: 2020-01-20
last-updated: 2020-01-20
status: implementable
see-also:
replaces:
superseded-by:

---

# Skip Volume Ownership Change

## Table of Contents

<!-- toc -->
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Implementation Details/Notes/Constraints [optional]](#implementation-detailsnotesconstraints-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Graduation Criteria](#graduation-criteria)
- [Test Plan](#test-plan)
- [Monitoring](#monitoring)
- [Implementation History](#implementation-history)
- [Drawbacks [optional]](#drawbacks-optional)
- [Alternatives [optional]](#alternatives-optional)
- [Infrastructure Needed [optional]](#infrastructure-needed-optional)
<!-- /toc -->

## Summary

Currently before a volume is bind-mounted inside a container the permissions on
that volume are changed recursively to the provided fsGroup value. This change
in ownership can take an excessively long time to complete, especially for very
large volumes (>=1TB) as well as a few other reasons detailed in [Motivation](#motivation).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Is the issue large volumes or the number of files/directories in the volume? Let's be very clear about that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix this in a follow up

To solve this issue we will add a new field called `pod.Spec.Volumes[].FSGroupPermissionChangePolicy` and
allow the user to specify whether they want the permission and ownership change to occur for that volume.

## Motivation

When a volume is mounted on the node, we recursively change permissions of volume
before bind mounting the volume inside container. The reason of doing this is to ensure
that volumes are readable/writable by provided fsGroup.

But this presents following problems:
- An application(many popular databases) which is sensitive to permission bits changing
underneath may refuse to start whenever volume being used inside pod gets mounted on
different node.
gnufied marked this conversation as resolved.
Show resolved Hide resolved
- If volume has a large number of files, performing recursive `chown` and `chmod`
could be slow and could cause timeout while starting the pod.

### Goals

- Allow volume ownership and permission change to be skipped during mount

### Non-Goals

- In some cases if user brings in a large enough volume from outside, the first time ownership and permission change still could take lot of time.
- On SELinux enabled distributions we will still do recursive chcon whenever applicable and handling that is outside the scope.
gnufied marked this conversation as resolved.
Show resolved Hide resolved
gnufied marked this conversation as resolved.
Show resolved Hide resolved
- This proposal does not attempt to fix two pods using same volume with conflicting fsgroup. `FSGroupPermissionChangePolicy` also will be only applicable to volume types which support setting fsgroup.
gnufied marked this conversation as resolved.
Show resolved Hide resolved

## Proposal

We propose that an user can optionally opt-in to skip recursive ownership(and permission) change on a volume if volume already has right permissions.

### Implementation Details/Notes/Constraints [optional]

Currently Volume permission and ownership change is done using a breadth-first algorithm starting from root directory of the volume. As part of this proposal we will change the algorithm to modify permissions/ownership of the root directory after all directories/files underneath have their permissions changed.

When creating a pod, we propose that `pod.Spec.Volumes[]` field expanded to include a new field called `FSGroupPermissionChangePolicy` which can have following possible values:

- `Always` --> Always change the permissions and ownership to match fsGroup. This is the current behavior and it will be the default one when this proposal is implemented. Algorithm that performs permission change however will be changed to perform permission change of top level directory last.
- `OnRootDirPermissionMismatch` --> Only perform permission and ownership change if permissions of top level directory does not match with expected permissions and ownership.
saad-ali marked this conversation as resolved.
Show resolved Hide resolved

```go
type PodFSGroupPermissionChangePolicy string

const(
ChangePermissionOnRootDirmismatch PodFSGroupPermissionChangePolicy = "OnRootDirPermissionMismatch"
AlwaysChangeVolumePermission PodFSGroupPermissionChangePolicy = "Always"
)

// Volume represents a named volume in a pod that may be accessed by any containers in the pod.
type Volume struct {
gnufied marked this conversation as resolved.
Show resolved Hide resolved
// Required: This must be a DNS_LABEL. Each volume in a pod must have
// a unique name.
Name string
// The VolumeSource represents the location and type of a volume to mount.
// This is optional for now. If not specified, the Volume is implied to be an EmptyDir.
// This implied behavior is deprecated and will be removed in a future version.
// +optional
VolumeSource
// FSGroupPermissionChangePolicy ← new field
// Defines behavior of changing ownership and permission of the volume
// before being exposed inside Pod. This field will only apply to
// volume types which support fsGroup based ownership(and permissions).
// It will have no effect on ephemeral volume types such as: secret, configmaps
// and emptydir.
Comment on lines +103 to +104
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: How is it going to interact with ephemeral CSI volumes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the heuristics defined for detecting fsGroup support for CSI volumes - detects that changing volume ownership is supported , then proposed policies will work for ephemeral CSI volumes. Also these policies have any effect only when a fsGroup is present in pod's spec and hence we should be fine.

// + optional
FSGroupPermissionChangePolicy *PodFSGroupPermissionChangePolicy
gnufied marked this conversation as resolved.
Show resolved Hide resolved
}
```

### Risks and Mitigations

- One of the risks is if user volume's permission was previously changed using old algorithm(which changes permission of top level directory first) and user opts in for `ChangePermissionOnRootDirmismatch` `FSGroupPermissionChangePolicy` then we can't distinguish if the volume was previously only partially recursively chown'd.


## Graduation Criteria

* Alpha in 1.18 provided all tests are passing and gated by the feature Gate
`ConfigurableFSGroupPermissions` and set to a default of `False`

* Beta in 1.19 with design validated by at least two customer deployments
(non-production), with discussions in SIG-Storage regarding success of
deployments. A metric will be added to report time taken to perform a
volume ownership change. Also e2e tests that verify volume permissions with various `FSGroupPermissionChangePolicy`.
* GA in 1.20, with Node E2E tests in place tagged with feature Storage
gnufied marked this conversation as resolved.
Show resolved Hide resolved


[umbrella issues]: https://github.com/kubernetes/kubernetes/issues/69699

### Test Plan

A test plan will consist of the following tests

* Basic tests including a permutation of the following values
- pod.Spec.Volumes[].FSGroupPermissionChangePolicy (ChangePermissionOnRootDirmismatch/Always)
- PersistentVolumeClaimStatus.FSGroup (matching, non-matching)
- Volume Filesystem existing permissions (none, matching, non-matching, partial-matching?)
* E2E tests


### Monitoring

We will add a metric that measures the volume ownership change times.

## Implementation History

- 2020-01-20 Initial KEP pull request submitted

## Drawbacks [optional]


## Alternatives [optional]

gnufied marked this conversation as resolved.
Show resolved Hide resolved
We considered various alternatives before proposing changes mentioned in this proposal.
- We considered using a shiftfs(https://github.com/linuxkit/linuxkit/tree/master/projects/shiftfs) like solution for mounting volumes inside containers without changing permissions on the host. But any such solution is technically not feasible until support in Linux kernel improves.
- We also considered redesigning volume permission API to better support different volume types and different operating systems because `fsGroup` is somewhat Linux specific. But any such redesign has to be backward compatible and given lack of clarity about how the new API should look like, we aren't quite ready to do that yet.

## Infrastructure Needed [optional]