Add spec change for volume mount group #468
(force-pushed 3c23d41 to e2f95a6)
I feel like there should be a clarification near the top or bottom of the spec that explains that when we say "all the files and directories are readable and writable", that is a continuous expectation for the life of the publish, not a point-in-time expectation at publish time. That clarification would rule out the chown/chmod based tricks.
spec.md
Outdated
// It is expected that SP SHOULD use provided volume_mount_group
// for mounting the volume and volume should remain readable and
// writable by workloads associated with volume_mount_group until
// corresponding NodeUnstageVolume or NodePublishVolume is called.
@bswartz how does this sound to indicate lifetime of group based permissions?
(force-pushed 061c7f6 to 14ea8bc)
csi.proto
Outdated
// provided volume_mount_group and all files and directories
// within the volume are readable and writable by the provided
// volume_mount_group.
// The value of volume_mount_group should be group_id or group name
I think we have to be clear whether COs may specify a name or an ID here. If we can avoid specifically mentioning UNIX and Windows, and use language that applies equally well to both, that would be ideal.
For example, we could say it must be the ID of the group. On UNIX that would imply a decimal-formatted GID string. On Windows that would imply a SID string. If we say that names are also acceptable, then I would expect that to be true on both Windows and UNIX.
I want it to be unambiguous what the CO is required to pass down here, and I want it to be unambiguous what the SP needs to be prepared to deal with here.
reworded as we discussed offline. Now this reads:
// The value of volume_mount_group should be group
// identifier (as determined by underlying operating system)
// which would be associated with workload that uses the volume.
csi.proto
Outdated
// within the volume are readable and writable by the provided
// volume_mount_group.
// If NodeStageVolume was previously called with volume_mount_group
// CO MUST ensure that NodePublishVolume uses the same
I thought we said this is a SHOULD, not a MUST. I think the semantics we want are: if the CO does pass the same volume_mount_group as it did to stage, then the SP MUST NOT fail for reasons related to the group. If the CO passes a different group, then the operation MAY fail, and that should be communicated in the error code.
I thought that if NodePublish gets called with a different value than NodeStage, it just isn't going to work for ANY storage provider, because the volume was already staged with a different group identifier, so there is no point in using SHOULD language there.
The place where we wanted to use SHOULD or MAY is when the SP did not have NodeStage but two or more NodePublish calls were made with different volume_mount_group for the same volume (on the same node) but different target paths. Such a case MAY or MAY NOT be supported by the SP, and hence I covered this case with an error code:
| Volume already published with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that the volume with specified `volume_id` was already node published using different `volume_mount_group` on this node and can not be node published. | Caller MUST ensure that `NodePublishVolume` is called with same `volume_mount_group` on all target paths. |
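For illustration of how an SP might surface that error (a hedged sketch only; the `nodeServer` type, `trackedGroups` map, and helper name are invented for this example, while the gRPC status code is the one named in the row above):

```go
package sp

import (
	"sync"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// nodeServer is an illustrative stand-in for an SP's CSI node service.
// It assumes trackedGroups has been initialised elsewhere.
type nodeServer struct {
	mu            sync.Mutex
	trackedGroups map[string]string // volume_id -> volume_mount_group used so far on this node
}

// checkMountGroup rejects a publish when the same volume was already
// node published on this node with a different volume_mount_group.
func (s *nodeServer) checkMountGroup(volumeID, requestedGroup string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	if existing, ok := s.trackedGroups[volumeID]; ok && existing != requestedGroup {
		return status.Errorf(codes.FailedPrecondition,
			"volume %s already published with volume_mount_group %q, got %q",
			volumeID, existing, requestedGroup)
	}
	s.trackedGroups[volumeID] = requestedGroup
	return nil
}
```

In practice an SP would also clear the tracked entry on NodeUnpublishVolume/NodeUnstageVolume; the gist is only that the conflict is detected and reported as FAILED_PRECONDITION rather than silently remounting with a different group.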
csi.proto
Outdated
// Indicates that Node service supports mounting volumes
// with provided volume group identifier during node stage
// or node publish RPC calls.
// It is expected that SP SHOULD use provided volume_mount_group
We have to say here that if the capability is asserted, and the CO passes a non-nil group, then it MUST be respected. I don't think SHOULD is the right language here.
reworded.
Still waiting for morning coffee to kick in, but I've taken a stab at some initial feedback anyway.
spec.md
Outdated
// If SP has VOLUME_MOUNT_GROUP node capability and CO provides
// this field then SP MUST ensure that volume is mounted with
// provided volume_mount_group and all files and directories
I'm concerned that the wording "all files and directories within the volume are readable and writable by ..." leaks implementation detail, and I think it should be removed. In terms of guarantees, the spec should aim no higher than what's provided by mount (and the OS driver implementation). Driver implementations of mount are already a black box, and it feels odd to layer additional requirements on top of an underspecified interface. Furthermore, just because k8s wants to implement a recursive chown doesn't mean that it's appropriate for all CSI driver implementations, or all CO implementations/deployments.
Also, removing the recursive chown wording makes this more interesting for block volumes. Perhaps ownership means something for those volumes too? And if so, do we really want to tie this to "mount", or maybe we need a different word if "mount" is only for fs-type volumes (as opposed to block-type volumes).
> Furthermore, just because k8s wants to implement a recursive chown doesn't mean that it's appropriate for all CSI driver implementations, or all CO implementations/deployments.

Just to clarify - if a driver has this capability, k8s will NOT chown/chmod files belonging to this volume. Instead the plan is: k8s will assume the SP has done its job and the volume has the right permissions after NodeStage/NodePublish. Also, the reason we need this as a mount flag is mainly because in many cases the k8s/CO can not recursively chown files (for example, a cifs mount without unix extensions).

> I'm concerned that the wording "all files and directories within the volume are readable and writable by ..." leaks implementation detail, thinking that it should be removed.

Yes, the wording does get into implementation details. My reasoning, as originally discussed with @bswartz, was that we do not want SPs to use this field to recursively chown/chmod files at all, because chown/chmod is very much a CO-specific thing and we actively wanted to prevent SPs from doing that. When using chown, for example, newly written files on the volume may not be group writable and hence would violate the CSI spec. But I can see how this leaks into mount implementation details too.
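To make the "mount flag" point concrete, here is a minimal sketch of what an SP-side mount might look like for an SMB/CIFS share. The gid/dir_mode/file_mode option names are standard mount.cifs options; the helper itself and its parameters are hypothetical and not part of this PR:

```go
package sp

import (
	"fmt"
	"os/exec"
)

// mountSMBWithGroup shows how an SP could fold the CO-supplied
// volume_mount_group into the mount options for a CIFS/SMB share,
// instead of recursively chown/chmod-ing files after the fact.
func mountSMBWithGroup(source, target, volumeMountGroup string) error {
	opts := fmt.Sprintf("gid=%s,dir_mode=0770,file_mode=0770", volumeMountGroup)
	cmd := exec.Command("mount", "-t", "cifs", "-o", opts, source, target)
	if out, err := cmd.CombinedOutput(); err != nil {
		return fmt.Errorf("mount of %s failed: %v: %s", source, err, out)
	}
	return nil
}
```

The point of the example is only that group ownership is applied by the mount itself, which also covers filesystems (like CIFS without unix extensions) where a recursive chown is not even possible.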
Still not happy with this wording:

> ... then SP MUST ensure that volume is mounted with provided volume_mount_group and all files and directories within the volume are readable and writable by the provided volume_mount_group.

(a) what does "mounted with provided volume_mount_group" actually mean?
(b) all files + dirs readable/writable sounds like a recursive chown to me, unless this is a validation check? which doesn't really make sense in this context.
I'd like to take a stab at some alt wording, which attempts to decrease the scope of what CSI is willing to promise:
"... then the SP MUST ensure that the volume_mount_group parameter is passed as the group (primary group, if possible) identifier to the underlying OS mount system call, with the understanding that the set of available mount call parameters and/or mount implementations may vary across operating systems. Additionally, new file and/or directory entries written to the underlying filesystem SHOULD be permission-labeled in such a manner, unless otherwise modified by a workload, that they are both readable and writable by said mount group identifier."
.. which tries to get away from saying that everything on a preexisting volume will be read/write-able (which gets into recursive chown territory), and also tries to cut to the chase of the requirement that I think I'm hearing: that newly written blobs are the concern here, and they should (by default) be labeled with the group ID presented here.
also, if a workload chown's files on the volume such that it can no longer read them .. well, tough luck. CSI is not a data plane API. it's up to the SP implementation and/or OS to restrict such operations if that behavior is undesirable. CSI should not promise anything with respect to what I/O operations a workload is allowed to execute (because CSI is a control plane API).
I think the updated wording looks fine. The only question I have is: why "primary group, if possible"? k8s at least will almost never pass the primary group of the workload.
I was thinking some more about it, and I agree that your updated wording meets the requirement and everything, but one more thing which has been bothering me somewhat is that we don't really need to call out permissions of new files/directories. The wording does imply that chown/chmod may not be enough, but what we really want SPs to use is mount options, hoping that a mount operation with the provided group identifier is enough to apply the necessary permissions.
So I am thinking of two alternative wordings:
"...then the SP MUST ensure that the volume_mount_group parameter is passed as the group identifier to the underlying OS mount system call, with the understanding that this would result in the volume being readable (if published as readonly) or both readable and writable by said mount group identifier. The set of available mount call parameters and/or mount implementations may vary across operating systems." --> This covers the readonly interaction as well.
Or something like:
"... then the SP MUST ensure that the volume_mount_group parameter is passed as the group identifier to the underlying OS mount system call, with the understanding that the set of available mount call parameters and/or mount implementations may vary across operating systems. Additionally, it is expected that file and/or directory entries on the underlying filesystem SHOULD be permission-labeled in such a manner, unless otherwise modified by a workload, that they are either readable (in case of readonly volumes) or both readable and writable by said mount group identifier." --> also covers the interaction with readonly.
My wording says the permission labeling (a la the group option) targets new file(s)/dir(s); your wording doesn't include the "new", which in my mind translates to "recursive chown".
If you did not intend to drop "new" from the wording, please add it back. Or maybe you're only trying to amend my wording to account for (a) readable files in the read-only case, or else (b) read/write-able files otherwise?
My main concern with accounting for "new" files is: are we being too specific? An SP may violate this and still satisfy what the CO expects. But to be clear, we do not have that problem at least with Azurefile/CIFS-Samba mounts. It works exactly like how you worded it, and it should be fine. So maybe I am over-thinking this.
alright - I have incorporated this wording. I thought again about the readonly case, and I do not think it is important to call that out here again, because readonly can be enforced in different ways than by combining it with gid mount options.
spec.md
Outdated
// If SP has VOLUME_MOUNT_GROUP node capability and CO provides
// this field then SP MUST ensure that volume is mounted with
// provided volume_mount_group and all files and directories
ditto re: recursive chown requirement
spec.md
Outdated
@@ -2260,6 +2288,8 @@ The CO MUST implement the specified error recovery behavior when it encounters t
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `target_path` but is incompatible with the specified `volume_capability` or `readonly` flag. | Caller MUST fix the arguments before retrying. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
| Staging target path not set | 9 FAILED_PRECONDITION | Indicates that `STAGE_UNSTAGE_VOLUME` capability is set but no `staging_target_path` was set. | Caller MUST make sure call to `NodeStageVolume` is made and returns success before retrying with valid `staging_target_path`. |
| Volume staged with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that volume with specified `volume_id` was node staged using different `volume_mount_group` on this node and hence can not be node published. | Caller MUST make sure that `NodePublishVolume` is called with same `volume_mount_group` which was used in `NodeStageVolume`. |
| Volume already published with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that the volume with specified `volume_id` was already node published using different `volume_mount_group` on this node and can not be node published. | Caller MUST ensure that `NodePublishVolume` is called with same `volume_mount_group` on all target paths. |
(a) if this case is to cover the scenario where node-stage is unimplemented, please say so explicitly.
(b) it is unclear whether this behavior is "optional", meaning that if some implementations support multiple publish calls with different groups, are they still required to throw this error? From earlier review comments in the spec it seems like this was to be optionally supported, but the language here does not make that clear.
I have added some language to clarify the intent - 94346e3#diff-bc6661da34ecae62fbe724bb93fd69b91a7f81143f2683a81163231de7e3b545R2267
So I have gone back a little bit on this one and updated the wording to be the same for both nodestage->nodepublish and nodepublish->nodepublish. Originally, the wording said the CO MUST use the same volume_mount_group identifier with NodePublish that was used during NodeStage. I have relaxed it with the following wording:
// If Plugin does not support NodePublishVolume with different
// volume_mount_group than the one used during NodeStageVolume
// then Plugin MAY return FAILED_PRECONDITION error.
// Similarly if SP does not support NodePublishVolume of same volume
// on same node but with different volume_mount_group it MAY return
// FAILED_PRECONDITION error.
The idea is that some plugins may not perform an actual mount on NodeStage and may only do some housekeeping during staging, and hence, if the underlying driver supports it, multiple NodePublish calls for the same volume can use a different gid than what was specified on NodeStage. It did not make sense to penalize a Plugin for implementing NodeStage.
cc @bswartz ^
spec.md
Outdated
// identifier (as determined by underlying operating system)
// which would be associated with workload that uses the volume.
// This is an OPTIONAL field.
string volume_mount_group = 9;
so the next thing that someone will ask for is "mount user". and then someone else will want "mount SELinux labels". and then "extended ACLs". and so on.
at the risk of over-engineering a solution here, i'm wondering if it might make sense to have a top-level type, e.g. VolumeOwnership, to capture the intent:
message VolumeOwnership {
  string primary_group = 1;
}
and then the field in this message becomes:
VolumeOwnership volume_mount_owner = 9;
The concern about mount_user is valid. We may need that in the future. SELinux however is solvable without any CSI spec change - see https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/1710-selinux-relabeling/README.md
Also, the reason we need this as an explicit field in nodestage/nodepublish is because, unlike SELinux mount options, which have a standard key/value format (like mount -o context=<>), the group mount option has no such standard format, so the CO does not know how to format it.
Did we conclude on this? I think it would be nice to leave room for future options.
hmm, so I was trying to update the spec to move volume_mount_group to its own message type, but there are some issues to think about.
Currently the spec says that if a volume was staged with one group identifier, it can't be node published with a different group identifier on the same node. This makes sense for the group identifier, but what if in the future it does not make sense for uid or something else?
Using SELinux as an example: if we were to add selinux_context to the VolumeOwnership message, then it would be an error to NodePublish a volume with different SELinux contexts on the same node; it just won't work. But it does work for the group identifier, for many CIFS implementations at least.
So bringing all of the ownership-related options within one message can also complicate such future changes. What if the behaviour of those ownership-related flags is different and their interaction with the CO and the workload is different?
Does the restriction have to apply at the higher-level VolumeOwnership, or can it apply to individual fields?
I would expect restrictions to apply to individual fields, but error handling and recovery has to be appropriately adjusted so that it can apply to individual fields.
The trouble is basically appropriately wording a VolumeOwnership type which can apply to both NodeStage and NodePublish, then defining errors and recovery for both stage and publish in the same place and type, and then ensuring that error handling and recovery appropriately applies to all the fields and is easy to extend in future.
This IMO leads to an excessively long definition of, for example, volume_mount_group (because among other things we now have to define both publish-time and stage-time behaviour).
It is not impossible, but I am not sure if this is worth the tradeoff. @jdef @msau42 what do you think?
we could consider other naming:
message VolumePrincipal {
  string group = 1;
}
we can certainly do this. Did we want this as part of the VolumeCapability field?
@jdef My suggestion was not to expose implementation detail, but to highlight the fact that the ownership should be a continuous thing over the life of the publish, and not a discrete one-time action at publish time. We shouldn't care how the plugin achieves that, but my goal was to rule out things like a chown.
I am happy with this wording. Any other outstanding changes before we approve?
@bswartz I added additional language in …
spec.md
Outdated
// If SP has VOLUME_MOUNT_GROUP node capability and CO provides
// this field then SP MUST ensure that volume is mounted with
// provided volume_mount_group and all files and directories
Feel free to strike the "primary", I think I really wanted "default", which I think is captured later on.
I need to think through the error cases a bit more. I'm a bit concerned about the complexity added here. I'd rather err on the side of caution, meaning more conservative .. sacrificing flexibility for simplicity.
@jdef thanks. Do you think we can get this in for the next CSI release (i.e. 1.4), or is it better suited for 1.5? I need to make some k8s changes, and if we can't make this in 1.4 then I would rather avoid the extra work.
Posting an update since this PR will likely miss the cut - I am waiting for some explanation from @jdef on #468 (comment). We could make the value of …
@jdef so one thing we can do to sacrifice some of the flexibility and reduce the two different errors into one is to enforce that, for the same volume_id, node stage, node publish, and multiple instances of node publish on the same node always use the same volume_mount_group. This will allow us to squash the two different errors into one:
(force-pushed b3f3c6f to 98893c5)
Thanks @gnufied. Can you provide examples of a couple of CSI Drivers and how they would implement this, both for Linux and for Windows? e.g. "AzureFile will convert this into a …". Sorry if this was mentioned elsewhere, just want to wrap my head around the change.
On Linux the Azurefile CSI driver would convert the supplied …
On Windows, currently at least in k8s, there is no implementation of using fsgroups/gids for pods. But when I spoke with @ddebroy on slack, he suggested that the supplied group_id can be a SID that can be used to lock down the volume for a specific security identifier (https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/security-identifiers-in-windows).
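A rough sketch of that per-OS split, purely to make the discussion concrete. Both the helper and the Windows handling are speculative illustrations of the comment above, not real driver code; real SMB drivers differ in how they consume a SID:

```go
package sp

import (
	"fmt"
	"runtime"
)

// mountOptionsForGroup sketches how an SP might interpret volume_mount_group
// depending on the operating system: a numeric gid on Linux, a SID on Windows.
func mountOptionsForGroup(volumeMountGroup string) ([]string, error) {
	switch runtime.GOOS {
	case "linux":
		// e.g. mount -t cifs -o gid=<gid>,... //server/share /target
		return []string{fmt.Sprintf("gid=%s", volumeMountGroup)}, nil
	case "windows":
		// e.g. restrict the SMB mapping / volume ACL to the supplied
		// security identifier (SID); exact mechanism is driver-specific.
		return []string{fmt.Sprintf("sid=%s", volumeMountGroup)}, nil
	default:
		return nil, fmt.Errorf("unsupported OS %q", runtime.GOOS)
	}
}
```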
I spoke with @gnufied offline about this feature. He explained:
In general I worry about adding any permissions logic to CSI for a couple of reasons: 1) I question if manipulating file permissions actually adds any real security, and 2) ensuring file permissions are added in a way that will work across all OS/filesystem combos will be hard. For my first concern, @gnufied explained that many apps won't work with world-readable files, so "meeting the user where they are" is one reason we need fsgroup functionality, and the other argument is security* (with an asterisk because you only get security benefits if you also limit the ranges pods can have, and/or dynamically assign the ranges). For my second concern, @gnufied explained this approach should work for both Windows and Linux, at least for azurefile. Given how important this is for AzureFile, I'm keen to approve this. @jdef @ddebroy any objections?
LGTM. No objections based on relevance of this for Azurefile as well as any other CSI plugin that leverages SMB.
One thing which is still not clear to me here is the reference to …
thanks for following up on this. i think this proposal needs a bit more work; left some feedback. i'm a bit concerned that the OP still has questions about integration/e2e flows between CO and SP. i'm wondering if we'd benefit from some kind of thought-experiment-in-the-form-of-docs that walks through a hypothetical CO/SP integration, calling out requisite configuration and how that manifests at runtime via CSI calls, and resulting system state.
spec.md
Outdated
@@ -2315,6 +2348,8 @@ The CO MUST implement the specified error recovery behavior when it encounters t
| Volume published but is incompatible | 6 ALREADY_EXISTS | Indicates that a volume corresponding to the specified `volume_id` has already been published at the specified `target_path` but is incompatible with the specified `volume_capability` or `readonly` flag. | Caller MUST fix the arguments before retrying. |
| Exceeds capabilities | 9 FAILED_PRECONDITION | Indicates that the CO has exceeded the volume's capabilities because the volume does not have MULTI_NODE capability. | Caller MAY choose to call `ValidateVolumeCapabilities` to validate the volume capabilities, or wait for the volume to be unpublished on the node. |
| Staging target path not set | 9 FAILED_PRECONDITION | Indicates that `STAGE_UNSTAGE_VOLUME` capability is set but no `staging_target_path` was set. | Caller MUST make sure call to `NodeStageVolume` is made and returns success before retrying with valid `staging_target_path`. |
| Volume staged or published with different volume_mount_group | 9 FAILED_PRECONDITION | Indicates that volume with specified `volume_id` was node staged or published using different `volume_mount_group` on this node and hence can not be node published. | Caller MUST make sure that all `NodePublishVolume` calls specify same value of `volume_mount_group`. |
have we considered how this feature works in light of the new single_node_multi_writer functionality being proposed?
Currently the definition of volume_mount_group is stricter and requires that all node-publish RPC calls use the same value of volume_mount_group. So basically, if T1!=T2 and P1!=P2 (where Pn includes volume_mount_group), then for single_node_multi_writer volumes, node-publish should fail.
Having said that, we can relax the wording once #476 merges, or if this merges first then we can update #476. The relaxed wording will be: if an SP supports both SINGLE_NODE_MULTI_WRITER and VOLUME_MOUNT_GROUP capability, then it MUST allow NodePublishing the same volume multiple times on the same node with different volume_mount_group. The cifs implementations that I know of already support that (you can mount the same volume multiple times on the node with a different gid).
The language here is however stricter, because we wanted to start with something strict and then relax the wording if needed (an earlier version of the design did allow the aforementioned flow).
spec.md
Outdated
@@ -2134,6 +2134,19 @@ message NodeStageVolumeRequest {
// This field is OPTIONAL and MUST match the volume_context of the
// volume identified by `volume_id`.
map<string, string> volume_context = 6;
this field is declared both for node-stage, and for node-publish. why not declare it ONCE as part of VolumeCapability?
or, if this is only for mount .. then perhaps as part of MountVolume?
and what does this field mean in the context of block vols?
It is not possible to use the mount flags defined in the volume capability (inside MountVolume) because:

- The flags defined inside the volume capability are static and part of the CreateVolume response, whereas this mount option depends on the workload that is trying to use the volume and could change.
- Say even if we pretend that the volume capability can be modified at any point in time and is not tied to the Volume object: how does the CO know it should supply the -o gid=<> mount flag during NodeStage or NodePublish?
hmm - I think your response meant the flag should not be included as mount_flags inside the MountVolume type, but rather as a separate field in the MountVolume type? I misunderstood it to be the former.
Yes, having a separate volume_mount_group field inside MountVolume could work. It would have the same semantic meaning as when it is defined outside (as in the current PR). The main concern though is that it would perhaps alter the meaning of volume capability. But if that is an acceptable trade-off I can make that change.
I have moved this field to VolumeCapability with everything else being the same. I think this also allows us to clean up the error handling: it is now simply determined by whether the SP supports publishing a volume with different capabilities on the same node. It also addresses the interaction of this PR with #476.
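For a sense of what that move means on the SP side, here is a hedged handler sketch. It assumes the generated Go bindings expose the new field through a VolumeMountGroup getter on the mount-style capability; the `nodeServer` type, the `mountSMBWithGroup` helper, and the "source" volume_context key are carried over from the earlier illustrative sketches and are not part of the spec:

```go
package sp

import (
	"context"

	"github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// NodePublishVolume sketch: read volume_mount_group from the mount-style
// volume capability and hand it to the mount helper from the earlier sketch.
func (s *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	mnt := req.GetVolumeCapability().GetMount()
	if mnt == nil {
		return nil, status.Error(codes.InvalidArgument, "this sketch only handles mount volumes")
	}

	group := mnt.GetVolumeMountGroup()          // empty when the CO did not set it
	source := req.GetVolumeContext()["source"]  // hypothetical volume_context key
	if err := mountSMBWithGroup(source, req.GetTargetPath(), group); err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}
	return &csi.NodePublishVolumeResponse{}, nil
}
```

Because the field now travels inside the volume capability, the same lookup works for both NodeStageVolume and NodePublishVolume without declaring it twice.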
spec.md
Outdated
// identifier (as determined by underlying operating system)
// which would be associated with workload that uses the volume.
// This is an OPTIONAL field.
string volume_mount_group = 9;
Trying to address both @humblec's and @jdef's concerns here - there is a k8s KEP which describes how this will be used: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/2317-fsgroup-on-mount/README.md . Beyond that, I have written some code that uses this new field, and integration with the CO and SP is pretty straightforward. @humblec the supplemental groups and fsgroups are the CO's internal implementation details that do not necessarily bubble up to the CSI spec. But for completeness' sake, I will try and clarify here: in k8s, supplemental groups are additional groups that are added to a pod (not the primary GID). They can be used to match the gid of the underlying volume, so that k8s does not have to perform any permission/ownership change if the supplemental group already matches the volume's underlying group ownership/permissions. Supplemental groups do not result in any permission change of the volume and hence are not useful for the problem we are trying to solve.
Sure, supplemental group is a general concept (*nix) too, so I brought it up. True, I made use of that heavily in the GlusterFS plugin under the kube CO to overcome permission-denied errors and such.
In short, the …
@gnufied with this interface change, and also the k/k code change, the following …
@andyzhangx yes that is accurate.
@jdef @saad-ali @bswartz @msau42 please take another look. As discussed in #468 (comment), I have moved the new field to VolumeCapability.
/lgtm
PTAL ASAP; trying to get this merged and a new release cut on Monday.
@gnufied just to complete my understanding, if the volume is …
Sure @saad-ali, the change looks good to me!! 👍
LGTM mod nits
In the case of the SINGLE_NODE_MULTI_WRITER flow (another PR looking to land ASAP), is it legal to mount the same volume to multiple workloads on the same node with different volume_mount_groups? If not, should it be? I have my doubts as to whether all SPs (or even OSs) can support this. May be worth calling out somewhere/somehow as an error case that the CO will need to deal with.
@jdef the SINGLE_NODE_MULTI_WRITER PR has wording that says a CO SHOULD not call … I can add more explicit wording, but I thought that language covers this limitation well enough. @humblec Generally speaking, yes, it is possible that the CO will call node-stage/node-publish with different …
/lgtm
Thanks folks! Will go ahead and merge this.
Apply suggestions from code review Co-authored-by: jdef <2348332+jdef@users.noreply.github.com>
Thanks for the rebase. Merging now.
Adds an alpha field to NodeStage and NodePublish which would allow the CO to supply the group identifier of the workload, so that the volume can have group permissions when published to a node. Fixes #449