IsHostUnprivileged flag for host #228

ashahab-altiscale · 2015-10-21T23:03:14Z

This flag indicates that the host that is launching the container is
unprivileged. Setting this to true ensures that implementations will be
able avoid erroring on actions that unprivileged hosts cannot
perform(such as mknod).

An example of this is in runc: opencontainers/runc#351
Signed-off-by: Abin Shahab ashahab@altiscale.com

This flag indicates that the host that is launching the container is unprivileged. Setting this to true ensures that implementations will be able avoid erroring on actions that unprivileged hosts cannot perform(such as mknod). Signed-off-by: Abin Shahab <ashahab@altiscale.com>

ashahab-altiscale · 2015-10-21T23:04:22Z

@LK4D4 @vishh @avagin What do you guys think?

wking · 2015-10-21T23:54:07Z

On Wed, Oct 21, 2015 at 04:03:16PM -0700, Abin Shahab wrote:

This flag indicates that the host that is launching the container is
unprivileged. Setting this to true ensures that implementations will
be able avoid erroring on actions that unprivileged hosts cannot
perform(such as mknod).

“Unprivileged” isn't very clear. It looks like
opencontainers/runc#351 is using this to switch between bind-mounting
or mknod-ing devices. It seems like you should be able to check
CAP_MKNOD for that 1, and not have an additional flag here.

Stepping back, my personal preference would be to not have default
actions like mknod-ing devices. That seems like it could be easily
handled in pre-start hooks for users that want it (and we could even
provide a hook script for the devices from #164). I think trying to
paper over all the special cases about when you're allowed to do what
is just going to add more confusion to an already complicated subject.

ashahab-altiscale · 2015-10-22T00:19:24Z

@wking If I launched a container with a non-root user from host mapped as the root in the container, and then try launching a container from inside the container, this child container launch is going to fail even if the CAP_MKNOD is enabled, as this is an inability of the "host"(which is an unprivileged container).

wking · 2015-10-22T03:33:39Z

On Wed, Oct 21, 2015 at 05:19:25PM -0700, Abin Shahab wrote:

@wking If I launched a container with a non-root user from host
mapped as the root in the container, and then try launching a
container from inside the container, this child container launch is
going to fail even if the CAP_MKNOD is enabled, as this is an
inability of the "host"(which is an unprivileged container).

Oh, I see. I that case (and if we have keep the mknod calls enabled
by default), how about we just revist my third point from 1 and have
folks who know they don't have mknod permissions bind-mount their host
/dev themselves before the pivot root?

And again, I think this whole thing is much easier with the “op-into
everything” approach taken by ccon, where you put your pivot root in
explicitly if and where you need it (see 2) and we leave mknod
creation (if any) to user-specified pre-start hooks.

LK4D4 · 2015-10-22T03:48:41Z

Makes sense for me. However maybe we can find better naming.

ashahab-altiscale · 2015-10-22T04:14:29Z

@wking I agree with your assessment. Thanks for the links.
However, there's another issue with an unprivileged host. It cannot write to cgroups/devices as runC/libcontainer right now tries to. This is probably a bug in cgroups, but right now we can either (1) catch and ignore that error, or (2) use the above special flag to avoid writing to devices. I'd much rather not use the special flag, but doing (1) is basically eating an exception, and that may cause bad bugs later.

wking · 2015-10-22T16:22:11Z

On Wed, Oct 21, 2015 at 09:14:30PM -0700, Abin Shahab wrote:

However, there's another issue with an unprivileged host. It cannot
write to cgroups/devices as runC/libcontainer right now tries to.

Is that just because of the uid:gid ownership of the cgroup files?
Folks who want cgroup handling by unprivileged users will probably
need a privileged user to setup subgroups they can manage (e.g. see
1).

This is probably a bug in cgroups, but right now we can either (1)
catch and ignore that error, or (2) use the above special flag to
avoid writing to devices. I'd much rather not use the special flag,
but doing (1) is basically eating an exception, and that may cause
bad bugs later.

Or (3) adjust the spec to make this sort of activity op-in, or (4)
adjust the spec to add opt-out flags specific to each automatic
action. IsHostUnprivileged (2) bridges these with a single flag
covering all opt-outs (a less-granular (4)) without documenting what
it's opting you out of, and it's opting you into a /dev bind-mount at
the same time (like (3)).

crosbymichael · 2015-10-22T17:24:58Z

I don't think i'm that big of a fan of this change. Let me think about it more but I think we can handle these type of things internally in a runtime.

ashahab-altiscale · 2015-10-22T17:31:05Z

Sure, I would prefer handling it internally if we can read some state that
tells us this host is unable to perform certain automatic task(writing to
devices.allow and devices.deny). I like the idea of the opting out also.

On Thu, Oct 22, 2015 at 10:25 AM, Michael Crosby notifications@github.com
wrote:

I don't think i'm that big of a fan of this change. Let me think about it
more but I think we can handle these type of things internally in a runtime.

—
Reply to this email directly or view it on GitHub
#228 (comment).

crosbymichael · 2015-10-22T17:32:50Z

Is this because you want to support nested user namespaces?

ashahab-altiscale · 2015-10-22T17:37:16Z

@crosbymichael Yes, that is our usecase: opencontainers/runc#350
More generally, it would be great to be able to launch containers from inside a user namespaced container that nests the namespaces and cgroups and the launcher(invoker of runC) is impervious of the fact that the "host" is user namespaced.

dqminh · 2015-10-22T21:45:10Z

I dont think bring this right now into the specs is a good idea. In fact the approach might be flawed. Without cgroup namespacing, i think the only possible way to do it cleanly right now is to communicate to an external privileged process that can manage cgroups for us ( like systemd if possible or cgmanager ).

wking · 2015-10-22T21:57:09Z

On Thu, Oct 22, 2015 at 02:45:12PM -0700, Daniel, Dao Quang Minh wrote:

Without cgroup namespacing, i think the only possible way to do it
cleanly right now is to communicate to an external privileged
process that can manage cgroups for us ( like systemd if possible or
cgmanager ).

Can someone point me at more details on the trouble with cgroups? It
seems like the right approach is just “have the sysadmin make per-user
playgrounds for unprivileged cgroup manipulation” and that works for
me for the freezer cgroup 1. Is this something about the device
cgroup in particular? Potentially the hierarchy bubbling discussed in
§4 of 2?

ashahab-altiscale · 2015-10-22T22:03:44Z

@wking Yes this is a specific issue with the devices cgroup. Probably because of this: "In case one of the locally set whitelist entries would provide more access than the cgroup's parent, it'll be removed from the whitelist."
For all other cgroups, my approach works fine(I create the container's cgroup ahead of time, and make the contianer's root an admin of that cgroup).
One thing @vishh brought up at yesterday's Hackathon is that the sysadmin can just pass a fully created cgroup to runC, and in this case all runC should do is to add pids to the tasks file of the cgroups. I think that approach can work for me just fine and I would not need any changes to spec. Of course, runC's current approach to manipulating cgroups should change.

wking · 2015-10-22T22:19:15Z

On Thu, Oct 22, 2015 at 03:03:46PM -0700, Abin Shahab wrote:

@wking Yes this is a specific issue with the devices
cgroup. Probably because of this: "In case one of the locally set
whitelist entries would provide more access than the cgroup's
parent, it'll be removed from the whitelist."

Hmm, that looks like it's propagating down into children, which seems
like something that wouldn't need elevated permissions. I don't see a
fundamental problem with having an unprivileged user write additional
restrictions to a devices.deny group they have write-access to.

For all other cgroups, my approach works fine(I create the
container's cgroup ahead of time, and make the contianer's root an
admin of that cgroup). One thing @vishh brought up at yesterday's
Hackathon is that the sysadmin can just pass a fully created cgroup
to runC, and in this case all runC does is to add pids to the tasks
file of the cgroups. I think that approach can work for me just fine
and I would not need any changes to spec.

+1 to separating cgroups handling from the spec. It seems like it
splits of well from the namespace/fork/mount/pivot-root business, and
could be handled well in a separate layer. If we go that route for
unprivileged users, is there a reason to keep cgroup handling in this
spec at all? If we want to make a separate spec/toolset around cgroup
creation, cleanup, checkpointing, etc., it seems like that could be
independent of namespace dancing.

Or is the proposal just to split off device cgroups? Or is it just to
have unprivileged users not configure anything that would trigger
device cgroup access?

ashahab-altiscale · 2015-10-22T22:29:42Z

@wking Yes. I am with you on your first proposal. I'm not suggesting splitting off devices, nor do I really want special treatment for unprivileged users(despite my PR).

mrunalp · 2015-10-22T23:33:27Z

I think that we should research some ways for the runtime to determine whether it is in a user namespace
(could be hackish for now).

crosbymichael · 2015-10-22T23:35:41Z

@mrunalp yes that is my same feeling. This can be handled at the runtime level and I believe that we had this before that if you got an EPERM on mknod then we did the bind mount for adding devices.

mrunalp · 2015-10-22T23:37:02Z

@crosbymichael Yeah, that's the hack that I had in mind ;)

crosbymichael · 2015-10-22T23:37:59Z

@mrunalp i know you too well ;)

mrunalp · 2015-10-22T23:56:43Z

@crosbymichael 😀

ashahab-altiscale · 2015-10-23T00:01:59Z

@mrunalp @crosbymichael So I can push a PR in runc where mknod and writing devices cgroup ignores EPERM errors. Is that what you guys have in mind? Then we don't need this PR.

crosbymichael · 2015-11-02T20:21:16Z

Thanks for the pr in runc, I think we can close this one now.

ashahab-altiscale mentioned this pull request Oct 23, 2015

Bind mount device nodes on EPERM opencontainers/runc#357

Merged

crosbymichael closed this Nov 2, 2015

IsHostUnprivileged flag for host #228

IsHostUnprivileged flag for host #228

Uh oh!

Conversation

ashahab-altiscale commented Oct 21, 2015

Uh oh!

ashahab-altiscale commented Oct 21, 2015

Uh oh!

wking commented Oct 21, 2015

Uh oh!

ashahab-altiscale commented Oct 22, 2015

Uh oh!

wking commented Oct 22, 2015

Uh oh!

LK4D4 commented Oct 22, 2015

Uh oh!

ashahab-altiscale commented Oct 22, 2015

Uh oh!

wking commented Oct 22, 2015

Uh oh!

crosbymichael commented Oct 22, 2015

Uh oh!

ashahab-altiscale commented Oct 22, 2015

Uh oh!

crosbymichael commented Oct 22, 2015

Uh oh!

ashahab-altiscale commented Oct 22, 2015

Uh oh!

dqminh commented Oct 22, 2015

Uh oh!

wking commented Oct 22, 2015

Uh oh!

ashahab-altiscale commented Oct 22, 2015

Uh oh!

wking commented Oct 22, 2015

Uh oh!

ashahab-altiscale commented Oct 22, 2015

Uh oh!

mrunalp commented Oct 22, 2015

Uh oh!

crosbymichael commented Oct 22, 2015

Uh oh!

mrunalp commented Oct 22, 2015

Uh oh!

crosbymichael commented Oct 22, 2015

Uh oh!

mrunalp commented Oct 22, 2015

Uh oh!

ashahab-altiscale commented Oct 23, 2015

Uh oh!

crosbymichael commented Nov 2, 2015

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants