-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support for ID map mounts without userns #3943
Conversation
6306e71
to
8c3234a
Compare
Commit fda12ab ("Support idmap mounts on volumes") introduces support for idmap mount on volumes. At this time, it required userns to be used and idmap mount must have the same UID/GID mappings than userns. This commit removes this requirement, so it is possible to use idmap mount without userns. You can, of course, use id map mount with userns, but the UID/GID mappings can be different. Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
Signed-off-by: Francis Laniel <flaniel@linux.microsoft.com>
8c3234a
to
a7afcd3
Compare
I added some tests locally particularly when several UID/GID mappings are used and the whole thing hang. |
cc @rata |
Could you update: Lines 21 to 25 in b4f3891
|
int gidmap_len = strlen(gidmap_src); | ||
|
||
/* Update child mappings from the parent. */ | ||
update_uidmap(config->uidmappath, child_pid, uidmap_src, uidmap_len); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because we don't check the conflict of idmap items when len(UIDMappings) > 1
, so will get an error in update_uidmap
when the uid map data is invalid, but it uses bail
to print the error and call exit
, so it will not kill child_pid at that time and will got stuck. So we should make two changes:
- validate there is a conflict of idmap items or not;
- don't use bail to print error msg and return an error in
update_uidmap
. (runc/libcontainer/nsenter/nsexec.c
Lines 288 to 301 in a7afcd3
static void update_uidmap(const char *path, int pid, char *map, size_t map_len) { if (map == NULL || map_len == 0) return; write_log(DEBUG, "update /proc/%d/uid_map to '%s'", pid, map); if (write_file(map, map_len, "/proc/%d/uid_map", pid) < 0) { if (errno != EPERM) bail("failed to update /proc/%d/uid_map", pid); write_log(DEBUG, "update /proc/%d/uid_map got -EPERM (trying %s)", pid, path); if (try_mapping_tool(path, pid, map, map_len)) bail("failed to use newuid map on %d", pid); } }
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are the same problems in update_setgroups
and update_gidmap
.
So, when we clone a new process, we should double check whether there is a bail
or not when we call a function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
I will wait a bit to see the results of the below discussion, but if we decide to go with the current solution I will for sure add a commit to change update_*()
to return an error code and bail in the caller.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed this as well. I'm playing around with using PR_SET_PDEATHSIG
to avoid us having to remember to kill children...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, NACK. I'm working on my own implementation of this and I don't want us to add any more code to nsexec.c
for this feature. We have two other options for how to implement this:
- Use
syscall.ForkExec
to completely safely spawn a subprocess with the right mappings, then use the new mount API entirely in Go for the setup and pass the fd torootfs_linux.go
. - Spawn a child (slightly unsafely) with CGO. While this is not recommended for thread-safety reasons, if the child does nothing other than
kill(getpid(), SIGSTOP)
there should be no async signal-safety issues. This would be faster than option (1).
And while we're at it, we can implement the mount sources logic using the new mount API to avoid needing that code in nsexec.c
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, NACK. I'm working on my own implementation of this and I don't want us to add any more code to
nsexec.c
for this feature. We have two other options for how to implement this:
Can you share some source code regarding your implementation?
1. Use `syscall.ForkExec` to completely safely spawn a subprocess with the right mappings, then use the new mount API entirely in Go for the setup and pass the fd to `rootfs_linux.go`. 2. Spawn a child (slightly unsafely) with CGO. While this is not recommended for thread-safety reasons, if the child does nothing other than `kill(getpid(), SIGSTOP)` there should be no async signal-safety issues. This would be faster than option (1).
I do not think any of them will work, as we need to ensure the child is in another user namespace to be able to effectively set the UID/GID mappings and then doing do the ID map mount with the children user namespace.
To do so, using clone()
seems to be the only solution we have.
And while we're at it, we can implement the mount sources logic using the new mount API to avoid needing that code in
nsexec.c
as well.
More generally, I do not understand your comment about adding code in nsexec.c
.
The first PR of this topic was done by adding code in this file and no one complained about it as, as I pointed above, there are no other solution.
I'll post a PR once I have it working. I'm also doing some cleanups of
Both options I listed would result in a child process in a new user namespace.
The first version of this PR was already implemented and was piggy-backing on the mountfd code that already existed, so rejecting it for that reason would've been unreasonable. At the time I thought there wasn't another way of doing the mountfd handling, but I'd forgotten that If you'd prefer to have this code reviewed, merged, and then later deleted in a rework, we can do it that way too -- I just felt it'd be better to let you know that I am working on a rework of these features so that you don't waste your time working on code that will be removed. |
No problem, I share your mind regarding coding the most of thing in Golang would be better.
OK, I think I lack some background there.
Do you have any ETA for your work? |
Oh, cool. While you are at it, maybe we can avoid the fd passing mechanism (if we open the fds before forking/clone/whatever, maybe we can just keep the fd table)? That is something that was on my list since I saw the code, but I'm not sure it will be simple to remove it, crun has it too...
From a consumer POV (we did this for Kubernetes), we don't need to lift this limitation. So, I'm ok if you plan to work on this other implementation and lift the limitation there. Can you please cc me in the PR? :) |
Agree with @cyphar -- if we can do it in Go, we should do it in Go. Overall I very much hope we'll eventually be able to do all of it in Go. For example, with cgroupfd support in the kernel (since v5.7) and golang stdlib (since 1.20), we can enter cgroups way easier. |
@rata We can avoid the fd passing mechanism, but I'm a little bit concerned about @kolyshkin Never say never, and I would love to remove all the C code from our codebase, but I'm not sure if even on newer kernels and with the newest stdlib we will be able to do that (at the very least I don't think Go has handling for @eiffel-fl I should have a PoC together this week. I'm doing some other cleanups at the same time, so it might take me a little longer... |
OK, just ping me in the PR once you open it, so I can take a look (but no emergency, take the time you need to polish everything). |
#3953 has a working implementation of this, along with several other cleanups to nsexec. There is a single test failure related to criu (which I can't reproduce locally) that I'm debugging, but the code clearly works. |
I will take a look at it this week! |
Closing as #3953 supersedes it. |
Reopening to track this properly. This will be fixed by #3985. |
Ah, this is a PR. Oops. |
Hi.
This PR improves the initial support for ID map mounts which was merged in #3717.
Now, it is possible to use ID map mounts without the need of having a user namespace.
Regarding the design, I added two new attribute to send the UID and GID mappings corresponding to the mount sources.
I would like to have your opinion on this.
Best regards and thank you in advance.