-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
syscall: add PidFD, CgroupFD, and UseCgroupFD options for Linux clone to SysProcAttr #51246
Comments
i think this kernel version is too new, the minimal kernel version is still 2.6.23。 |
This is not a proposal to switch to the new syscall, this is a proposal to make its functionality available to userspace. |
Just to add perspective, 5.3 came out in Sept 2019. |
Retitled. There are three parts to this:
All of this seems unobjectionable. |
This proposal has been added to the active column of the proposals project |
Is the intention for
This is a superset of the arguments to Many of these arguments cannot be safely set by Go code. e.g., setting As an alternative, we could add specific useful features enabled by this syscall. e.g.,
|
Definitely not. Indeed, this proposal can be reduced to adding PidFD and CgroupFD, and making use of clone3 CgroupFD is set (note that PidFD can be obtained via clone(2) already, so it's not clear if the code should switch to using clone3 in this case). I fully agree with your analysis, and should have proposed what you had. (Not really related to the proposal, but as far as the implementation goes, my biggest roadblock is obsoleted code for generating per-arch syscall numbers in src/syscall. The files that are supposed to be auto-generated are now changed manually so I am not exactly sure how to add a new syscall number.) |
Indeed, our system call code is a mess. We hope to clean it up (see #51087 and #15282), but right now it is a pain to change. |
Thanks. In some sense this is really two proposals, I'm not sure if we should split it up. Following the API we came to in #47049 (comment), I'll adjust the proposed API to adding the following to
|
Support for the cgroup FD seems fine and uncontroversial to me. The pid FD case is a bit more interesting. I certainly think it should be possible to get a pid FD. In fact, I think we should use a pid FD whenever possible. I wonder if instead of providing only a low-level Down a level, One possibility to merge the limitations above would be to add |
So, I played with this a bit, and it seems it's easier to have something like this: // PidFD is set to a PID file descriptor referring to the child process,
// if CLONE_PIDFD flag is set in Cloneflags. Available since Linux 5.2.
PidFD int The only thing about it is a slightly unusual way of returning a value. The gist of implementation, using existing clone() call, is this: @@ -213,12 +218,12 @@ func forkAndExecInChild1(argv0 *byte, argv, envv []*byte, chroot, dir *byte, att
runtime_BeforeFork()
locked = true
switch {
- case sys.Cloneflags&CLONE_NEWUSER == 0 && sys.Unshareflags&CLONE_NEWUSER == 0:
+ case sys.Cloneflags&(CLONE_NEWUSER|CLONE_PIDFD) == 0 && sys.Unshareflags&CLONE_NEWUSER == 0:
r1, err1 = rawVforkSyscall(SYS_CLONE, uintptr(SIGCHLD|CLONE_VFORK|CLONE_VM)|sys.Cloneflags)
case runtime.GOARCH == "s390x":
- r1, _, err1 = RawSyscall6(SYS_CLONE, 0, uintptr(SIGCHLD)|sys.Cloneflags, 0, 0, 0, 0)
+ r1, _, err1 = RawSyscall6(SYS_CLONE, 0, uintptr(SIGCHLD)|sys.Cloneflags, 0, uintptr(unsafe.Point
er(&sys.PidFD)), 0, 0)
default:
- r1, _, err1 = RawSyscall6(SYS_CLONE, uintptr(SIGCHLD)|sys.Cloneflags, 0, 0, 0, 0, 0)
+ r1, _, err1 = RawSyscall6(SYS_CLONE, uintptr(SIGCHLD)|sys.Cloneflags, 0, uintptr(unsafe.Pointer(
&sys.PidFD)), 0, 0, 0)
}
if err1 != 0 || r1 != 0 {
// If we're in the parent, we must return immediately We can certainly do it with a pointer, too, if that is preferred. The only problem is adding new constants; at this point, I guess, I'd rather patch all the needed files in place, since the generator is outright broken and fixing this is very out of scope for this. |
Currently |
Retitled per discussion above. The current proposal is #51246 (comment), to add PidFD, Cgroup, Cgroup. Certainly you would expect "Pid *int" to be a process ID, not a process ID file descriptor. Hence "PidFD *int". (In contrast, we have "Ctty int" but I think most people would expect a tty int to be a file descriptor, not a tty number. The question is whether Cgroup is more like Ctty or more like Pid.) |
As far as I know there are no standard integer IDs for cgroups. cgroups are primarily managed as directories within a Even proc files identify cgroups with their string path within the cgroupfs. e.g., the last field below,
The first field is the "hierarchy ID", which is a numeric description for the cgroup type:
I suppose that could be confusing, but again this describes a cgroup type, not a specific cgroup. Despite all that, in hindsight I think Additionally, as far as I know, "cgroup FDs" are not a common concept [1]. In fact, I'm not certain if there is anything you can do with this FD besides pass it to clone. Most cgroup operations are performed on files in the cgroup directory. e.g., an existing process is added to a cgroup by opening the [1] A "cgroup FD" in the context of clone is just open(O_RDONLY) of the cgroup directory. |
Oh, of course, the other obvious thing you could do is use this with openat (e.g., |
@prattmic what you look at is cgroup v1, which is a set of cgroups per controller, i.e. we have a forest. This is being superseded by cgroup v2, where we have a single unified cgroup directory for each process (IOW a tree, not a forest). Cgroup v2 is enabled by default in some modern distros, e.g. recent Fedora. For Ubuntu, you need a kernel options The cgroup parameter to clone3 only supports cgroup v2, and it's a file description to cgroup directory. You're right that there is not much use for cgroupfd outside of this clone3. One other use I can think of is openat2, but implementing openat2 is a separate unrelated topic. (Update: I missed your last comment, @prattmic, where you say the same about openat) |
It sounds like there is enough confusion about cgroups that calling it CgroupFD would help clarify matters. |
Retitled. Does anyone object to adding PidFD, CgroupFD, and UseCgroupFD to the Linux SysProcAttr? |
@kolyshkin, we might want to reserve access to the pidfd to Go itself, so that we have the ability to use it for signal delivery on new enough Linux systems. If we add this option to SysProcAttr, we are essentially giving away the pidfd and giving up the ability to use it in the Go runtime or os package in the future. We might not want to do that. What use for pidfd did you have? Perhaps we should figure out an API that would let you do what you need while the Go runtime or os package still owns the pidfd. (Or maybe if we need two in the future we could use Dup.) |
Implement CLONE_INTO_CGROUP feature, allowing to put a child in a specified cgroup in a clean and simple way. Note that the feature only works for cgroup v2, and requires Linux kernel 5.7 or newer. Using the feature requires a new syscall, clone3. Currently this is the only reason to use clone3, but the code is structured in a way so that other cases may be easily added in the future. Add a test case. While at it, try to simplify the syscall calling code in forkAndExecInChild1, which became complicated over time because: 1. It was using either rawVforkSyscall or RawSyscall6 depending on whether CLONE_NEWUSER was set. 2. On Linux/s390, the first two arguments to clone(2) system call are swapped (which deserved a mention in Linux ABI hall of shame). It was worked around in rawVforkSyscall on s390, but had to be implemented via a switch/case when using RawSyscall6, making the code less clear. Let's - modify rawVforkSyscall to have two arguments (which is also required for clone3); - remove the arguments workaround from s390 asm, instead implementing arguments swap in the caller (which still looks ugly but at least it's done once and is clearly documented now); - use rawVforkSyscall for all cases (since it is essentially similar to RawSyscall6, except for having less parameters, not returning r2, and saving/restoring the return address before/after syscall on 386 and amd64). Updates #51246. Change-Id: Ifcd418ebead9257177338ffbcccd0bdecb94474e Reviewed-on: https://go-review.googlesource.com/c/go/+/417695 Auto-Submit: Ian Lance Taylor <iant@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Ian Lance Taylor <iant@google.com> Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Can we backport this feature to go1.19? Because linux5.3 has been released in 2019 and go1.20 is too new for us. thanks |
I'm not a Go maintainer, but I do maintain a bunch of Go packages. From that perspective, I can suggest you to
|
The constants for these were auto-generated from the C includes into zerrors_linux* files quite some time ago. The generator is currently broken, but some new flags need to be added nevertheless. As the flags won't change and the values are the same for all architectures, we can just define them statically (as it's already done in the runtime package): - remove the CLONE_* constants from zerrors_linux_*.go; - patch mkerrors.sh to not generate CLONE_ constants (in case it will be fixed and used in the future); - add the constants and some comments about them to exec_linux.go, using Linux v5.17 include/uapi/sched.h as the ultimate source. This adds the following new flags: - CLONE_CLEAR_SIGHAND - CLONE_INTO_CGROUP - CLONE_NEWCGROUP - CLONE_NEWTIME - CLONE_PIDFD For golang#51246. Change-Id: I0c635723926218bd403d37e113ee4d62194463a8 Reviewed-on: https://go-review.googlesource.com/c/go/+/407574 Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Ian Lance Taylor <iant@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Joedian Reid <joedian@golang.org> Auto-Submit: Ian Lance Taylor <iant@google.com>
Implement CLONE_INTO_CGROUP feature, allowing to put a child in a specified cgroup in a clean and simple way. Note that the feature only works for cgroup v2, and requires Linux kernel 5.7 or newer. Using the feature requires a new syscall, clone3. Currently this is the only reason to use clone3, but the code is structured in a way so that other cases may be easily added in the future. Add a test case. While at it, try to simplify the syscall calling code in forkAndExecInChild1, which became complicated over time because: 1. It was using either rawVforkSyscall or RawSyscall6 depending on whether CLONE_NEWUSER was set. 2. On Linux/s390, the first two arguments to clone(2) system call are swapped (which deserved a mention in Linux ABI hall of shame). It was worked around in rawVforkSyscall on s390, but had to be implemented via a switch/case when using RawSyscall6, making the code less clear. Let's - modify rawVforkSyscall to have two arguments (which is also required for clone3); - remove the arguments workaround from s390 asm, instead implementing arguments swap in the caller (which still looks ugly but at least it's done once and is clearly documented now); - use rawVforkSyscall for all cases (since it is essentially similar to RawSyscall6, except for having less parameters, not returning r2, and saving/restoring the return address before/after syscall on 386 and amd64). Updates golang#51246. Change-Id: Ifcd418ebead9257177338ffbcccd0bdecb94474e Reviewed-on: https://go-review.googlesource.com/c/go/+/417695 Auto-Submit: Ian Lance Taylor <iant@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com> Run-TryBot: Ian Lance Taylor <iant@google.com> Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com> TryBot-Result: Gopher Robot <gobot@golang.org>
@zhuangqh Sorry, we don't backport new features like this to older Go releases. See https://go.dev/wiki/MinorReleases for our backport policy. |
Change https://go.dev/cl/520266 mentions this issue: |
Add PidFD support, so that if the PidFD pointer in SysProcAttr is not nil, ForkExec (and thus all its users) obtains a pidfd from the kernel during clone(), and writes the result (or -1, if the functionality is not supported by the kernel) into *PidFD. The functionality to get pidfd is implemented for both clone3 and clone. For the latter, an extra argument to rawVforkSyscall is needed, thus the change in asm files. Add a trivial test case checking the obtained pidfd can be used to send a signal to a process, using pidfd_send_signal. To test clone3 code path, add a flag available to tests only. Updates #51246. Change-Id: I2212b69e1a657163c31b4a6245b076bc495777a3 Reviewed-on: https://go-review.googlesource.com/c/go/+/520266 Auto-Submit: Ian Lance Taylor <iant@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
Change https://go.dev/cl/528436 mentions this issue: |
Change https://go.dev/cl/528439 mentions this issue: |
Change https://go.dev/cl/528798 mentions this issue: |
@prattmic Opened https://go.dev/issue/62654 based on this, and wrote a partial implementation in https://go-review.googlesource.com/c/go/+/528438 and https://go-review.googlesource.com/c/go/+/528798/2?usp=related-change |
Change https://go.dev/cl/542698 mentions this issue: |
While working on CL 528798, I found out that sys.PidFD field (added in CL 520266) is not filled in when CLONE_NEWUSER is used. This happens because the code assumed that the parent and the child run in the same memory space. This assumption is right only when CLONE_VM is used for clone syscall, and the code only sets CLONE_VM when CLONE_NEWUSER is not used. Fix this, and add a test case (which fails before the fix). Updates #51246. Change-Id: I805203c1369cadd63d769568b132a9ffd92cc184 Reviewed-on: https://go-review.googlesource.com/c/go/+/542698 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Auto-Submit: Michael Pratt <mpratt@google.com>
Change https://go.dev/cl/544318 mentions this issue: |
For #51246. Change-Id: Ief2e2e14f039123a6580cb60be7ee74f4a20a649 Reviewed-on: https://go-review.googlesource.com/c/go/+/544318 TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Bryan Mills <bcmills@google.com> Run-TryBot: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
CL 520266 added pidfd_send_signal linux syscall numbers to the syscall package for the sake of a unit test. As pidfd_send_signal will be used from the os package, let's revert the changes to syscall package, add the pidfd_send_signal syscall numbers and the implementation to internal/syscall/unix, and change the above test to use it. Updates #51246. For #62654. Change-Id: I862174c3c1a64baf1080792bdb3a1c1d1b417bb4 Reviewed-on: https://go-review.googlesource.com/c/go/+/528436 Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Michael Pratt <mpratt@google.com>
Change https://go.dev/cl/570681 mentions this issue: |
I think this could be closed now since the proposal is fully implemented as of Go 1.22. The continuation of pidfd use from golang runtime discussion happens in #62654. |
This is a continuation of CL 570036. Amend FindProcess to use pidfdFind, and make it return a special Process with Pid of pidDone (-2) if the process is not found. Amend Wait and Signal to return ErrProcessDone if pid == pidDone. The alternative to the above would be to make FindProcess return ErrProcessDone, but this is unexpected and incompatible API change, as discussed in #65866 and #51246. For #62654. Rework of CL 542699 (which got reverted in CL 566476). Change-Id: Ifb4cd3ad1433152fd72ee685d0b85d20377f8723 Reviewed-on: https://go-review.googlesource.com/c/go/+/570681 TryBot-Bypass: Dmitri Shuralyov <dmitshur@golang.org> Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Linux 5.3 added a new system call,
clone3()
, which provides a superset of the functionality of the olderclone()
. In particular, it adds a way for a child to be spawned in a different cgroup, which solves a number of issues (seeCLONE_INTO_CGROUP
flag in clone3(2) man page for details).It would be great for some software written in Go to benefit from this new functionality. Obviously, calling
clone3()
syscall directly won't work since Go runtime needs to perform specific steps around fork and exec.So, the support for clone3 needs to be in the syscall package, where
clone()
is called.Looks like this can be implemented by amending linux
SysProcAttr
structure with a pointer toCloneArgs
(a new structure that mirrors that ofC.clone_args
). Then,forkAndExecInChild
would useclone3
instead ofclone
if this pointer is non-nil. This can be used from e.g.os/exec
can setcmd.SysProcArgs.CloneArgs
.The text was updated successfully, but these errors were encountered: