-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
os: on unix Process.Kill() can kill the wrong process #13987
Comments
The spwanAndKill func is racy. The cmd.Process.Kill()
races with cmd.Wait. I think this is working as intended,
there is no way for Go to prevent this race (say, we make
Kill check if the waitpid has returned, but that is still racy,
as waitpid could finish anytime after the check and before
the kill syscall). It's the programmer's responsbility to
eliminate the race.
If you want to kill the process after a timeout, just wait
for the timeout, kill it unconditionally, and then Wait.
|
@minux On Windows the above code is not racy and works correctly due to fixes that went into #9382. On Linux I can also implement the same semantics correctly via waiting for SIGCHLD and using non-blocking wait4 syscall. That allows to synchronize with kill attempts. However, this is non-portable. What I would like to see is an option to write a portable Go code implementing the kill-if-not-finished-after-timeout pattern. If this requires new API, so be it, but at least document loudly that calling Process.Kill and Process.Wait from different threads at the same time could race and should be avoided in portable code. |
There is a portable way to implement the functionality.
Just wait for the timeout, kill the process unconditionally
and then wait.
The suggested workaround for unix is too complicated
(not to mention it needs one goroutine to pull with non-
blocking wait4) when there is easy high-level workaround.
|
I think we can do it without a race, and without an extra goroutine, if we add something to the runtime package to count SIGCHLD signals, and provide a way for a goroutine to retrieve the current SIGCHLD count and a way to sleep until the SIGCHLD counter is greater than N. |
@minux The point is precisely not to wait to timeout as typically the process finishes much faster. In my case the timeout is really for runaway cases when something bad happens or the system is too busy. I also cannot just start another go routine that does kill/wait as I need to get the process return status to decide if everything is OK. @ianlancetaylor Note I do not suggest to change the current implementation. Rather I suggest to document loudly that one should avoid Wait()/Kill() race in portable code. Then add some new API that allows to implement the desired functionality. For example, a version of exec.Cmd.Wait that takes a channel and waits until the child finishes or there is input on the channel is enough to implement all my use cases. |
If we can change the existing API so that it avoids a race, I think that is clearly better than introducing a new API. |
I still think the user code should be able to avoid this race For example, something like this should be pretty safe func spawnAndKill(exePath string, counter int) error { |
The above code just minimizes the chances of race. It is still there. As it is impossible to make its probability arbitrary small, one should avoid it, say, in production code. |
CL https://golang.org/cl/23967 mentions this issue. |
On systems that support the POSIX.1-2008 waitid function, we can use it to block until a wait will succeed. This avoids a possible race condition: if a program calls p.Kill/p.Signal and p.Wait from two different goroutines, then it is possible for the wait to complete just before the signal is sent. In that case, it is possible that the system will start a new process using the same PID between the wait and the signal, causing the signal to be sent to the wrong process. The Process.isdone field attempts to avoid that race, but there is a small gap of time between when wait returns and isdone is set when the race can occur. This CL avoids that race by using waitid to wait until the process has exited without actually collecting the PID. Then it sets isdone, then waits for any active signals to complete, and only then collects the PID. No test because any plausible test would require starting enough processes to recycle all the process IDs. Update #13987. Update #16028. Change-Id: Id2939431991d3b355dfb22f08793585fc0568ce8 Reviewed-on: https://go-review.googlesource.com/23967 Run-TryBot: Ian Lance Taylor <iant@golang.org> Reviewed-by: Austin Clements <austin@google.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
I believe this race is now fixed on GNU/Linux. Leaving the issue open as the fix will have to be implemented on other Unix systems. Systems that provide |
CL https://golang.org/cl/24021 mentions this issue. |
CL https://golang.org/cl/24020 mentions this issue. |
This change is a followup to https://go-review.googlesource.com/23967 for Darwin. Updates #13987. Updates #16028. Change-Id: Ib1fb9f957fafd0f91da6fceea56620e29ad82b00 Reviewed-on: https://go-review.googlesource.com/24020 Reviewed-by: Ian Lance Taylor <iant@golang.org>
This change is a followup to https://go-review.googlesource.com/23967 for FreeBSD. Updates #13987. Updates #16028. Change-Id: I0f0737372fce6df89d090fe9847305749b79eb4c Reviewed-on: https://go-review.googlesource.com/24021 Reviewed-by: Ian Lance Taylor <iant@golang.org>
Follow CL 23967 and CL 24021 which did the same on linux and freebsd, respectively. Updates golang#13987 Updates golang#16028 Change-Id: Ia30ef8b5cffd8f9eb75c29ee5fe350dac2be6d44
Change https://golang.org/cl/315279 mentions this issue: |
Change https://golang.org/cl/315281 mentions this issue: |
Follow CL 23967 and CL 24021 which did the same on linux and freebsd, respectively. Updates #13987 Updates #16028 Change-Id: Ia30ef8b5cffd8f9eb75c29ee5fe350dac2be6d44 Reviewed-on: https://go-review.googlesource.com/c/go/+/315279 Trust: Tobias Klauser <tobias.klauser@gmail.com> Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>
Follow CL 23967 and CL 24021 which did the same on linux and freebsd, respectively. Updates #13987 Updates #16028 Change-Id: I95b13d8ddde4cea1ef4fb7d655f1ad1a219d13aa Reviewed-on: https://go-review.googlesource.com/c/go/+/315281 Trust: Tobias Klauser <tobias.klauser@gmail.com> Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> Reviewed-by: Ian Lance Taylor <iant@golang.org>
Change https://golang.org/cl/323489 mentions this issue: |
For #13987 For #16028 For #44513 Change-Id: I7a73446fcc80a01fa6de24eec1e5b993e543be37 Reviewed-on: https://go-review.googlesource.com/c/go/+/323489 Trust: Ian Lance Taylor <iant@golang.org> Reviewed-by: Emmanuel Odeke <emmanuel@orijtech.com>
Change https://golang.org/cl/354249 mentions this issue: |
CL 315281 changed the os package use wait6 on netbsd. This seems to be causing frequent test failures as reported in #48789. Revert that change using wait6 on netbsd for now. Updates #13987 Updates #16028 For #48789 Change-Id: Ieddffc65611c7f449971eaa8ed6f4299a5f742c2 Reviewed-on: https://go-review.googlesource.com/c/go/+/354249 Trust: Tobias Klauser <tobias.klauser@gmail.com> Trust: Bryan C. Mills <bcmills@google.com> Trust: Benny Siegert <bsiegert@gmail.com> Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> TryBot-Result: Go Bot <gobot@golang.org> Reviewed-by: Bryan C. Mills <bcmills@google.com> Reviewed-by: Benny Siegert <bsiegert@gmail.com> Reviewed-by: Ian Lance Taylor <iant@golang.org>
Change https://go.dev/cl/431855 mentions this issue: |
Resend of CL 315281 which was partially reverted by CL 354249 after the original CL was suspected to cause test failures as reported in #48789. It seems that both wait4 and wait6 lead to that particular deadlock, so let's use wait6. That way we at least don't hit #13987 on netbsd. Updates #13987 For #48789 For #50138 Change-Id: Iadc4a771217b7e9e821502e89afa07036e0dcb6f Reviewed-on: https://go-review.googlesource.com/c/go/+/431855 Reviewed-by: Benny Siegert <bsiegert@gmail.com> Auto-Submit: Tobias Klauser <tobias.klauser@gmail.com> Run-TryBot: Tobias Klauser <tobias.klauser@gmail.com> Reviewed-by: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Change https://go.dev/cl/442478 mentions this issue: |
Change https://go.dev/cl/442575 mentions this issue: |
Dragonfly and FreeBSD both used numerical values for these constants chosen to be the same as on Solaris. For some reason, NetBSD did not, and happens to interpret value 0 as P_ALL instead of P_PID (see https://github.com/NetBSD/src/blob/3323ceb7822f98b3d2693aa26fd55c4ded6d8ba4/sys/sys/idtype.h#L43-L44). Using the correct value for P_PID should cause wait6 to wait for the correct process, which may help to avoid the deadlocks reported in For #50138. Updates #13987. Change-Id: I0eacd1faee4a430d431fe48f9ccf837f49c42f39 Reviewed-on: https://go-review.googlesource.com/c/go/+/442478 Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Benny Siegert <bsiegert@gmail.com> Run-TryBot: Bryan Mills <bcmills@google.com>
There are getting to be enough special cases in this wrapper that the increase in clarity from having a single file is starting to be outweighed by the complexity from chained conditionals. Updates #50138. Updates #13987. Change-Id: If4f1be19c0344e249aa6092507c28363ca6c8438 Reviewed-on: https://go-review.googlesource.com/c/go/+/442575 Run-TryBot: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com>
Dragonfly and FreeBSD both used numerical values for these constants chosen to be the same as on Solaris. For some reason, NetBSD did not, and happens to interpret value 0 as P_ALL instead of P_PID (see https://github.com/NetBSD/src/blob/3323ceb7822f98b3d2693aa26fd55c4ded6d8ba4/sys/sys/idtype.h#L43-L44). Using the correct value for P_PID should cause wait6 to wait for the correct process, which may help to avoid the deadlocks reported in For golang#50138. Updates golang#13987. Change-Id: I0eacd1faee4a430d431fe48f9ccf837f49c42f39 Reviewed-on: https://go-review.googlesource.com/c/go/+/442478 Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Benny Siegert <bsiegert@gmail.com> Run-TryBot: Bryan Mills <bcmills@google.com>
There are getting to be enough special cases in this wrapper that the increase in clarity from having a single file is starting to be outweighed by the complexity from chained conditionals. Updates golang#50138. Updates golang#13987. Change-Id: If4f1be19c0344e249aa6092507c28363ca6c8438 Reviewed-on: https://go-review.googlesource.com/c/go/+/442575 Run-TryBot: Bryan Mills <bcmills@google.com> TryBot-Result: Gopher Robot <gobot@golang.org> Auto-Submit: Bryan Mills <bcmills@google.com> Reviewed-by: Ian Lance Taylor <iant@google.com>
Change https://go.dev/cl/528439 mentions this issue: |
Change https://go.dev/cl/528798 mentions this issue: |
Change https://go.dev/cl/528438 mentions this issue: |
Use Process.handle field to store pidfd, and make use of it. Only use pidfd functionality if all the needed syscalls are available. 1. StartProcess: obtain the pidfd from the kernel, if available, using the functionality added by CL 520266. Note we could not modify syscall.StartProcess to return pidfd directly because it is a public API and its callers do not expect it, so we have to use ensurePidfd and getPidfd. 2. (*Process).Kill: use pidfdSendSignal, if the syscall is available and pidfd is known. This is slightly more complicated than it should be, since the syscall can be blocked by e.g. seccomp security policy, therefore the need for a function to check if it's actually working, and a soft fallback to kill. Perhaps this precaution is not really needed. 3. (*Process).Wait: use pidfdWait, if available, otherwise fall back to using waitid/wait4. This is also more complicated than expected due to struct siginfo_t idiosyncrasy. NOTE pidfdSendSignal and pidfdWait are used without a race workaround (blockUntilWaitable and sigMu, added by CL 23967) because with pidfd, PID recycle issue doesn't exist (IOW, pidfd, unlike PID, is guaranteed to refer to one particular process) and thus the race doesn't exist either. For #62654. Updates #13987. Change-Id: I22ebcc7142b16a3a94c422d2f32504d1a80e8a8f Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com> Reviewed-on: https://go-review.googlesource.com/c/go/+/528438 TryBot-Result: Gopher Robot <gobot@golang.org> Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> Reviewed-by: Ian Lance Taylor <iant@golang.org> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Change https://go.dev/cl/570036 mentions this issue: |
Use Process.handle field to store pidfd, and make use of it. Only use pidfd functionality if all the needed syscalls are available. 1. Add/use pidfdWorks, which checks that all needed pidfd-related functionality works. 2. os.StartProcess: obtain the pidfd from the kernel, if possible, using the functionality added by CL 520266. Note we could not modify syscall.StartProcess to return pidfd directly because it is a public API and its callers do not expect it, so we have to use ensurePidfd and getPidfd. 3. (*Process).Kill: use pidfdSendSignal, if available and the pidfd is known. Otherwise, fall back to the old implementation. 4. (*Process).Wait: use pidfdWait, if available, otherwise fall back to using waitid/wait4. This is more complicated than expected due to struct siginfo_t idiosyncrasy. NOTE pidfdSendSignal and pidfdWait are used without a race workaround (blockUntilWaitable and sigMu, added by CL 23967) because with pidfd, PID recycle issue doesn't exist (IOW, pidfd, unlike PID, is guaranteed to refer to one particular process) and thus the race doesn't exist either. Rework of CL 528438 (reverted in CL 566477 because of #65857). For #62654. Updates #13987. Change-Id: If5ef8920bd8619dc428b6282ffe4fb8c258ca224 Reviewed-on: https://go-review.googlesource.com/c/go/+/570036 TryBot-Result: Gopher Robot <gobot@golang.org> Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Cherry Mui <cherryyz@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Reviewed-by: Michael Pratt <mpratt@google.com>
This is somewhat similar to the issue #9382. On Unix Process.Kill() just calls Process.Signal(Kill). As https://golang.org/src/os/exec_unix.go#L39 indicates, the Signal function invokes syscall.Kill(p.Pid) outside any lock after checking if the process still runs. Thus at the point when the signal is called the process could finish and Process.Wait() from another thread return. Thus OS is free to reuse the pid for another unrelated process. If this happens, Process.Kill() kills that process.
Due to this race it is impossible to write a correct platform-independent code in Go that kills the process if it does not terminate after a pause. I.e. the code fragment from #9382 is not correct on Unix:
The text was updated successfully, but these errors were encountered: