Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: Wait can wait on the wrong process #67642

Open
prattmic opened this issue May 24, 2024 · 2 comments
Open

os: Wait can wait on the wrong process #67642

prattmic opened this issue May 24, 2024 · 2 comments
Assignees
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@prattmic
Copy link
Member

prattmic commented May 24, 2024

In #13987, we fixed the race between Wait and Signal where Wait frees the PID in the OS, which can then reuse it before Signal sends a signal to what is now a different process.

The same race applies to Wait + Wait (ignoring pidfd): https://cs.opensource.google/go/go/+/master:src/os/exec_unix.go;l=41-54;drc=dbe2e757bb55f80de1a622da6bd5060e979208d1

In two concurrent calls to Wait, both will see the process as waitable in blockUntilWaitable. Both then proceed unconditionally to the wait syscall. One of these calls will win and get the wait status. The other, if slow and the OS has reused the PID, will wait on a different process with the same PID.

I believe this race would be fixed by replacing the unconditional setDone with a compare and swap that bails out if isdone is already set.

cc @ianlancetaylor @golang/runtime

@prattmic prattmic added this to the Go1.23 milestone May 24, 2024
@prattmic prattmic self-assigned this May 24, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/588675 mentions this issue: os: overhaul handling of PID vs pidfd within Process

@prattmic
Copy link
Member Author

I believe this race would be fixed by replacing the unconditional setDone with a compare and swap that bails out if isdone is already set.

Upon further thought, I don't think this is sufficient. We could protect the second wait from waiting on the wrong PID, but the blockUntilWaitable wait could still wait on the wrong PID.

@prattmic prattmic modified the milestones: Go1.23, Backlog May 28, 2024
gopherbot pushed a commit that referenced this issue Jun 10, 2024
There are several issues with pidfd handling today:

* The zero value of a Process makes the handle field appear valid, so
  methods attempt to use it as a pidfd rather than falling back to the
  PID as they should (#67634).

* If a process doesn't exist, FindProcess returns a Process with Pid ==
  -2, which is not a compatible change (#67640).

* pidfd close is racy as-is. A Release call or successful Wait will
  clear the handle field and close the pidfd. However, a concurrent call
  may have already loaded the handle field and could then proceed to use
  the closed FD (which could have been reopened as a different pidfd,
  targeting a different process) (#67641).

This CL performs multiple structural changes to the internals of
Process.

First and foremost, each method is refactored to clearly select either
pidfd or raw pid mode. Previously, raw pid mode was structured as a
fallback when pidfd mode is unavailable. This works fine, but it does
not make it clear that a given Process object either always uses pidfd
or always uses raw pid. Since each mode needs to handle different race
conditions, it helps to make it clear that we can't switch between modes
within a single Process object.

Second, pidfd close safety is handled by reference counting uses of the
FD. The last user of the FD will close the FD. For example, this means
that with concurrent Release and Signal, the Signal call may be the one
to close the FD. This is the bulk of this CL, though I find the end
result makes the overall implementation easier to reason about.

Third, the PID path handles a similar race condtion between Wait and
Kill: Wait frees the PID value in the kernel, which could be reallocated
causing Kill to target the wrong process. This is handled with a done
flag and a mutex. The done flag now shares the same state field used for
the handle.

Similarly, the Windows implementation reuses all of the handle reference
counting that Linux uses. This means the implementations more
consistent, and make Windows safe against the same handle reuse
problems. (Though I am unsure if Windows ever reuses handles).

Wait has a slight behavior change on Windows: previously Wait after
Release or an earlier Wait would hang indefinitely (WaitForSingleObject
on syscall.InvalidHandle waits indefinitely). Now it returns the same
errors as Linux (EINVAL and ErrProcessDone, respectively).

Similarly, Release on Windows no longer returns close errors, as it may
not actually be the place where the close occurs.

Fixes #67634.
Fixes #67640.
Fixes #67641.
Updates #67642.

Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest,gotip-windows-amd64-longtest
Change-Id: I2ad998f7b67d32031e6f870e8533dbd55d3c3d10
Reviewed-on: https://go-review.googlesource.com/c/go/+/588675
Reviewed-by: Austin Clements <austin@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@seankhliao seankhliao added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jul 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
None yet
Development

No branches or pull requests

3 participants