Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

os: crashes due to new error from FindProcess (CL 542699) #65866

Closed
stgraber opened this issue Feb 22, 2024 · 22 comments
Closed

os: crashes due to new error from FindProcess (CL 542699) #65866

stgraber opened this issue Feb 22, 2024 · 22 comments
Labels
NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Milestone

Comments

@stgraber
Copy link

Go version

go version devel go1.23-5d4e8f5 Thu Feb 22 01:57:32 2024 +0000 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/home/stgraber/.cache/go-build'
GOENV='/home/stgraber/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/home/stgraber/data/code/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/home/stgraber/data/code/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/home/stgraber/sdk/gotip'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/home/stgraber/sdk/gotip/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='devel go1.23-5d4e8f5 Thu Feb 22 01:57:32 2024 +0000'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-I/home/stgraber/data/code/go/deps/raft/include/ -I/home/stgraber/data/code/go/deps/cowsql/include/'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-L/home/stgraber/data/code/go/deps/raft/.libs -L/home/stgraber/data/code/go/deps/cowsql/.libs/'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build3052341145=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Following https://pkg.go.dev/os#FindProcess, I have logic which assumes that on Linux systems, os.FindProcess will always succeed and that the resulting record can then be used with Signal to validate its actual existence.

However it appears the recent switch to using PIDFD within Go may have regressed this?
@kolyshkin

Example code:

package main

import (
	"fmt"
	"os"
	"syscall"
)

func main() {
	// Go doc says this will always work on Linux.
	pr, err := os.FindProcess(12345678)
	if err != nil {
		fmt.Printf("Supposedly unreachable code reached\n")
		os.Exit(1)
	}

	// Check if process exist by sending signal 0.
	err = pr.Signal(syscall.Signal(0))
	if err != nil {
		fmt.Printf("Process doesn't exist\n")
		os.Exit(0)
	}

	fmt.Printf("Process exists\n")
}

What did you see happen?

stgraber@castiana:~$ go version
go version go1.21.5 linux/amd64
stgraber@castiana:~$ go run ./main.go 
Process doesn't exist
stgraber@castiana:~$ 

What did you expect to see?

stgraber@castiana:~$ gotip version
go version devel go1.23-5d4e8f5 Thu Feb 22 01:57:32 2024 +0000 linux/amd64
stgraber@castiana:~$ gotip run ./main.go 
Supposedly unreachable code reached
exit status 1
stgraber@castiana:~$ 
@prattmic
Copy link
Member

This behavior was intentionally changed in https://go.dev/cl/542699, merged just yesterday, under the belief that almost no code would depend on the old behavior. GODEBUG=osfinderr=0 was added to switch to the old behavior if a tiny number of programs do depend on the old behavior.

That said, receiving a report of breakage mere hours after merging the CL (thank you for proactively testing tip!) makes me suspect that dependence is more widespread than expected and that we should reconsider changing this behavior.

Could you share more about your original code that broke because of this change? Based on the issue title, I'm guessing the code did not check the error from FindProcess at all (since the docs said it would always succeed on Unix), and thus got a nil dereference when calling Process.Signal?

cc @ianlancetaylor @kolyshkin

@prattmic
Copy link
Member

If we do need to change behavior, I see a few options:

  1. Along with an error, also return a dummy "error" os.Process, which makes Process.Signal, etc unconditionally return an error.
  2. Same as (1), but do not return an error, only the dummy "error" os.Process.

(1) would catch users that did not do error checking on FindProcess, but still allow early return for those that do. However, it could still break users if their code behaves differently on early error. (e.g., there is some extra cleanup that they do when Process.Signal fails that they don't do (but should) when FindProcess fails). (2) would fix those cases as well.

Neither case will work if users are legitimately depending on the racy behavior. e.g., they call FindProcess(1234) at startup and then occasionally poll Process.Signal to determine when process 1234 exists. I can't think of legitimate uses for this, but if they exist then I think the only option would be to document FindProcess as explicitly racy and introduce a new non-racy API.

@stgraber would any/all of these alternatives work for your application?

@stgraber
Copy link
Author

stgraber commented Feb 22, 2024

Could you share more about your original code that broke because of this change? Based on the issue title, I'm guessing the code did not check the error from FindProcess at all (since the docs said it would always succeed on Unix), and thus got a nil dereference when calling Process.Signal?

That's correct, our code was calling FindProcess and then unconditionally call .Signal on the returned value. As this can now be nil, we ended up with a panic...

In our case I've simply sent a PR which now properly handles the err value of FindProcess so we'll be fine moving forward.

This unfortunately will not help anyone building any of our current stable releases with Go 1.23 as anyone doing that will be getting confusing failures.
(This isn't terribly likely in our case as we don't support our releases long enough for that to be an issue, but we may not be the only ones who followed that docstring and just never bothered even checking err for code that can only ever run on unix/linux).

@stgraber
Copy link
Author

For us, 1) would work, so long as the error we get from Signal for that dummy error process ends up being os.ErrProcessDone.

I think that'd be the best option as anyone doing proper error checking on FindProcess (as we do now) will get the early result and save themselves the call to Signal. Anyone else who took the old docstring to the letter will still get the old behavior when calling Signal and crucially, no panics :)

What I'd really like to avoid here is other folks with code similar to ours out there, where the user decides to rebuild the application with the latest Go release. They'd get no compile time error at all and would have a resulting binary that appears to work fine until something triggers the code path that hits FindProcess+Signal and panics the whole thing.
That would be rather tricky for folks to debug and depending on the codebase, could be exploited by a remote unprivileged user as a way to DoS the system.

@bcmills
Copy link
Contributor

bcmills commented Feb 22, 2024

I don't see a proposal associated with https://go.dev/cl/542699, and it seems like that sort of incompatible change really ought to have one. I suggest that we revert it and send the change through the proposal process to work out the compatibility implications in more detail.

@prattmic
Copy link
Member

#62654 was originally a proposal, but it was taken out of the proposal process because it wasn't considered "a significant functionality change". Perhaps we should have put it back when we decided it was big enough to add a GODEBUG option.

@kolyshkin
Copy link
Contributor

If we do need to change behavior, I see a few options:

  1. Along with an error, also return a dummy "error" os.Process, which makes Process.Signal, etc unconditionally return an error.
  2. Same as (1), but do not return an error, only the dummy "error" os.Process.

(1) would catch users that did not do error checking on FindProcess, but still allow early return for those that do. However, it could still break users if their code behaves differently on early error. (e.g., there is some extra cleanup that they do when Process.Signal fails that they don't do (but should) when FindProcess fails). (2) would fix those cases as well.

I think that (1) is the way to go; I'll work on that.

@kolyshkin
Copy link
Contributor

I think that (1) is the way to go; I'll work on that.

https://go-review.googlesource.com/c/go/+/566179

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/566179 mentions this issue: os: FindProcess must not return nil

@kolyshkin
Copy link
Contributor

@stgraber thanks for reporting that! I guess the scenario is quite common and it totally slipped my mind.

@kolyshkin
Copy link
Contributor

@prattmic here is one more alternative to 1. and 2. proposed above. Best expressed as a diff:

 func findProcess(pid int) (p *Process, err error) {
        h, err := pidfdFind(pid)
        if err == ErrProcessDone {
-               return nil, err
+               return newProcess(pid, unsetHandle), err
        }
        // Ignore all other errors from pidfdFind,

This way, for users who do not check for errors from os.FindProcess will see the exactly same behavior as before, and those who do may see ErrProcessDone error.

OTOH this is worse than your proposal 1 (implemented in https://go.dev/cl/566179) because it won't hint the users to start checking for errors.

@prattmic
Copy link
Member

Given that the best path forward for this issue isn't obvious and we have test failures as well (#65857), I think we should revert these CLs until we decide the correct approach (and fix the tests). I will send reverts.

@prattmic prattmic changed the title os: panic on Signal os: crashes due to new error from FindProcess (CL 542699) Feb 23, 2024
@prattmic prattmic added this to the Go1.23 milestone Feb 23, 2024
@prattmic prattmic added release-blocker NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels Feb 23, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/566476 mentions this issue: Revert "os: make FindProcess use pidfd on Linux"

@prattmic
Copy link
Member

prattmic commented Feb 27, 2024

I've run the pidfd CLs through the Google internal Go tests. I haven't fully analyzed all failures yet, but thus far I've seen failures in three categories:

  1. Missing error checks: code that ignores the error return from FindProcess entirely. runc contains an open source case of this.
  2. Different semantics when FindProcess returns an error rather than a later call. psutil contains an open source case of this. To be compatible with this change, the function here would need to return false, nil if err == os.ErrProcessDone, as it does after the Process.Signal call.
  3. Variety of "bad file descriptor" errors. These may be related to all: failures with EBADF on Linux-based platforms since 2024-02-21 #65857.

@prattmic
Copy link
Member

In a fun case of (2), one internal package actually constructs a new error containing the text "this shouldn't be possible on unix systems" when FindProcess returns an error. 😆

@prattmic
Copy link
Member

prattmic commented Feb 27, 2024

Aha, one more failure has manual construction of os.Process. e.g.,

p := os.Process{Pid: pid}
p.Kill()

I suspect this may be the cause of all of the "bad file descriptor" failures, as Kill, etc will use the zero value of handle, FD 0 as the "pidfd". Fixing this may allow rolling forward with https://go.dev/cl/528438.

@prattmic
Copy link
Member

I finished my analysis and didn't find more failures than those listed above.

From what I've seen, I believe option 2 in #65866 (comment) would be OK for all of the failing tests I saw. i.e., os.FindProcess never returns an error, but makes subsequent Kill, etc calls unconditionally error.

@kolyshkin
Copy link
Contributor

kolyshkin commented Mar 8, 2024

@prattmic thank you for your analysis. I think there are two separate issues here.

I. Closing fd 0

As for using fd 0 as pidfd value (causing test failures in #65857), this was caused by a stupid bug of mine (see here.

This is now fixed in CL 570036. Unfortunately, LUCI gotip-android apparently can't be tested.

Now, the only scenario of closing fd 0 I can think of is this:

p := os.Process{Pid: pid}
p.Release()

Alas, I can't think of a good way to protect from it (without adding something like atomic.Bool pidfdSet to struct Process, or treating 0 as an invalid fd in pidfdRelease, that is).

In all other cases, Process.Release is called as a finalizer which is set in newProcess. I have reviewed all callers of newProcess inCL 570036 to make sure Process.handle (pidfd) is properly set to either -1/unsetHandle or the real pidfd.

II. Backward-compatible os.FindProcess interface

It boils down to the question of what to do in FindProcess when pidfdFind returned ErrProcessDone. I think we have three choices here:

(1) Do something similar to what CL 542699 plus CL 566179 did, meaning:

  • in FindProcess, return an empty "uninitialized" Process structure (with Pid set to 0) and ErrProcessDone (or other similar) error;
  • make sure Signal and Wait return "os: process not initialized" error on such Process;
  • have the GODEBUG osfinderr=0 setting in place to be able to revert to old behavior.

This way, users can switch to the new behavior of FindProcess relatively painlessly.

(2) Do not ever return errors from FindProcess:

  • in FindProcess, return a "process gone" Process structure (with Pid set of e.g. -2) and no error;
  • make sure Signal and Wait return an error appropriate for when the process is not here (currently, Wait returns NewSyscallError("waitid", syscall.ECHILD) or NewSyscallError("wait", syscall.ECHILD), and Signal/Kill return os.ErrProcessDone);
  • do not add new GODEBUG setting.

In essense, these two are identical to your (1) and (2) from the earlier comment. Both approaches have their pros and cons.

@kolyshkin

This comment was marked as outdated.

@kolyshkin

This comment was marked as duplicate.

@kolyshkin

This comment was marked as outdated.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/570681 mentions this issue: os: make FindProcess use pidfd on Linux

@dmitshur dmitshur added NeedsFix The path to resolution is known, but the work has not been done. and removed NeedsDecision Feedback is required from experts, contributors, and/or the community before a change can be made. labels May 21, 2024
gopherbot pushed a commit that referenced this issue May 21, 2024
This is a continuation of CL 570036.

Amend FindProcess to use pidfdFind, and make it return a special
Process with Pid of pidDone (-2) if the process is not found.

Amend Wait and Signal to return ErrProcessDone if pid == pidDone.

The alternative to the above would be to make FindProcess return
ErrProcessDone, but this is unexpected and incompatible API change,
as discussed in #65866 and #51246.

For #62654.

Rework of CL 542699 (which got reverted in CL 566476).

Change-Id: Ifb4cd3ad1433152fd72ee685d0b85d20377f8723
Reviewed-on: https://go-review.googlesource.com/c/go/+/570681
TryBot-Bypass: Dmitri Shuralyov <dmitshur@golang.org>
Run-TryBot: Kirill Kolyshkin <kolyshkin@gmail.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsFix The path to resolution is known, but the work has not been done. release-blocker
Projects
None yet
Development

No branches or pull requests

6 participants