-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime/debug: add SetCrashOutput(file *os.File) #42888
Comments
As a data point, the gVisor project wants to redirect panic/throw output. It works around this problem by dup'ing over stderr with the FD of the desired crash output location (https://cs.opensource.google/gvisor/gvisor/+/master:runsc/cli/main.go;l=194-198?q=dup3&ss=gvisor%2Fgvisor). stderr is otherwise unused by the program, so this works OK. This is a workable solution, but it would certainly be nicer to be able to change the destination directly. |
Thinking outloud, the runtime prints other output besides just crashes, such as what one can enable via GODEBUG. Should that be included in this option? |
I'd think (initially, at least), yes. I'd probably call this something like On the other hand, there are nice opportunities if a separate "fatal" FD could be provided that only includes fatal panic/throw output, plus any related prints immediately before a throw. That would allow a nice simplification of "any bytes received on this FD indicates a crash". But it means extra complexity for a rather edge case [1]. [1] I can't think of any non-fatal runtime output that isn't behind a GODEBUG option. |
I would say, yes, to including other runtime output as well. If there's a need for additional separation it can be always introduced later. But, I wouldn't be also opposed to introducing it immediately either. I think the fatal output can be detected when the |
@egonelbre how do you propose to use your API on Windows? I agree. I wanted this feature to capture crash dumps of a Windows service. /cc @zx2c4 Alex |
I've handled this until now with defer/recover in each go routine, which is really very subpar and doesn't handle all crashes. Rather than setting an output fd, what about setting an output callback? There'd be certain restrictions on what it could do, obviously, but it could still probably accomplish most things that people want. In my case, it'd just copy the bytes to some mmap'd ring buffer log file. |
@alexbrainman it would behave in a similar manner that you can provide a @zx2c4 I believe the callback has been proposed a few times, however I cannot find the exact proposals. I think the fundamental issue is that you don't know really much about the crashed system - what's broken and what's not. e.g. maybe you got a panic during sigprof where some code holds a runtime lock "x", which your code needs and you'll deadlock when calling |
Right. If you're using
Writing to event log isn't too bad. It'd support the callback-based approach I mentioned. |
This means no heap allocation, no stack splits, no map access (IIRC), particularly if you want this to cover runtime |
But also not so out of place for something in debug, right? We're offering some way to hook into runtime internals, and with that comes the responsibilities of being in the runtime. But maybe there's a better idea here: The other way of doing this might be to have an unstable and unexported function //go:linkname setCrashHook runtime.setCrashHook That then represents the lowest level. At a higher level, then the debug package can use that to implement things like Meanwhile, insane users like me can dip down into |
The restrictions on an output callback would be severe, as @prattmic says, and failure modes would be unpredictable. In Go we try to avoid that kind of deeply unsafe operation. I think it would be better to keep this issue focused on the much safer operation of setting a file descriptor. That said, I don't understand what would happen if the descriptor is a pipe that nothing is reading from. Or a file opened on a networked file system that is not responding. What should do in cases like that? |
What do you think of my proposal above of allowing this to be a general callback via go:linkname -- i.e. just for libraries? That's a mere "implementation detail", but it'd go a long way of enabling this to be extensible for the adventurous. |
I'm not sure we need to worry about those (beyond documentation, perhaps). stderr could already be any of those kinds of descriptors, as set by the parent process, so the same problems would exist today. |
I guess this is the main danger with the proposal. As @prattmic mentioned, somebody could pipe stderr to somewhere that isn't being read. I didn't test it, but I think that would block in the same way. Using a non-blocking write that drops data when the fd is not responding would be nice, however I suspect that would be difficult to implement. It'll definitely need examples how to write the listening side. Other than that, I don't have any further ideas. |
This might be still difficult due to the constraints, but if the writing is on a different thread and the previous write hasn't finished in appropriate time, the write can be dropped. The threshold could be configured via GODEBUG or similar. |
Should we just have GODEBUG=debugfd=3? /cc @aclements |
@zx2c4 I'm not fond of documenting a callback to use with |
I'd be okay with GODEBUG=debugfd=3. It's not straightforward to implement, though, since we don't distinguish between runtime output and anything else printed by I agree with @alexbrainman that I'm not sure how this would work on Windows. Internally, we treat any "fd" other than 1 or 2 as a raw Windows handle value, but I don't know if such a thing can be meaningfully passed as an environment variable. Is it possible for a parent process to create a handle and pass it to a child like this? On the topic of a callback, in addition to the subtle and rather onerous restrictions on the callback, those restrictions can also change with new releases. We don't expose anything remotely like that right now. |
I did think about GODEBUG, but I suspect that would be complicated to use in Windows services since you don't have a nice way to figure out the fd prior to starting it. The main process needs to respond to messages (https://github.com/golang/sys/blob/ef89a241ccb347aa16709cf015e91af5a08ddd33/windows/svc/example/service.go#L23).
|
@egonelbre I don't understand what you are suggesting. If I have a Windows service written in Go, how can I redirect its crash dump to a file by using Similarly, I don't see how using Maybe we can use file path instead of file descriptor? Like Alex |
@alexbrainman , for writing to a file, I imagine you would open the file and then pass that handle to
Is there a reason this can't be done the other way around? The first process is the monitor process and opens the FD, then starts the second process with the (I'm not necessarily opposed to having a |
Yea, documenting it is probably the wrong thing to do. But having something in there akin to nanotime would be "very nice"...and would allow me to implement something for Go's Windows service package that uses it to ship things to eventlog automatically. Anyway, I realize pushing this too hard is an loosing battle, for good reason. But if somehow the implementation happened to be structured in a way that made it coincidentally possible to go:linkname it from elsewhere, I would be very happy. |
You can make a handle inheritable, and then pass its value as an argument to the new process. The value will remain the same in the new process. Alternatively, DuplicateHandleEx allows the one process to duplicate a handle into another process. |
@alexbrainman, so roughly what I'm thinking is the following. However, I might also forget some internal details that could make the following difficult with Windows. The most basic thing is to send the output to a separate file:
The next step would be to create a watchdog sub-process:
You could also create governors using ExtraFiles:
For windows, I think (named) pipes could also be used:
|
Change https://go.dev/cl/547978 mentions this issue: |
Change https://go.dev/cl/548121 mentions this issue: |
Change https://go.dev/cl/548735 mentions this issue: |
This change adds recover() + bug.Reportf to the generated dispatch routines for the server and client, so that we learn through telemetry of unexpected panics in RPC goroutines. Panics in other goroutines cannot yet be reported, but go1.23's runtime/debug.SetCrashOutput feature (golang/go#42888) will make that possible, with some extra work. Change-Id: I893333d296096632fa47e3238a72b0ac48386375 Reviewed-on: https://go-review.googlesource.com/c/tools/+/548735 LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Robert Findley <rfindley@google.com>
Change https://go.dev/cl/558256 mentions this issue: |
Change https://go.dev/cl/559503 mentions this issue: |
Change https://go.dev/cl/559800 mentions this issue: |
Change https://go.dev/cl/559801 mentions this issue: |
This change defines a crash monitor that increments a telemetry counter for the stack any time the Go runtime crashes. It relies on the runtime/debug.SetCrashOutput feature of go1.23 (#42888); if built with an earlier version of Go, the monitor is a no-op. This feature causes the Go runtime to write the crash to an arbitrary file instead of stderr. In this case, the file is a pipe to a subprocess that executes the application executable in monitor mode. When the pipe is closed, the monitor parses the program counters out of the backtrace, then uses a the usual logic to convert it to the name of a counter, which it then increments. The CL includes two tests, an internal test that parses a crash from the child process's stderr, and an integration test that uses crashmonitor.Start and SetCrashOutput, if available. Updates golang/go#42888 Change-Id: Ie40f9403fa99a9e0bab3b4edc9430be5e18150d7 Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/559503 Auto-Submit: Alan Donovan <adonovan@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Robert Findley <rfindley@google.com>
This change enables the crash monitor in gopls, so that it increments a telemetry counter for the stack any time the Go runtime crashes. x/telemetry cannot be built with go1.18, which gopls still supports (though not for long). So we must ensure that there are no direct references to x/telemetry from gopls; all references must go through the build-tagged non-functional shims, which avoids the dependency. Also, internal/telemetry can no longer depend on protocol, to avoid a cycle via protocol -> bug -> telemetry. Updates golang/go#42888 Change-Id: I7a57b9a93ab76a58ec6dd73f895d4b42ed29ea86 Reviewed-on: https://go-review.googlesource.com/c/tools/+/558256 Reviewed-by: Robert Findley <rfindley@google.com> Reviewed-by: Cherry Mui <cherryyz@google.com> Auto-Submit: Alan Donovan <adonovan@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Updates #42888 Change-Id: I72e408301e17b1579bbea189bed6b1a0154bd55f Reviewed-on: https://go-review.googlesource.com/c/go/+/548121 Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
This feature makes it possible to record unhandled panics in any goroutine through a watchdog process (e.g. the same application forked+exec'd as a child in a special mode) that can process the panic report, for example by sending it to a crash-reporting system such as Go telemetry or Sentry. Fixes golang#42888 Change-Id: I5aa7be8f726bbc70fc650540bd1a14ab60c62ecb Reviewed-on: https://go-review.googlesource.com/c/go/+/547978 Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Auto-Submit: Alan Donovan <adonovan@google.com> Reviewed-by: Russ Cox <rsc@golang.org>
Updates golang#42888 Change-Id: I72e408301e17b1579bbea189bed6b1a0154bd55f Reviewed-on: https://go-review.googlesource.com/c/go/+/548121 Reviewed-by: Michael Pratt <mpratt@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
SetCrashOutput dup's the input file for safety, but I don't think that the docs are very clear about what the caller can/should do with f. "it does not close the previous file" is particularly confusing, as it does close the previous FD (but not the previous passed os.File). Expand and attempt to clarify the explanation, borrowing wording from net.FileConn, which also dup's the input os.File. For #42888. Change-Id: I1c96d2dce7899e335d8f1cd464d2d9b31aeb4e5e Reviewed-on: https://go-review.googlesource.com/c/go/+/559800 Reviewed-by: Alan Donovan <adonovan@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
There are two separate cases here: The base case is simple: a concurrent call to SetCrashOutput while panicking will switch the crash FD, which could cause the first half of writes to go to the old FD, and the second half to the new FD. This isn't a correctness problem, but would be annoying to see in practice. Since it is easy to check for, I simply drop any changes if panicking is already in progress. The second case is more important: SetCrashOutput will close the old FD after the new FD is swapped, but writeErrData has no locking around use of the fd, so SetCrashOutput could close the FD out from under writeErrData, causing lost writes. We handle this similarly, by not allowing SetCrashOutput to close the old FD if a panic is in progress, but we have to be more careful about synchronization between writeErrData and setCrashFD to ensure that writeErrData can't observe the old FD while setCrashFD allows close. For #42888. Change-Id: I7270b2cc5ea58a15ba40145b7a96d557acdfe842 Reviewed-on: https://go-review.googlesource.com/c/go/+/559801 Reviewed-by: Alan Donovan <adonovan@google.com> Auto-Submit: Michael Pratt <mpratt@google.com> Reviewed-by: Michael Knyszek <mknyszek@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@adonovan I had been watching for this API to become available, though having used it I'm a bit disappointed.
Can I suggest we either
Or am I just missing something? I have no need for a separate monitor process - I think a process should be able to catch its own crash output. |
The file modification time records the time of the last write to the file, so you should be able to infer the crash time from it.
Indeed, you will need to create a file for every process, and (one hopes) the vast majority will be empty. Presumably you will create them in a directory other than /tmp (where their directory entries would expire even while the process is still running, at least on Unix). Whatever program of yours monitors this directory can ignore the empty files and clean up however it sees fit. Also, your cleanupCrashlog function is allowed to call SetCrashOutput(nil), after which it should be possible to remove the log file even on Windows. |
For the last crash only
I really do feel though this feature missed the mark. At least if SetCrashOutput() received an interface instead of os.File I feel like there'd be more opportunity to customize the behaviour. As you say, I suppose my process will have to self prune these empty crash logs...
There it is! I can make that work - thank you! |
The directory is never going to be entirely empty because each running process will need a corresponding file. So either way, you would need to change this predicate to "Directory of only empty files == all good".
There's no way that the runtime can make a dynamic call through an interface while in the middle of a crashing. Observe that the implementation must write directly to a file descriptor; it doesn't use os.File.Write. That's the reason it's not an io.Writer. |
Thank you for the responses, I appreciate your time on this.
Sorry, I should have been clearer. I was referring to a manual, visible inspection of said directory. Which I think you'll agree is still easier to do on a single file (that's required for a running process) vs 1000s of leftover files
I still think this makes a case for a future CrashOptions{} field to remove the crash file if it is empty. |
Agreed, but that sounds like a tedious manual job best automated and delegated to a computer. ;-) And there's no getting around the need for each running process to have an empty file. If the program acted as its own crash monitor (using a pipe to a child process---see ExampleSetCrashOutput), it could defer the opening of a file until the monitored program actually crashed.
There's no way for the runtime to implement that behavior. It would require the runtime to somehow know that the process was never going to crash. The only code that knows that is your main function just before it calls os.Exit(0), assuming there are no other goroutines running. Even then, many long lived programs that you might want to monitor never voluntarily terminate: healthy web servers are often killed by the termination of their container. |
Currently there are difficulties with nicely capturing crash output from Go programs. Crashes print to stderr, however, usually both stdout and stderr are used by programs for other purposes. While it's not wrong to output it to stderr, it mixes two outputs together making later separation more difficult.
Capturing crashes is useful for post-mortem debugging and sending reports. Of course, due to unknown crashed program state using a custom callback is understandably dangerous.
A simpler alternative would be to allow changing file descriptor where to print crashes. In the simplest form it could look like:
This would allow passing in a
fd
for separate file, pipe to another "crash monitor" process or connect to an external server. Of course there's a slight possibility ofwrite
taking a long time, when passing in a pipe/connection slowing down the crashing of the program.With regards to naming it could also be
SetCrashFD
,SetTracebackFD
orSetTracebackOutputFD
.The text was updated successfully, but these errors were encountered: