Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: CallersFrame returns incorrect frame for panics in inlined frame #70637

Closed
adonovan opened this issue Dec 2, 2024 · 28 comments
Closed
Assignees
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Milestone

Comments

@adonovan
Copy link
Member

adonovan commented Dec 2, 2024

We've noticed a number of recent gopls panics in which the leaf function in the backtrace doesn't appear to contain a fallible operation.

I wonder whether the pclntab is incorrect in these cases.

@findleyr

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Dec 2, 2024
@mknyszek mknyszek added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Dec 2, 2024
@mknyszek
Copy link
Contributor

mknyszek commented Dec 2, 2024

CC @golang/compiler

@mknyszek mknyszek added this to the Backlog milestone Dec 2, 2024
@findleyr
Copy link
Member

findleyr commented Dec 2, 2024

The most stark example of this is https://go.dev/issue/70634, where the panic is immediately preceded by a nil check.

In that case, there is a plausible panic, on the same line (see my comment), which is correctly reported in #70636. Perhaps that provides a clue.

@prattmic
Copy link
Member

prattmic commented Dec 4, 2024

For cases that don't appear to have a fallable operation at all, perhaps we could build the exact binary and dump out all of the PCs that report this location to see if anything there looks fallable.

@prattmic
Copy link
Member

prattmic commented Dec 4, 2024

The most stark example of this is https://go.dev/issue/70634, where the panic is immediately preceded by a nil check.

In that case, there is a plausible panic, on the same line (see my comment), which is correctly reported in #70636. Perhaps that provides a clue.

In that case, one panic (the plausible one) is goPanicIndex while the other is panicmem, so it seems unlikely that these are the exact same panic but with different locations.

@adonovan
Copy link
Member Author

adonovan commented Dec 4, 2024

For cases that don't appear to have a fallible operation at all, perhaps we could build the exact binary and dump out all of the PCs that report this location to see if anything there looks fallible.

Yeah, I had done this with the GetIfaceStubInfo crash, but nothing stood out. I've dumped the listing for that function at #70666 (comment) so you can take a look for yourself.

If I can reproduce it locally it would be interesting to see whether the panic backtrace corresponds to the telemetry trace.

@findleyr
Copy link
Member

findleyr commented Dec 5, 2024

In that case, one panic (the plausible one) is goPanicIndex while the other is panicmem, so it seems unlikely that these are the exact same panic but with different locations.

types.Signature.results can be nil if there are no results, or have length less than the number of syntactic results in the return statement. So both the panicmem and goPanicIndex could have the same root cause.

@prattmic
Copy link
Member

prattmic commented Dec 5, 2024

Another possibility not yet mentioned is memory corruption. However, given this are coming from multiple GOOS/GOARCH, it isn't just one user, and the relatively small scale of telemetry collection makes it seem unlikely there would be so many unique corrupt reporters.

@findleyr
Copy link
Member

findleyr commented Dec 5, 2024

@prattmic I can reproduce the crash report of #70634 with gopls@v0.17.0-pre.3, and the repro from my fix for #70636.

Repro:

go install golang.org/x/tools/gopls@v0.17.0-pre.3

Paste this into a buffer:

var xx int
var xy string


func _() {
	return Foo(x) // <- trigger completion after 'x'
}

func Foo[T any](t T) T {}

Trigger completion where noted.

@findleyr
Copy link
Member

findleyr commented Dec 5, 2024

@adonovan, @prattmic, and I looked into this together: looks like an off-by-one error for inlined calls in CallersFrames or the output fed to SetCrashOutput.

@prattmic prattmic changed the title cmd/compile: confusing backtraces (incorrect pclntab?) runtime: CallersFrame returns incorrect frame for panics in inlined frame Dec 5, 2024
@prattmic
Copy link
Member

prattmic commented Dec 5, 2024

I have a simplified local reproducer.

func deref(v *int) int {
        return *v
}

var ptr *int
var sink int

func InlineCrash() {
        sink = deref(ptr)
}

This panics like so:

panic: runtime error: invalid memory address or nil pointer dereference                                                                                    
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x4f8dc3]                                                                                     
                                                                               
goroutine 1 gp=0xc000002380 m=0 mp=0x649c80 [running]:                                                                                                      
panic({0x51c3a0?, 0x641d30?})                                                  
        /usr/local/google/home/mpratt/src/go/src/runtime/panic.go:806 +0x168 fp=0xc000088e00 sp=0xc000088d50 pc=0x468e68    
runtime.panicmem(...)    
        /usr/local/google/home/mpratt/src/go/src/runtime/panic.go:262          
runtime.sigpanic()    
        /usr/local/google/home/mpratt/src/go/src/runtime/signal_unix.go:925 +0x359 fp=0xc000088e60 sp=0xc000088e00 pc=0x46add9    
main.deref(...)                                                                
        /usr/local/google/home/mpratt/src/go/src/runtime/testdata/testprog/callers.go:23                                                                   
main.InlineCrash()                                                             
        /usr/local/google/home/mpratt/src/go/src/runtime/testdata/testprog/callers.go:34 +0x63 fp=0xc000088eb8 sp=0xc000088e60 pc=0x4f8dc3                 
main.main()                                                                    
        /usr/local/google/home/mpratt/src/go/src/runtime/testdata/testprog/main.go:34 +0x133 fp=0xc000088f50 sp=0xc000088eb8 pc=0x500853    
runtime.main()    
        /usr/local/google/home/mpratt/src/go/src/runtime/proc.go:283 +0x28b fp=0xc000088fe0 sp=0xc000088f50 pc=0x43742b    
runtime.goexit({})    
        /usr/local/google/home/mpratt/src/go/src/runtime/asm_amd64.s:1700 +0x1 fp=0xc000088fe8 sp=0xc000088fe0 pc=0x46fda1

Parsing the PCs from the panic stack trace and feeding them to CallersFrames results in these frames:

Frame: PC 0x468e67 Function runtime.gopanic
Frame: PC 0x46add8 Function runtime.panicmem
Frame: PC 0x46ada8 Function runtime.sigpanic
Frame: PC 0x4f8dc2 Function main.InlineCrash
Frame: PC 0x500852 Function main.main
Frame: PC 0x43742a Function runtime.main
Frame: PC 0x46fda0 Function runtime.goexit

Notice that the deref frame is missing.

This only occurs if the crash is the first instruction in an inlined function. If we change deref to add more instructions, like so:

func deref(v *int) int {
        sink = 1
        return *v
}

Then we get correct frames:

Frame: PC 0x468e67 Function runtime.gopanic
Frame: PC 0x46add8 Function runtime.panicmem
Frame: PC 0x46ada8 Function runtime.sigpanic
Frame: PC 0x4f8dcd Function main.deref
Frame: PC 0x4f8dbc Function main.InlineCrash
Frame: PC 0x500872 Function main.main
Frame: PC 0x43742a Function runtime.main
Frame: PC 0x46fda0 Function runtime.goexit

CallersFrames attempts to handle inputs missing inlined frames (such as PCs parsed from a stack trace), but it looks like it only considers PCs that are inline "call sites". When a panic occurs, the PC is not a call site.

In addition to panics, I think the same issue may occur if the leaf PC is in an inlined frame, though I'm not sure if one could actually get such a stack trace from the runtime?

cc @golang/runtime

@prattmic
Copy link
Member

prattmic commented Dec 5, 2024

https://go.dev/cl/633998 has the full test case.

@adonovan
Copy link
Member Author

adonovan commented Dec 5, 2024

@prattmic has reduced and isolated the problem in #70637 (comment). Closing this issue as a dup of that one.

@adonovan adonovan closed this as not planned Won't fix, can't repro, duplicate, stale Dec 5, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Go Compiler / Runtime Dec 5, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/633998 mentions this issue: WIP: runtime: CallersFrame inline crash test

@prattmic
Copy link
Member

prattmic commented Dec 5, 2024

Can't be a dupe of ourselves!

@prattmic prattmic reopened this Dec 5, 2024
@github-project-automation github-project-automation bot moved this from Done to In Progress in Go Compiler / Runtime Dec 5, 2024
@adonovan
Copy link
Member Author

adonovan commented Dec 5, 2024

Sorry. Tired today.

@prattmic
Copy link
Member

prattmic commented Dec 5, 2024

I don't see an obvious fix for this. We need to detect that we should not decrement the PC for this frame since it doesn't correspond to the PC after a CALL. And we need to do so in a way that does affect users of Callers, which already provides inlined frames.

We may want to do something similar to framepointer unwinding for tracing, where we include a sentinel frame indicating whether inlined frames are already included (Callers would return with this sentinel).

That would at least make any fix here more maintainable, as it we would always know which scheme we fall into.

@prattmic
Copy link
Member

prattmic commented Dec 5, 2024

We may want to do something similar to framepointer unwinding for tracing, where we include a sentinel frame indicating whether inlined frames are already included (Callers would return with this sentinel).

This is probably untenable. We definitely couldn't make the first frame a sentinel, as lots of users grab the first PC to get a caller PC. Maybe it could go at the end? But I also see users that call FuncForPC on every PC even though we tell them not to.

@prattmic
Copy link
Member

prattmic commented Dec 6, 2024

I believe the only ways for a PC in a stack frame to be something other than a return address are:

  1. It is the leaf frame.
  2. Or, it was the site of some sort of out-of-band jump away (e.g., a signal like SIGSEGV).

It feels ugly, but perhaps we should hard-code heuristics for (2) as there aren't many different types of these signal jumps. In other words, if the preceding frame is runtime.sigpanic, then don't decrement the PC in the following frame.

@prattmic
Copy link
Member

prattmic commented Dec 6, 2024

I am not confident about being able to fix this within the current API. The fundamental issue is that Callers and text stack traces report slightly different PCs:

Callers always reports "return addresses" from which you can subtract 1 to get back to the "call". Even for traps, Callers reports "trap pc + 1", so the subtraction rule can be applied consistently. See tracebackPCs, where symPC returns the trap PC and pcBuf gets pc + 1.

This allows using FuncForPC(pc-1) to symbolize the results from Callers. Though we "discourage" this, as far as I can tell it works perfectly fine. (I'd appreciate it if anyone knows cases this doesn't work). Thus I don't think we can change the behavior of Callers.

Text stack traces report return PCs for actual calls, but for traps they report the actual trap PC (no +1). This is important to make reports make sense to humans. If we reported "trap PC + 1", I am sure some folks would get confused into thinking their crash was because the program tried to jump into the middle of an instruction. Thus I don't think we can change the behavior of text stack traces.

So we have two different behaviors and can't change either. As far as can tell, there is no 100% correct way for CallersFrames to determine which source its input came from. e.g., if there is no inlining then the PC buffer for a panic would be identical between Callers and the text stack trace except for +1 on the trap PC from Callers.


Though it adds more complexity, I believe golang.org/x/telemetry/internal/crashmonitor could detect and fixup these cases.

The runtime knows that "trap" frames are those following runtime.sigpanic, runtime.asyncpreempt, or runtime.debugCallV2 (practically speaking, only runtime.sigpanic should appear in crash reports). If the crash monitor
sees one of these frames, it can increment the PC of the following frame just as the runtime does for Callers, so that the input will be in the form expected by CallersFrames.

(That said, I won't be surprised if we find more edge cases that need fixup in the future.)

@ianlancetaylor
Copy link
Member

I agree that in general we can't simply turn the addresses from a stack trace into a sequence of addresses suitable for runtime.CallersFrames. To do that at all, we have to treat certain addresses specially. Since there should be a limited set of those addresses, we should be able to string match on the text to adjust the PC appropriately. Am I missing something?

@prattmic
Copy link
Member

prattmic commented Dec 6, 2024

Am I missing something?

I don't think so. I believe we are both suggesting the same thing.

@findleyr
Copy link
Member

Summarizing, based on in-person conversation: @prattmic suggests that the special handling needed by x/telemetry is just: if the panic is sigpanic, increment the first PC in the stack. @adonovan asks that we document this limitation in CallersFrames.

Michael, Alan, did I get that right?

If so, perhaps a path forward is to:

  1. Turn this into a proposal for the new documentation on CallersFrames, for 1.25.
  2. Go ahead and implement this workaround in x/telemetry.

I'd really like to fix the inaccurate crash reports from the x/telemetry crash monitor.

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/637755 mentions this issue: internal/crashmonitor: adjust trap frames

@gopherbot
Copy link
Contributor

Change https://go.dev/cl/637756 mentions this issue: internal/crashmonitor: clarify pre-go1.23 behavior

@cherrymui
Copy link
Member

Sorry for being late to the discussion, but reading the comments here again, I wonder if we could combine your "sentinel" idea into it. That is, in CallersFrames, if the first PC (or last, whatever) is the sentinel, we assume the PCs are from a textual stack trace (or otherwise manually constructed), and don't do the -1 for traps. Otherwise (if the sentinel is absent), we assume it is from Callers and just do the old behavior. Perhaps we could also guard the inline expansion based on the mode.

Then telemetry could just pass the PCs from the stack trace, plus the sentinel.

gopherbot pushed a commit to golang/telemetry that referenced this issue Dec 20, 2024
The current comment implies that traceback includes PCs for inlined
frames in 1.23, which is not the case.

For golang/go#70637.

Change-Id: I26fe4328843733d230e31ccdb452d4a25ae91b3c
Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/637756
Reviewed-by: Alan Donovan <adonovan@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Robert Findley <rfindley@google.com>
@github-project-automation github-project-automation bot moved this from In Progress to Done in Go Compiler / Runtime Dec 20, 2024
@gopherbot
Copy link
Contributor

Change https://go.dev/cl/638015 mentions this issue: runtime: test trap panic parsing in TestTracebackSystem

@prattmic
Copy link
Member

@cherrymui Apologies, I didn't see your comment before submitting my CL.

IIUC, the core difference of my "sentinel" proposal vs yours is that I was thinking that runtime.Callers would return with a sentinel, while PCs reconstructed from a stack trace would not. (This doesn't work because users want to pass each PC from Callers to FuncForPC)

You are suggesting the opposite: Callers returns only PCs, but PCs reconstructed from a stack trace would explicitly need to add a sentinel.

I think that could work, I hadn't considered that.

Since it requires explicit changes by callers parsing stack traces, it is basically a round-about way of adding a new API / new argument to CallersFrames. From that perspective I'm not a huge fan vs actually adding a new API, which would be easier to use than a magic sentinel.

gopherbot pushed a commit that referenced this issue Dec 20, 2024
This mirrors https://go.dev/cl/637755, as x/telemetry is now aware of
sigpanic preceding trap frames.

For #70637.

Change-Id: I13a775f25e89047702d4f2d463ce3210bcf192d9
Reviewed-on: https://go-review.googlesource.com/c/go/+/638015
Reviewed-by: Cherry Mui <cherryyz@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
@cherrymui
Copy link
Member

No worries :) Your CL is totally fine.

it is basically a round-about way of adding a new API / new argument to CallersFrames.

Agree. Using a magical sentinel to hide a new API/use case is not nice. Perhaps the need will diminish once we have structured stack trace.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
compiler/runtime Issues related to the Go compiler and/or runtime. NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one.
Projects
Development

No branches or pull requests

7 participants