-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: CallersFrame returns incorrect frame for panics in inlined frame #70637
Comments
CC @golang/compiler |
The most stark example of this is https://go.dev/issue/70634, where the panic is immediately preceded by a nil check. In that case, there is a plausible panic, on the same line (see my comment), which is correctly reported in #70636. Perhaps that provides a clue. |
For cases that don't appear to have a fallable operation at all, perhaps we could build the exact binary and dump out all of the PCs that report this location to see if anything there looks fallable. |
In that case, one panic (the plausible one) is goPanicIndex while the other is panicmem, so it seems unlikely that these are the exact same panic but with different locations. |
Yeah, I had done this with the GetIfaceStubInfo crash, but nothing stood out. I've dumped the listing for that function at #70666 (comment) so you can take a look for yourself. If I can reproduce it locally it would be interesting to see whether the panic backtrace corresponds to the telemetry trace. |
|
Another possibility not yet mentioned is memory corruption. However, given this are coming from multiple GOOS/GOARCH, it isn't just one user, and the relatively small scale of telemetry collection makes it seem unlikely there would be so many unique corrupt reporters. |
@prattmic I can reproduce the crash report of #70634 with gopls@v0.17.0-pre.3, and the repro from my fix for #70636. Repro:
Paste this into a buffer:
Trigger completion where noted. |
I have a simplified local reproducer.
This panics like so:
Parsing the PCs from the panic stack trace and feeding them to CallersFrames results in these frames:
Notice that the This only occurs if the crash is the first instruction in an inlined function. If we change
Then we get correct frames:
CallersFrames attempts to handle inputs missing inlined frames (such as PCs parsed from a stack trace), but it looks like it only considers PCs that are inline "call sites". When a panic occurs, the PC is not a call site. In addition to panics, I think the same issue may occur if the leaf PC is in an inlined frame, though I'm not sure if one could actually get such a stack trace from the runtime? cc @golang/runtime |
https://go.dev/cl/633998 has the full test case. |
@prattmic has reduced and isolated the problem in #70637 (comment). Closing this issue as a dup of that one. |
Change https://go.dev/cl/633998 mentions this issue: |
Can't be a dupe of ourselves! |
Sorry. Tired today. |
I don't see an obvious fix for this. We need to detect that we should not decrement the PC for this frame since it doesn't correspond to the PC after a CALL. And we need to do so in a way that does affect users of Callers, which already provides inlined frames. We may want to do something similar to framepointer unwinding for tracing, where we include a sentinel frame indicating whether inlined frames are already included (Callers would return with this sentinel). That would at least make any fix here more maintainable, as it we would always know which scheme we fall into. |
This is probably untenable. We definitely couldn't make the first frame a sentinel, as lots of users grab the first PC to get a caller PC. Maybe it could go at the end? But I also see users that call FuncForPC on every PC even though we tell them not to. |
I believe the only ways for a PC in a stack frame to be something other than a return address are:
It feels ugly, but perhaps we should hard-code heuristics for (2) as there aren't many different types of these signal jumps. In other words, if the preceding frame is |
I am not confident about being able to fix this within the current API. The fundamental issue is that Callers and text stack traces report slightly different PCs: Callers always reports "return addresses" from which you can subtract 1 to get back to the "call". Even for traps, Callers reports "trap pc + 1", so the subtraction rule can be applied consistently. See This allows using Text stack traces report return PCs for actual calls, but for traps they report the actual trap PC (no +1). This is important to make reports make sense to humans. If we reported "trap PC + 1", I am sure some folks would get confused into thinking their crash was because the program tried to jump into the middle of an instruction. Thus I don't think we can change the behavior of text stack traces. So we have two different behaviors and can't change either. As far as can tell, there is no 100% correct way for CallersFrames to determine which source its input came from. e.g., if there is no inlining then the PC buffer for a panic would be identical between Callers and the text stack trace except for +1 on the trap PC from Callers. Though it adds more complexity, I believe golang.org/x/telemetry/internal/crashmonitor could detect and fixup these cases. The runtime knows that "trap" frames are those following (That said, I won't be surprised if we find more edge cases that need fixup in the future.) |
I agree that in general we can't simply turn the addresses from a stack trace into a sequence of addresses suitable for |
I don't think so. I believe we are both suggesting the same thing. |
Summarizing, based on in-person conversation: @prattmic suggests that the special handling needed by x/telemetry is just: if the panic is sigpanic, increment the first PC in the stack. @adonovan asks that we document this limitation in CallersFrames. Michael, Alan, did I get that right? If so, perhaps a path forward is to:
I'd really like to fix the inaccurate crash reports from the x/telemetry crash monitor. |
Change https://go.dev/cl/637755 mentions this issue: |
Change https://go.dev/cl/637756 mentions this issue: |
Sorry for being late to the discussion, but reading the comments here again, I wonder if we could combine your "sentinel" idea into it. That is, in CallersFrames, if the first PC (or last, whatever) is the sentinel, we assume the PCs are from a textual stack trace (or otherwise manually constructed), and don't do the -1 for traps. Otherwise (if the sentinel is absent), we assume it is from Callers and just do the old behavior. Perhaps we could also guard the inline expansion based on the mode. Then telemetry could just pass the PCs from the stack trace, plus the sentinel. |
The current comment implies that traceback includes PCs for inlined frames in 1.23, which is not the case. For golang/go#70637. Change-Id: I26fe4328843733d230e31ccdb452d4a25ae91b3c Reviewed-on: https://go-review.googlesource.com/c/telemetry/+/637756 Reviewed-by: Alan Donovan <adonovan@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com> Reviewed-by: Robert Findley <rfindley@google.com>
Change https://go.dev/cl/638015 mentions this issue: |
@cherrymui Apologies, I didn't see your comment before submitting my CL. IIUC, the core difference of my "sentinel" proposal vs yours is that I was thinking that You are suggesting the opposite: Callers returns only PCs, but PCs reconstructed from a stack trace would explicitly need to add a sentinel. I think that could work, I hadn't considered that. Since it requires explicit changes by callers parsing stack traces, it is basically a round-about way of adding a new API / new argument to CallersFrames. From that perspective I'm not a huge fan vs actually adding a new API, which would be easier to use than a magic sentinel. |
This mirrors https://go.dev/cl/637755, as x/telemetry is now aware of sigpanic preceding trap frames. For #70637. Change-Id: I13a775f25e89047702d4f2d463ce3210bcf192d9 Reviewed-on: https://go-review.googlesource.com/c/go/+/638015 Reviewed-by: Cherry Mui <cherryyz@google.com> LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
No worries :) Your CL is totally fine.
Agree. Using a magical sentinel to hide a new API/use case is not nice. Perhaps the need will diminish once we have structured stack trace. |
We've noticed a number of recent gopls panics in which the leaf function in the backtrace doesn't appear to contain a fallible operation.
I wonder whether the pclntab is incorrect in these cases.
@findleyr
The text was updated successfully, but these errors were encountered: