-
Notifications
You must be signed in to change notification settings - Fork 17.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: broken work conservation due to CL 310850 #45867
Comments
2021-11-17T04:31:29-fceca2c/freebsd-arm64-dmgk |
Marking as release-blocker for Go 1.18 due to the high failure rate. (Note that at least one failure within the current cycle is on a |
Looking at the test, and specifically the comment
I suspect that what's wrong here is that a GC is already in progress, and this is fallout from the new minimum heap size making that more likely. This is probably already mitigated by https://go.dev/cl/368137, but still theoretically a failure here is possible. Indeed, the last failure was on Dec 2nd (going by the I'll send a CL to fix this for good. |
Change https://golang.org/cl/369747 mentions this issue: |
These tests disable GC because of the potential for a deadlock, but don't consider that a GC could be in progress due to other tests. The likelihood of this case was increased when the minimum heap size was lowered during the Go 1.18 cycle. The issue was then mitigated by CL 368137 but in theory is always a problem. This change is intended specifically for #45867, but I just walked over a whole bunch of other tests that don't take this precaution where it seems like it could be relevant (some tests it's not, like the UserForcedGC test, or testprogs where no other code has run before it). Fixes #45867. Change-Id: I6a1b4ae73e05cab5a0b2d2cce14126bd13be0ba5 Reviewed-on: https://go-review.googlesource.com/c/go/+/369747 Reviewed-by: Michael Pratt <mpratt@google.com> Reviewed-by: David Chase <drchase@google.com> Trust: Michael Knyszek <mknyszek@google.com> Run-TryBot: Michael Knyszek <mknyszek@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Update on the root cause here: the fixes for #49680, #49695, and #45867 all assumed that |
Change https://golang.org/cl/369815 mentions this issue: |
Fixes for #49680, #49695, #45867, and #49370 all assumed that SetGCPercent(-1) doesn't block until the GC's mark phase is done, but it actually does. The cause of 3 of those 4 failures comes from the fact that at the beginning of the sweep phase, the GC does try to preempt every P once, and this may run concurrently with test code. In the fourth case, the issue was likely that only *one* of the debug_test.go tests was missing a call to SetGCPercent(-1). Just to be safe, leave a TODO there for now to remove the extraneous runtime.GC calls, but leave the calls in. Updates #49680, #49695, #45867, and #49370. Change-Id: Ibf4e64addfba18312526968bcf40f1f5d54eb3f1 Reviewed-on: https://go-review.googlesource.com/c/go/+/369815 Reviewed-by: Austin Clements <austin@google.com> Trust: Michael Knyszek <mknyszek@google.com> Run-TryBot: Michael Knyszek <mknyszek@google.com> TryBot-Result: Gopher Robot <gobot@golang.org>
Unfortunately this symptom is still present: https://build.golang.org/log/9c21b1044b95e285295c9cf6040932dd3f5d63b2 occurred after CL 369747. |
Hmm. The test still deadlocked, but Perhaps at least the failure mechanism changed! |
This regexp matches the one new failure observed since that change.
|
Two makes it a pattern, right? 😅
2021-12-08T17:55:13-c1c303f/freebsd-arm64-dmgk |
I'm having a really hard time reproducing this on ppc64. I'm spinning it in a loop, and after about 60 mins in 1 gomote I've got nothing. I may scale this up. Looking at this failure we do have more closely, it looks like a problem in the scheduler! Note that in https://build.golang.org/log/9c21b1044b95e285295c9cf6040932dd3f5d63b2 we're currently in the
The main goroutine (177239) is blocked on the If that goroutine (177301) were to run, the test would get unstuck. But the question is... why is it not running? What scheduler list is it sitting in for 10 whole minutes such that nothing can see it? Sure, 3 of 4 Ps are taken up by these non-preemptible goroutines that are actively spinning, but 1 P should be free to run code. There are no other running goroutines that appear to be taking up the 4th P. This suggests that what we have here is a missing wakeup somewhere. I don't see any major changes that landed in the scheduler this cycle. I do see two potential culprit CLs in Gerrit history (my search methods were crude, so perhaps there are more). There's https://golang.org/cl/310850 from May, but if that was the culprit we'd be seeing this go back much further in time to the 1.17 cycle. Then, the only other CL I see is https://golang.org/cl/317509 from August. If I could reproduce, I'd try reverting either of these, but so far no dice. (Side note: taking a more fundamental approach to the code... I'm also curious as to why this test deadlocks due to preemption at all (even before I added the |
#50749 explains why async preemption did not kick in to allow forward progress. Disabling A few ideas I've brainstormed:
|
FWIW, explicitly setting |
Interesting. That explains that I tried to run that test in isolation but never reproduced the failure... |
Correct, the test should no longer fail.
To be clear, this isn't quite "only a performance issue". The failing tests were deadlocked and would never recover. You are correct that "requires a pretty specific and subtle setup to trigger." I think dropping this as a release blocker is OK, as this issue is new in 1.17, not 1.18. We may still want to backport the eventual fix to both 1.17 and 1.18. |
Change https://go.dev/cl/389014 mentions this issue: |
@prattmic This issue is in the 1.19 milestone. I'm not clear on the status. Should this be 1.20? Backlog? Closed? Thanks. |
2021-04-30T00:32:42-303b194/linux-amd64-noopt
2020-06-26T17:07:58-d1015f3/freebsd-386-11_2
Also seen on a recent TryBot:
https://storage.googleapis.com/go-build-log/eb7ebec0/freebsd-amd64-12_2_913d0c87.log
CC @jeremyfaller @aclements
The text was updated successfully, but these errors were encountered: