runtime: scheduler sometimes starves a runnable goroutine on wasm platforms #65178
Comments
This also affects … |
Does this also affect linux-amd64 if GOMAXPROCS=1? |
No, it does not: if I update the test to explicitly call |
Change https://go.dev/cl/557037 mentions this issue: |
Change https://go.dev/cl/557038 mentions this issue: |
At a theoretical level, this doesn't surprise me, and I would expect it to impact linux-amd64 GOMAXPROCS=1. This test always has at least one runnable goroutine (one of the channel send/recv goroutines). In fact, because the channel is buffered, in theory both goroutines may be runnable all of the time. When the timer fires, a third goroutine will become runnable [1]. We have one thread on which to run three goroutines, so there must be some scheduling delay. Go does not have a real-time scheduler, or even a best-effort deadline-based scheduler, so it can theoretically have unbounded scheduling delay.

Readying a goroutine typically puts it on the head of the run queue to get it to run soon. But since that applies to almost every ready, the timer firing could put goroutine 3 on the head of the run queue, and then goroutine 1 or 2 could replace the head before the scheduler runs again. If you dump the …

There are certainly many improvements that could go into the scheduler to try to make this situation better (#51071 comes to mind). The more interesting case would be if this is actually completely stuck, i.e., the timer expiration or ready goroutine was somehow lost altogether. Figuring that out by running a test is a bit of a halting problem situation, but looking at …

[1] We always run timers before looking for other work in the scheduler, so the timer should always fire. But an AfterFunc timer just starts a goroutine to run the func, which is the third goroutine. |
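A toy model of the displacement described above (my own simplification; the identifiers here are not the runtime's real data structures): a single run queue plus a favored next-to-run slot, where every ready overwrites the slot and the scheduler always prefers it. Two goroutines that keep readying each other then hold the slot forever, and the readied timer goroutine parked on the queue never runs.

package main

import "fmt"

type sched struct {
	runnext string   // favored next-to-run slot
	runq    []string // regular FIFO run queue
}

func (s *sched) ready(g string) {
	// A readied goroutine takes the favored slot; whoever held it is
	// pushed to the back of the regular queue.
	if s.runnext != "" {
		s.runq = append(s.runq, s.runnext)
	}
	s.runnext = g
}

func (s *sched) next() string {
	// The favored slot always wins over the regular queue.
	if s.runnext != "" {
		g := s.runnext
		s.runnext = ""
		return g
	}
	g := s.runq[0]
	s.runq = s.runq[1:]
	return g
}

func main() {
	s := &sched{}
	s.ready("timer") // the AfterFunc goroutine becomes runnable...
	s.ready("send")  // ...and is immediately displaced by a channel ready

	for i := 0; i < 8; i++ {
		g := s.next()
		fmt.Printf("step %d: run %-5s queue=%v\n", i, g, s.runq)
		// Each channel goroutine readies its partner and blocks again.
		switch g {
		case "send":
			s.ready("recv")
		case "recv":
			s.ready("send")
		}
	}
	// Output: "timer" sits in the queue on every step and never runs.
}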
I agree that the runtime is theoretically allowed to produce this behavior according to the language spec. However, the addition of non-cooperative preemption (#24543) set up a strong expectation that every runnable goroutine will eventually be scheduled, and resolved a large number of previous scheduler fairness issues (linked from that proposal). We also explicitly made mutex locking starvation-free in #13086. To me, that suggests that this sort of goroutine starvation is not intended behavior. This sort of starvation shows up in realistic programs that use |
Fixes #65177. Updates #65178. Updates #64321.

Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32
Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/557037
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Damien Neil <dneil@google.com>
I think I poorly worded my previous message and came across too strongly. I don't think this is good behavior; I meant that there are two broad classes this problem may be in:

1. The readied goroutine is still runnable, but the scheduler keeps delaying it indefinitely (a fairness/starvation problem).
2. The timer expiration or ready is somehow lost altogether, so the goroutine will never run (a correctness bug in the scheduler).
|
Ah, yeah. I think the evidence so far supports (1) — the goroutine dumps from the timed-out tests list the … (The fact that it is listed as … |
I thought this could be working on amd64 because the goroutines are being non-cooperatively preempted by GC which could randomly restart execution. But trying out with |
An even more minified example reproduces for me on wasip1:

func TestAfterFuncStarvation(t *testing.T) {
	var (
		wg   sync.WaitGroup
		stop atomic.Bool
-		c    = make(chan bool, 1)
+		c    = make(chan bool)
	)
	wg.Add(2)

	go func() {
		for !stop.Load() {
			c <- true
		}
		close(c)
		wg.Done()
	}()

	go func() {
		for range c {
		}
		wg.Done()
	}()

	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
	wg.Wait()
}

It wouldn't reliably fail with … |
Even more surprisingly, this works on …:

func TestAfterFuncStarvation2(t *testing.T) {
	var stop atomic.Bool
	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
	for !stop.Load() {
	}
}

But, as expected, it never works on … |
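A minimal user-level sketch (my own, not from any CL in this thread), assuming only the standard time and runtime APIs: on wasm, a busy-wait loop like the one above will not let the timer goroutine run unless it yields explicitly (nothing else preempts it there today), for example via runtime.Gosched.

package main

import (
	"runtime"
	"sync/atomic"
	"time"
)

func main() {
	var stop atomic.Bool
	time.AfterFunc(1*time.Microsecond, func() { stop.Store(true) })
	for !stop.Load() {
		// Without this explicit yield there is nothing on wasm (no sysmon,
		// no signals) to interrupt the loop, so the AfterFunc goroutine
		// never gets scheduled and the loop spins forever.
		runtime.Gosched()
	}
}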
I am able to repro on Line 170 in f19f31f.

-count=10000 works in 0.173s, but -count=100000 times out with equivalent stack traces.

This explains why … |
The test is skipped on wasm platforms for now, because it successfully detects a starvation bug on those platforms.

For #65178.

Change-Id: I05d28f1c7be99fcab67ec4dfaa38f412e11fd3cb
Reviewed-on: https://go-review.googlesource.com/c/go/+/557038
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Damien Neil <dneil@google.com>
Thanks for digging @Jorropo! sysmon preempts long-running goroutines, so without that preemption, the tight loop above never yields. Still, the lack of sysmon on wasm makes me wonder if we should be more aggressive about self-preemption there. |
I had the same question; I don't see how to solve the tight loop without the compiler aggressively injecting preemption checks. I find it interesting that the original example reproduces on amd64 after I disable sysmon. |
On non-wasm platforms, we already do this using the "inheritTime" result from |
Change https://go.dev/cl/557615 mentions this issue: |
CL 557037 added a runtime.Gosched to prevent goroutine starvation in the wasm fake-net stack. Unfortunately, that Gosched causes the scheduler to enter a very similar starvation loop in this test. Add another runtime.Gosched to break this new loop.

For golang/go#65178.

Change-Id: I24b3f50dd728800462f71f27290b0d8f99d5ae5b
Cq-Include-Trybots: luci.golang.try:x_net-gotip-wasip1-wasm_wasmtime,x_net-gotip-wasip1-wasm_wazero,x_net-gotip-js-wasm
Reviewed-on: https://go-review.googlesource.com/c/net/+/557615
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
Change https://go.dev/cl/557437 mentions this issue: |
With https://go.dev/cl/557437, the original repro no longer hangs. This does not do anything to impact goroutines that simply never yield at all, like #65178 (comment). |
That sounds great! I think it's reasonable to only fix goroutines that actually yield for now, since that sort of live-lock is relatively much easier to diagnose. (For the single-goroutine case, the live-locked goroutine will actually be running in a goroutine dump, as opposed to “running some of the time”.) |
Found new dashboard test flakes for:
2024-01-10 15:35 x_net-gotip-js-wasm net@07e05fd6 go@6e7aee5f x/net/http2.TestServer_Push_RejectAfterGoAway (log)
2024-01-10 15:35 x_net-gotip-wasip1-wasm_wasmtime net@07e05fd6 go@6e7aee5f x/net/http2.TestServer_Push_RejectAfterGoAway (log)
2024-01-10 15:35 x_net-gotip-wasip1-wasm_wazero net@07e05fd6 go@6e7aee5f x/net/http2.TestServer_Push_RejectAfterGoAway (log)
|
When readying a goroutine, the scheduler typically places the readied goroutine in pp.runnext, which will typically be the next goroutine to run in the schedule. In order to prevent a set of ping-pong goroutines from simply switching back and forth via runnext and starving the rest of the run queue, a goroutine scheduled via runnext shares a time slice (pp.schedtick) with the previous goroutine.

sysmon detects "long-running goroutines", which really means Ps using the same pp.schedtick for too long, and preempts them to allow the rest of the run queue to run. Thus this avoids starvation via runnext.

However, wasm has no threads, and thus no sysmon. Without sysmon to preempt, the possibility for starvation returns. Avoid this by disabling runnext entirely on wasm. This means that readied goroutines always go on the end of the run queue and thus cannot starve via runnext.

Note that this CL doesn't do anything about single long-running goroutines. Without sysmon to preempt them, a single goroutine that fails to yield will starve the run queue indefinitely.

For #65178.

Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Change-Id: I7dffe1e72c6586474186b72f8068cff77b661eae
Reviewed-on: https://go-review.googlesource.com/c/go/+/557437
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
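For contrast with the toy model earlier in the thread (again my own simplified sketch, not the CL's actual code): with runnext disabled, a readied goroutine just goes on the tail of the queue, so it runs after at most the goroutines that were already queued when it was readied.

package main

import "fmt"

func main() {
	var runq []string
	ready := func(g string) { runq = append(runq, g) } // plain FIFO, no runnext slot

	ready("send")  // the channel goroutines are already runnable
	ready("timer") // the AfterFunc goroutine is readied behind them

	for i := 0; i < 6; i++ {
		g := runq[0]
		runq = runq[1:]
		fmt.Printf("step %d: run %-5s queue=%v\n", i, g, runq)
		// Each channel goroutine readies its partner and blocks again.
		switch g {
		case "send":
			ready("recv")
		case "recv":
			ready("send")
		}
	}
	// Output: "timer" runs at step 1, right behind the goroutine that was
	// already queued when it was readied; it can no longer be displaced.
}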
Change https://go.dev/cl/559237 mentions this issue: |
This reverts CL 557437.

Reason for revert: Appears to have broken wasip1 builders.

For #65178.

Change-Id: I59c1a310eb56589c768536fe444c1efaf862f8b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/559237
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
Change https://go.dev/cl/559835 mentions this issue: |
Change https://go.dev/cl/559798 mentions this issue: |
@gopherbot, please backport to Go 1.22. The additional tests enabled in https://go.dev/cl/526117 made it more likely for this bug to be exposed in test runs, which may cause flakiness in pre-release (and backport CL) testing. |
Backport issue(s) opened: #65883 (for 1.22). Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases. |
Change https://go.dev/cl/566175 mentions this issue: |
This breaks an unbounded client-side retry loop if the server's timeout happens to fire during its final read of the TLS handshake. The retry loop was observed on wasm platforms at CL 557437.

I was also able to reproduce chains of dozens of retries on my linux/amd64 workstation by adjusting some timeouts and adding a couple of sleeps, as in this patch: https://gist.github.com/bcmills/d0a0a57e5f64eebc24e8211d8ea502b3
However, on linux/amd64 on my workstation the test always eventually breaks out of the retry loop due to timing jitter.

I couldn't find a retry-specific hook in the http.Client, http.Transport, or tls.Config structs, so I have instead abused the Transport.Proxy hook for this purpose. Separately, we may want to consider adding a retry-specific hook, or changing the net/http implementation to avoid transparently retrying in this case.

Fixes #65410.
Updates #65178.

Change-Id: I0e43c039615fe815f0a4ba99a8813c48b1fdc7e6
Reviewed-on: https://go-review.googlesource.com/c/go/+/559835
Reviewed-by: Damien Neil <dneil@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
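The CL above says it abuses the Transport.Proxy hook because there is no retry-specific hook. A small sketch of that general idea (my own illustration, not the actual test change; the URL and the cutoff of 10 attempts are placeholders): Transport.Proxy is consulted while setting up each attempt, so a test can count calls there and fail the request instead of letting the client retry indefinitely.

package main

import (
	"fmt"
	"net/http"
	"net/url"
	"sync/atomic"
)

func main() {
	var attempts atomic.Int32
	tr := &http.Transport{
		// Proxy is called while setting up each attempt, so it doubles as a
		// crude place to observe transparent retries.
		Proxy: func(req *http.Request) (*url.URL, error) {
			if n := attempts.Add(1); n > 10 {
				return nil, fmt.Errorf("giving up after %d attempts to %s", n, req.URL)
			}
			return nil, nil // nil URL means no proxy; connect directly
		},
	}
	client := &http.Client{Transport: tr}

	resp, err := client.Get("https://example.com/") // placeholder URL
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	resp.Body.Close()
	fmt.Println("status:", resp.Status)
}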
When readying a goroutine, the scheduler typically places the readied goroutine in pp.runnext, which will typically be the next goroutine to run in the schedule. In order to prevent a set of ping-pong goroutines from simply switching back and forth via runnext and starving the rest of the run queue, a goroutine scheduled via runnext shares a time slice (pp.schedtick) with the previous goroutine.

sysmon detects "long-running goroutines", which really means Ps using the same pp.schedtick for too long, and preempts them to allow the rest of the run queue to run. Thus this avoids starvation via runnext.

However, wasm has no threads, and thus no sysmon. Without sysmon to preempt, the possibility for starvation returns. Avoid this by disabling runnext entirely on wasm. This means that readied goroutines always go on the end of the run queue and thus cannot starve via runnext.

Note that this CL doesn't do anything about single long-running goroutines. Without sysmon to preempt them, a single goroutine that fails to yield will starve the run queue indefinitely.

For #65178.

Change-Id: I10859d088776125a2af8c9cd862b6e071da628b5
Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/559798
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
… on js and wasip1

For #65883.
Updates #65177.
Updates #65178.
Updates #64321.

Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32
Cq-Include-Trybots: luci.golang.try:go1.22-js-wasm,go1.22-wasip1-wasm_wasmtime,go1.22-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/557037
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Damien Neil <dneil@google.com>
(cherry picked from commit f19f31f)
Reviewed-on: https://go-review.googlesource.com/c/go/+/566175
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
@prattmic, with https://go.dev/cl/559798 merged, should we close this issue? (Or is there more to do?) |
I guess this doesn't address the case where a single long-running goroutine never yields. Not sure whether we want to consider that in-scope for this issue, or treat it as a separate feature request to support goroutine preemption on |
I think that should be a separate issue, or at least we should rename this to be more clearly about async preemption. |
Go version

go version devel go1.23-1653833811 Fri Jan 19 19:23:40 2024 +0000 linux/amd64

Output of go env in your module/workspace:

…

What did you do?

Add the following test to the time package: …

Then run it using: …

What did you see happen?

…

What did you expect to see?

The test completes 1000 iterations on wasip1/wasm in about 1.2 seconds on my machine, so 10000 iterations should only take ~12 seconds: …

Running native code on linux/amd64 is much faster still, but that's to be expected: …

(attn @golang/wasm)
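The go env output, the test body, and the command transcripts from the original report did not survive this extraction. Based on the minified variant posted in the thread above (whose diff indicates the original used a buffered channel), the reported test was presumably roughly of the following shape; the file, package, and test names here are placeholders, not the original code.

// approx_starvation_test.go (hypothetical): an external-test approximation
// using the exported time API rather than the time package's internal test
// style used in the thread.
package starvation_test

import (
	"sync"
	"sync/atomic"
	"testing"
	"time"
)

func TestAfterFuncStarvationBuffered(t *testing.T) {
	var (
		wg   sync.WaitGroup
		stop atomic.Bool
		c    = make(chan bool, 1) // buffered, per the "-" line in the minified diff above
	)
	wg.Add(2)

	go func() {
		for !stop.Load() {
			c <- true
		}
		close(c)
		wg.Done()
	}()

	go func() {
		for range c {
		}
		wg.Done()
	}()

	time.AfterFunc(1*time.Microsecond, func() { stop.Store(true) })
	wg.Wait()
}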