
runtime: scheduler sometimes starves a runnable goroutine on wasm platforms #65178

Open
bcmills opened this issue Jan 19, 2024 · 30 comments
Labels: arch-wasm (WebAssembly issues), compiler/runtime (Issues related to the Go compiler and/or runtime), NeedsInvestigation (Someone must examine and confirm this is a valid issue and not a duplicate of an existing one)
Milestone: Backlog

bcmills (Contributor) commented Jan 19, 2024

#!watchflakes
post <- goarch == "wasm" && pkg == "golang.org/x/net/http2" && test == "TestServer_Push_RejectAfterGoAway" && `timeout`

Go version

go version devel go1.23-1653833811 Fri Jan 19 19:23:40 2024 +0000 linux/amd64

Output of go env in your module/workspace:

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/usr/local/google/home/bcmills/.cache/go-build'
GOENV='/usr/local/google/home/bcmills/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/usr/local/google/home/bcmills/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/usr/local/google/home/bcmills'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/local/google/home/bcmills/go'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='local'
GOTOOLDIR='/usr/local/google/home/bcmills/go/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='devel go1.23-1653833811 Fri Jan 19 19:23:40 2024 +0000'
GCCGO='/usr/bin/gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='c++'
CGO_ENABLED='1'
GOMOD='/usr/local/google/home/bcmills/go/src/go.mod'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1312466808=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Add the following test to the time package:

func TestAfterFuncStarvation(t *testing.T) {
	var (
		wg sync.WaitGroup
		stop atomic.Bool
		c = make(chan bool, 1)
	)

	wg.Add(2)
	go func() {
		for !stop.Load() {
			c <- true
		}
		close(c)
		wg.Done()
	}()
	go func() {
		for range c {}
		wg.Done()
	}()

	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
	wg.Wait()
}

Then run it using:

GOARCH=wasm GOOS=wasip1 go test time -run=TestAfterFuncStarvation -count=10000

What did you see happen?

~/go/src$ GOARCH=wasm GOOS=wasip1 go test time -run=TestAfterFuncStarvation -count=10000
panic: test timed out after 10m0s
running tests:
        TestAfterFuncStarvation (9m56s)

goroutine 13835 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2366 +0x44
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5

goroutine 1 [chan receive]:
testing.(*T).Run(0x178b040, {0x6a007, 0x13}, 0xde460)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1750 +0x44
testing.runTests.func1(0x178b040)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2161 +0x3
testing.tRunner(0x178b040, 0x1499c80)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
testing.runTests(0x140a2a0, {0x37c140, 0x7f, 0x7f}, {0xc162d83908280698, 0x8bb3084da4, 0x37faa0})
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2159 +0x49
testing.(*M).Run(0x144e140)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2027 +0x6e
main.main()
        _testmain.go:445 +0x9

goroutine 13833 [runnable]:
time_test.TestAfterFuncStarvation.func2()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:102 +0x4
created by time_test.TestAfterFuncStarvation in goroutine 13831
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:101 +0xd

goroutine 13831 [semacquire, 9 minutes]:
sync.runtime_Semacquire(0x1790198)
        /usr/local/google/home/bcmills/go/src/runtime/sema.go:62 +0x2
sync.(*WaitGroup).Wait(0x1790190)
        /usr/local/google/home/bcmills/go/src/sync/waitgroup.go:116 +0xf
time_test.TestAfterFuncStarvation(0x178b1e0)
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:107 +0x12
testing.tRunner(0x178b1e0, 0xde460)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
created by testing.(*T).Run in goroutine 1
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1742 +0x43

goroutine 13832 [chan send]:
time_test.TestAfterFuncStarvation.func1()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:96 +0x3
created by time_test.TestAfterFuncStarvation in goroutine 13831
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:94 +0x9

goroutine 13834 [runnable]:
time_test.TestAfterFuncStarvation.func3()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:106
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5
FAIL    time    600.293s
FAIL

What did you expect to see?

The test completes 1000 iterations on wasip1/wasm in about 1.2 seconds on my machine, so 10000 iterations should only take ~12 seconds:

~/go/src$ GOARCH=wasm GOOS=wasip1 go test time -run=TestAfterFuncStarvation -count=1000
ok      time    1.228s

Running native code on linux/amd64 is much faster still, but that's to be expected:

go test time -run=TestAfterFuncStarvation -count=10000
ok      time    0.587s

(attn @golang/wasm)

@gopherbot gopherbot added the compiler/runtime Issues related to the Go compiler and/or runtime. label Jan 19, 2024
@bcmills bcmills added the arch-wasm WebAssembly issues label Jan 19, 2024
bcmills (Contributor Author) commented Jan 19, 2024

This also affects js/wasm: 100 iterations take <600ms, so 1000 iterations should take <6s (but times out after a full minute):

~/go/src$ GOARCH=wasm GOOS=js go test time -run=TestAfterFuncStarvation -count=100
ok      time    0.519s

~/go/src$ GOARCH=wasm GOOS=js go test time -run=TestAfterFuncStarvation -count=1000 -timeout=1m
panic: test timed out after 1m0s
running tests:
        TestAfterFuncStarvation (1m0s)

goroutine 2980 [running]:
testing.(*M).startAlarm.func1()
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2366 +0x44
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5

goroutine 1 [chan receive]:
testing.(*T).Run(0x151fa00, {0x6e19f, 0x17}, 0xe09b0)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1750 +0x44
testing.runTests.func1(0x151fa00)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2161 +0x3
testing.tRunner(0x151fa00, 0x1493c80)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
testing.runTests(0x140a360, {0x37fc60, 0x7f, 0x7f}, {0xc162d954b7abb580, 0xdf9e00901, 0x38b6e0})
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2159 +0x49
testing.(*M).Run(0x144c3c0)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:2027 +0x6e
main.main()
        _testmain.go:445 +0x9

goroutine 6 [waiting]:
runtime.gopark(0x0, 0x0, 0x0, 0x0, 0x1)
        /usr/local/google/home/bcmills/go/src/runtime/proc.go:402 +0x25
runtime.handleEvent()
        /usr/local/google/home/bcmills/go/src/runtime/lock_js.go:296 +0x25
runtime.goexit({})
        /usr/local/google/home/bcmills/go/src/runtime/asm_wasm.s:434 +0x1

goroutine 2976 [semacquire]:
sync.runtime_Semacquire(0x140c188)
        /usr/local/google/home/bcmills/go/src/runtime/sema.go:62 +0x2
sync.(*WaitGroup).Wait(0x140c180)
        /usr/local/google/home/bcmills/go/src/sync/waitgroup.go:116 +0xf
time_test.TestAfterFuncStarvation(0x151fba0)
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:107 +0x12
testing.tRunner(0x151fba0, 0xe09b0)
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1689 +0xd
created by testing.(*T).Run in goroutine 1
        /usr/local/google/home/bcmills/go/src/testing/testing.go:1742 +0x43

goroutine 2977 [chan send]:
time_test.TestAfterFuncStarvation.func1()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:96 +0x3
created by time_test.TestAfterFuncStarvation in goroutine 2976
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:94 +0x9

goroutine 2978 [runnable]:
time_test.TestAfterFuncStarvation.func2()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:102 +0x4
created by time_test.TestAfterFuncStarvation in goroutine 2976
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:101 +0xd

goroutine 2979 [runnable]:
time_test.TestAfterFuncStarvation.func3()
        /usr/local/google/home/bcmills/go/src/time/sleep_test.go:106
created by time.goFunc
        /usr/local/google/home/bcmills/go/src/time/sleep.go:177 +0x5
FAIL    time    60.358s
FAIL

prattmic (Member) commented:

Does this also affect linux-amd64 if GOMAXPROCS=1?

bcmills (Contributor Author) commented Jan 19, 2024

No, it does not: if I update the test to explicitly call runtime.GOMAXPROCS(1), it continues to pass even after 100000 runs.

gopherbot (Contributor) commented:

Change https://go.dev/cl/557037 mentions this issue: net: work around runtime scheduler starvation on js and wasip1

gopherbot (Contributor) commented:

Change https://go.dev/cl/557038 mentions this issue: time: add a regression test for time.AfterFunc goroutine starvation

prattmic (Member) commented Jan 19, 2024

At a theoretical level, this doesn't surprise me, and I would expect it to affect linux-amd64 with GOMAXPROCS=1 as well.

This test always has at least one runnable goroutine (one of the channel send/recv goroutines). In fact, because the channel is buffered, both goroutines may in theory be runnable all of the time.

When the timer fires, a third goroutine will become runnable [1].

We have one thread on which to run three goroutines, so there must be some scheduling delay. Go does not have a real-time scheduler, or even a best-effort deadline-based scheduler, so it can theoretically have unbounded scheduling delay.

Readying a goroutine typically puts it on the head of the run queue to get it to run soon. But since that applies to almost every ready, the timer firing could put goroutine 3 on the head of the run queue, and then goroutine 1 or 2 could replace the head before the scheduler runs again.

If you dump the /sched/latencies:seconds histogram from runtime/metrics, I'd expect to see a long tail of long scheduling delays.

There are certainly many improvements that could go into the scheduler to try to make this situation better (#51071 comes to mind).

The more interesting case would be if this is actually completely stuck. i.e., the timer expiration or ready goroutine was somehow lost altogether. Figuring that out by running a test is a bit of a halting problem situation, but looking at runtime/metrics is probably a good start. Do we have many cases of long scheduling delay (but not enough to cause a timeout)? Or do things seem totally fine until one case gets stuck forever?

[1] We always run timers before looking for other work in the scheduler, so the timer should always fire. But an AfterFunc timer just starts a goroutine to run the func, which is the third goroutine.
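
As a minimal sketch of the kind of check suggested above (not code from this issue), the following program reads the /sched/latencies:seconds histogram via the runtime/metrics package and prints its non-empty buckets:

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// Read the scheduling-latency histogram. The metric name is part of the
	// runtime/metrics API, but how it is populated depends on the Go version.
	samples := []metrics.Sample{{Name: "/sched/latencies:seconds"}}
	metrics.Read(samples)

	if samples[0].Value.Kind() != metrics.KindFloat64Histogram {
		fmt.Println("metric unavailable")
		return
	}
	h := samples[0].Value.Float64Histogram()
	// Buckets has one more element than Counts: bucket i covers
	// [Buckets[i], Buckets[i+1]).
	for i, count := range h.Counts {
		if count == 0 {
			continue
		}
		fmt.Printf("[%g, %g) -> %d\n", h.Buckets[i], h.Buckets[i+1], count)
	}
}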

bcmills (Contributor Author) commented Jan 19, 2024

I agree that the runtime is theoretically allowed to produce this behavior according to the language spec.

However, the addition of non-cooperative preemption (#24543) set up a strong expectation that every runnable goroutine will eventually be scheduled, and resolved a large number of previous scheduler fairness issues (linked from that proposal).

We also explicitly made mutex locking starvation-free in #13086. To me, that suggests that this sort of goroutine starvation is not intended behavior.

This sort of starvation shows up in realistic programs that use time.AfterFunc to interrupt two or more running goroutines, as might be the case in a benchmark or stress-test. (See, for example, #65177.) It is sometimes possible to work around that starvation by making explicit calls to runtime.Gosched (as in https://go.dev/cl/557037), but arguably that's exactly the sort of explicit scheduler management that #24543 was intended to address.
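
For illustration only (this is not the CL 557037 change, which adds the yield in the wasm fake-net stack in the net package), here is a sketch of that explicit-yield workaround applied to the repro test above; whether it fully avoids the starvation there has not been verified here:

package main

import (
	"runtime"
	"sync"
	"sync/atomic"
	"time"
)

func main() {
	var (
		wg   sync.WaitGroup
		stop atomic.Bool
		c    = make(chan bool, 1)
	)

	wg.Add(2)
	go func() {
		defer wg.Done()
		for !stop.Load() {
			c <- true
			// Explicit yield: gives the AfterFunc goroutine a chance to run
			// even though this goroutine would otherwise stay runnable forever.
			runtime.Gosched()
		}
		close(c)
	}()
	go func() {
		defer wg.Done()
		for range c {
		}
	}()

	time.AfterFunc(1*time.Microsecond, func() { stop.Store(true) })
	wg.Wait()
}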

@bcmills bcmills changed the title runtime: scheduler sometimes starves a runnable goroutine on wasip1/wasm runtime: scheduler sometimes starves a runnable goroutine on wasm platforms Jan 19, 2024
gopherbot pushed a commit that referenced this issue Jan 19, 2024
Fixes #65177.
Updates #65178.
Updates #64321.

Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32
Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/557037
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Damien Neil <dneil@google.com>
prattmic (Member) commented:

I think I worded my previous message poorly and came across too strongly. I don't think this is good behavior; I meant that there are two broad classes this problem may fall into:

  1. There is nothing fundamentally wrong, but the existing design is producing poor behavior.
  2. Something is fundamentally broken, and this wakeup is lost/gone and no amount of waiting would resolve it.

bcmills (Contributor Author) commented Jan 20, 2024

Ah, yeah. I think the evidence so far supports (1) — the goroutine dumps from the timed-out tests list the AfterFunc goroutine as runnable; it just isn't actually being run.

(The fact that it is listed as runnable, and the fact that adding an explicit runtime.Gosched in https://go.dev/cl/557037 seems to fix the starvation there, suggest that the goroutine is successfully being marked schedulable; it just isn't getting scheduled.)

Jorropo (Member) commented Jan 22, 2024

I thought this might be passing on amd64 because the goroutines were being non-cooperatively preempted by the GC, which could restart execution at random points. But trying with GOGC=off GOMAXPROCS=1 still doesn't let me reproduce.

Jorropo (Member) commented Jan 22, 2024

An even more minimal example reproduces for me on wasip1:

 func TestAfterFuncStarvation(t *testing.T) {
 	var (
 		wg sync.WaitGroup
 		stop atomic.Bool
-		c = make(chan bool, 1)
+		c = make(chan bool)
 	)
 
 	wg.Add(2)
 	go func() {
 		for !stop.Load() {
 			c <- true
 		}
 		close(c)
 		wg.Done()
 	}()
 	go func() {
 		for range c {}
 		wg.Done()
 	}()
 
 	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
 	wg.Wait()
 }

However, it no longer fails reliably with -count=1000; with -count=10000 it does fail (10 out of 10 attempts). I don't know what that means, though.

Jorropo (Member) commented Jan 22, 2024

Even more surprisingly, this passes on GOARCH=amd64 GOMAXPROCS=1 (even with GOGC=off):

func TestAfterFuncStarvation2(t *testing.T) {
	var stop atomic.Bool
	AfterFunc(1 * Microsecond, func() { stop.Store(true) })
	for !stop.Load() {}
}

But, as expected, it never completes on wasip1.

Jorropo (Member) commented Jan 22, 2024

I am able to repro on GOARCH=amd64 GOMAXPROCS=1 by adding && false to this if clause:

if GOARCH != "wasm" { // no threads on wasm yet, so no sysmon

-count=10000 passes in 0.173s, but -count=100000 times out with equivalent stack traces.

This explains why wasm is special here: the other architectures have a watchdog.

gopherbot pushed a commit that referenced this issue Jan 22, 2024
The test is skipped on wasm platforms for now, because it
successfully detects a starvation bug on those platforms.

For #65178.

Change-Id: I05d28f1c7be99fcab67ec4dfaa38f412e11fd3cb
Reviewed-on: https://go-review.googlesource.com/c/go/+/557038
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Damien Neil <dneil@google.com>
prattmic (Member) commented:

Thanks for digging @Jorropo! sysmon preempts long-running goroutines, so without that preemption, the tight stop.Load() loop won't ever yield. That said, even with sysmon, the right combinations of goroutines should still theoretically be able to trigger long wait times.

Still, the lack of sysmon on wasm makes me wonder if we should be more aggressive about self-preemption there.

Jorropo (Member) commented Jan 22, 2024

I had the same question. I don't see how to solve the tight-loop case without the compiler aggressively injecting runtime.Gosched() calls into loops (maybe with a local counter so that it only yields every X iterations, or based on monotonic time if that is cheap).

I find it interesting that the original example reproduces on amd64 after I disable sysmon.
I haven't looked at the scheduler code in a while, but if I remember correctly there is a special slot in the work list for a hot goroutine. Could it be that the two goroutines keep swapping each other through that hot slot?
I'd guess that every 4096 schedules or so (or better, based on monotonic time) we should pick a random goroutine from our work queue instead of the hot one (or whichever one is first in the queue).
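
As a rough, hand-written sketch of what such compiler-injected yields might amount to at the source level (the helper name and the 4096-iteration threshold are invented purely for illustration):

package main

import (
	"runtime"
	"sync/atomic"
	"time"
)

// spinUntil busy-waits until stop becomes true, yielding cooperatively every
// yieldEvery iterations so that other runnable goroutines (such as the
// AfterFunc goroutine) can be scheduled even without sysmon-driven preemption.
func spinUntil(stop *atomic.Bool) {
	const yieldEvery = 4096
	for i := 1; !stop.Load(); i++ {
		if i%yieldEvery == 0 {
			runtime.Gosched()
		}
	}
}

func main() {
	var stop atomic.Bool
	time.AfterFunc(1*time.Microsecond, func() { stop.Store(true) })
	spinUntil(&stop)
}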

aclements (Member) commented:

I'd guess that every 4096 schedules or so (or better, based on monotonic time) we should pick a random goroutine from our work queue instead of the hot one (or whichever one is first in the queue).

On non-wasm platforms, we already do this using the "inheritTime" result from runqget. What's supposed to happen is that goroutines can switch off in runnext (the "hot slot"), but they share a single time slice and will be preempted in favor of another goroutine if that time slice runs out. But sysmon is responsible for time-slice preemption. You're right that without that, things can be arbitrarily unfair.
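
To make that mechanism concrete, here is a deliberately simplified toy model of the runnext slot and the inheritTime result; the real implementation lives in runtime/proc.go (runqput/runqget) and differs in many details:

package main

import "fmt"

// Toy model only: this just illustrates the policy described above, where a
// goroutine readied into runnext runs next but inherits the current time
// slice instead of starting a fresh one.
type gDesc struct{ name string }

type pDesc struct {
	runnext *gDesc   // "hot slot": run this goroutine next, on the same time slice
	runq    []*gDesc // ordinary FIFO run queue
}

// get returns the next goroutine to run and whether it inherits the current
// time slice (true when it came from the runnext slot). Without something
// like sysmon preempting when that shared slice expires, two goroutines that
// keep readying each other into runnext can starve runq indefinitely.
func (pp *pDesc) get() (*gDesc, bool) {
	if gp := pp.runnext; gp != nil {
		pp.runnext = nil
		return gp, true
	}
	if len(pp.runq) > 0 {
		gp := pp.runq[0]
		pp.runq = pp.runq[1:]
		return gp, false
	}
	return nil, false
}

func main() {
	pp := &pDesc{
		runnext: &gDesc{name: "ping"},
		runq:    []*gDesc{{name: "timer func"}},
	}
	partners := []string{"pong", "ping", "pong"}
	for i := 0; i < 3; i++ {
		gp, inherit := pp.get()
		fmt.Printf("run %s (inheritTime=%v)\n", gp.name, inherit)
		pp.runnext = &gDesc{name: partners[i]} // the running goroutine readies its partner
	}
	// "timer func" is still sitting in pp.runq: it never reached the front.
	fmt.Println("still queued:", pp.runq[0].name)
}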

gopherbot (Contributor) commented:

Change https://go.dev/cl/557615 mentions this issue: http2: avoid goroutine starvation in TestServer_Push_RejectAfterGoAway

gopherbot pushed a commit to golang/net that referenced this issue Jan 22, 2024
CL 557037 added a runtime.Gosched to prevent goroutine starvation
in the wasm fake-net stack. Unfortunately, that Gosched causes the
scheduler to enter a very similar starvation loop in this test.

Add another runtime.Gosched to break this new loop.

For golang/go#65178.

Change-Id: I24b3f50dd728800462f71f27290b0d8f99d5ae5b
Cq-Include-Trybots: luci.golang.try:x_net-gotip-wasip1-wasm_wasmtime,x_net-gotip-wasip1-wasm_wazero,x_net-gotip-js-wasm
Reviewed-on: https://go-review.googlesource.com/c/net/+/557615
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot (Contributor) commented:

Change https://go.dev/cl/557437 mentions this issue: runtime: disable use of runnext on wasm

prattmic (Member) commented:

With https://go.dev/cl/557437, the original repro no longer hangs. It does nothing for goroutines that simply never yield at all, like #65178 (comment).

bcmills (Contributor Author) commented Jan 22, 2024

That sounds great! I think it's reasonable to fix only goroutines that actually yield for now, since that sort of live-lock is much easier to diagnose. (For the single-goroutine case, the live-locked goroutine will actually be running in a goroutine dump, as opposed to “running some of the time”.)

@cherrymui cherrymui added the NeedsInvestigation Someone must examine and confirm this is a valid issue and not a duplicate of an existing one. label Jan 23, 2024
@cherrymui cherrymui added this to the Backlog milestone Jan 23, 2024
@mknyszek mknyszek moved this from Todo to In Progress in Go Compiler / Runtime Jan 24, 2024
gopherbot (Contributor) commented:

Found new dashboard test flakes for:

#!watchflakes
post <- goarch == "wasm" && pkg == "golang.org/x/net/http2" && test == "TestServer_Push_RejectAfterGoAway" && `timeout`
2024-01-10 15:35 x_net-gotip-js-wasm net@07e05fd6 go@6e7aee5f x/net/http2.TestServer_Push_RejectAfterGoAway (log)
=== RUN   TestServer_Push_RejectAfterGoAway
    server_push_test.go:517: timeout waiting for GOAWAY to be processed
    server_test.go:265: Framer read log:
        2024-01-22 21:12:19.062 Framer 0x1b73340: read SETTINGS len=30, settings: MAX_FRAME_SIZE=1048576, MAX_CONCURRENT_STREAMS=250, MAX_HEADER_LIST_SIZE=1048896, HEADER_TABLE_SIZE=4096, INITIAL_WINDOW_SIZE=1048576
        2024-01-22 21:12:19.063 Framer 0x1b73340: read WINDOW_UPDATE len=4 (conn) incr=983041
        2024-01-22 21:12:19.063 Framer 0x1b73340: read SETTINGS flags=ACK len=0
    server_test.go:271: Framer write log:
        2024-01-22 21:12:19.059 Framer 0x1b73340: wrote SETTINGS len=0
        2024-01-22 21:12:19.062 Framer 0x1b73340: wrote SETTINGS flags=ACK len=0
        2024-01-22 21:12:19.063 Framer 0x1b73340: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=14
        2024-01-22 21:12:19.063 Framer 0x1b73340: wrote GOAWAY len=8 LastStreamID=1 ErrCode=NO_ERROR Debug=""
--- FAIL: TestServer_Push_RejectAfterGoAway (5.08s)
2024-01-10 15:35 x_net-gotip-wasip1-wasm_wasmtime net@07e05fd6 go@6e7aee5f x/net/http2.TestServer_Push_RejectAfterGoAway (log)
=== RUN   TestServer_Push_RejectAfterGoAway
    server_push_test.go:517: timeout waiting for GOAWAY to be processed
    server_test.go:265: Framer read log:
        2024-01-22 21:13:09.783785341 Framer 0x1b57340: read SETTINGS len=30, settings: MAX_FRAME_SIZE=1048576, MAX_CONCURRENT_STREAMS=250, MAX_HEADER_LIST_SIZE=1048896, HEADER_TABLE_SIZE=4096, INITIAL_WINDOW_SIZE=1048576
        2024-01-22 21:13:09.784986297 Framer 0x1b57340: read WINDOW_UPDATE len=4 (conn) incr=983041
        2024-01-22 21:13:09.785064437 Framer 0x1b57340: read SETTINGS flags=ACK len=0
    server_test.go:271: Framer write log:
        2024-01-22 21:13:09.758268841 Framer 0x1b57340: wrote SETTINGS len=0
        2024-01-22 21:13:09.783854045 Framer 0x1b57340: wrote SETTINGS flags=ACK len=0
        2024-01-22 21:13:09.785151616 Framer 0x1b57340: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=14
        2024-01-22 21:13:09.785232497 Framer 0x1b57340: wrote GOAWAY len=8 LastStreamID=1 ErrCode=NO_ERROR Debug=""
--- FAIL: TestServer_Push_RejectAfterGoAway (5.22s)
2024-01-10 15:35 x_net-gotip-wasip1-wasm_wazero net@07e05fd6 go@6e7aee5f x/net/http2.TestServer_Push_RejectAfterGoAway (log)
=== RUN   TestServer_Push_RejectAfterGoAway
    server_push_test.go:517: timeout waiting for GOAWAY to be processed
    server_test.go:265: Framer read log:
        2024-01-22 21:12:23.865547127 Framer 0x1b59340: read SETTINGS len=30, settings: MAX_FRAME_SIZE=1048576, MAX_CONCURRENT_STREAMS=250, MAX_HEADER_LIST_SIZE=1048896, HEADER_TABLE_SIZE=4096, INITIAL_WINDOW_SIZE=1048576
        2024-01-22 21:12:23.867103713 Framer 0x1b59340: read WINDOW_UPDATE len=4 (conn) incr=983041
        2024-01-22 21:12:23.86719788 Framer 0x1b59340: read SETTINGS flags=ACK len=0
    server_test.go:271: Framer write log:
        2024-01-22 21:12:23.861394123 Framer 0x1b59340: wrote SETTINGS len=0
        2024-01-22 21:12:23.865599663 Framer 0x1b59340: wrote SETTINGS flags=ACK len=0
        2024-01-22 21:12:23.86732396 Framer 0x1b59340: wrote HEADERS flags=END_STREAM|END_HEADERS stream=1 len=14
        2024-01-22 21:12:23.867434532 Framer 0x1b59340: wrote GOAWAY len=8 LastStreamID=1 ErrCode=NO_ERROR Debug=""
--- FAIL: TestServer_Push_RejectAfterGoAway (5.12s)

watchflakes

gopherbot pushed a commit that referenced this issue Jan 29, 2024
When readying a goroutine, the scheduler typically places the readied
goroutine in pp.runnext, which will typically be the next goroutine to
run in the schedule.

In order to prevent a set of ping-pong goroutines from simply switching
back and forth via runnext and starving the rest of the run queue, a
goroutine scheduled via runnext shares a time slice (pp.schedtick) with
the previous goroutine.

sysmon detects "long-running goroutines", which really means Ps using
the same pp.schedtick for too long, and preempts them to allow the rest
of the run queue to run. Thus this avoids starvation via runnext.

However, wasm has no threads, and thus no sysmon. Without sysmon to
preempt, the possibility for starvation returns. Avoid this by disabling
runnext entirely on wasm. This means that readied goroutines always go
on the end of the run queue and thus cannot starve via runnext.

Note that this CL doesn't do anything about single long-running
goroutines. Without sysmon to preempt them, a single goroutine that
fails to yield will starve the run queue indefinitely.

For #65178.

Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Change-Id: I7dffe1e72c6586474186b72f8068cff77b661eae
Reviewed-on: https://go-review.googlesource.com/c/go/+/557437
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Mauri de Souza Meneguzzo <mauri870@gmail.com>
Reviewed-by: Michael Knyszek <mknyszek@google.com>
Reviewed-by: Bryan Mills <bcmills@google.com>
gopherbot (Contributor) commented:

Change https://go.dev/cl/559237 mentions this issue: Revert "runtime: disable use of runnext on wasm"

gopherbot pushed a commit that referenced this issue Jan 30, 2024
This reverts CL 557437.

Reason for revert: Appears to have broken wasip1 builders.

For #65178.

Change-Id: I59c1a310eb56589c768536fe444c1efaf862f8b0
Reviewed-on: https://go-review.googlesource.com/c/go/+/559237
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Michael Pratt <mpratt@google.com>
Reviewed-by: Dmitri Shuralyov <dmitshur@golang.org>
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
gopherbot (Contributor) commented:

Change https://go.dev/cl/559835 mentions this issue: net/http: set a client timeout in TestServerWriteTimeout

gopherbot (Contributor) commented:

Change https://go.dev/cl/559798 mentions this issue: runtime: disable use of runnext on wasm

bcmills (Contributor Author) commented Feb 22, 2024

@gopherbot, please backport to Go 1.22. The additional tests enabled in https://go.dev/cl/526117 made it more likely for this bug to be exposed in test runs, which may cause flakiness in pre-release (and backport CL) testing.

gopherbot (Contributor) commented:

Backport issue(s) opened: #65883 (for 1.22).

Remember to create the cherry-pick CL(s) as soon as the patch is submitted to master, according to https://go.dev/wiki/MinorReleases.

@gopherbot gopherbot added the Testing An issue that has been verified to require only test changes, not just a test failure. label Feb 22, 2024
gopherbot (Contributor) commented:

Change https://go.dev/cl/566175 mentions this issue: [release-branch.go1.22] net: work around runtime scheduler starvation on js and wasip1

@bcmills bcmills removed the Testing An issue that has been verified to require only test changes, not just a test failure. label Feb 22, 2024
gopherbot pushed a commit that referenced this issue Feb 26, 2024
This breaks an unbounded client-side retry loop if the server's
timeout happens to fire during its final read of the TLS handshake.

The retry loop was observed on wasm platforms at CL 557437.
I was also able to reproduce chains of dozens of retries on my
linux/amd64 workstation by adjusting some timeouts and adding a couple
of sleeps, as in this patch:
https://gist.github.com/bcmills/d0a0a57e5f64eebc24e8211d8ea502b3
However, on linux/amd64 on my workstation the test always eventually
breaks out of the retry loop due to timing jitter.

I couldn't find a retry-specific hook in the http.Client,
http.Transport, or tls.Config structs, so I have instead abused the
Transport.Proxy hook for this purpose. Separately, we may want to
consider adding a retry-specific hook, or changing the net/http
implementation to avoid transparently retrying in this case.

Fixes #65410.
Updates #65178.

Change-Id: I0e43c039615fe815f0a4ba99a8813c48b1fdc7e6
Reviewed-on: https://go-review.googlesource.com/c/go/+/559835
Reviewed-by: Damien Neil <dneil@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
gopherbot pushed a commit that referenced this issue Feb 27, 2024
When readying a goroutine, the scheduler typically places the readied
goroutine in pp.runnext, which will typically be the next goroutine to
run in the schedule.

In order to prevent a set of ping-pong goroutines from simply switching
back and forth via runnext and starving the rest of the run queue, a
goroutine scheduled via runnext shares a time slice (pp.schedtick) with
the previous goroutine.

sysmon detects "long-running goroutines", which really means Ps using
the same pp.schedtick for too long, and preempts them to allow the rest
of the run queue to run. Thus this avoids starvation via runnext.

However, wasm has no threads, and thus no sysmon. Without sysmon to
preempt, the possibility for starvation returns. Avoid this by disabling
runnext entirely on wasm. This means that readied goroutines always go
on the end of the run queue and thus cannot starve via runnext.

Note that this CL doesn't do anything about single long-running
goroutines. Without sysmon to preempt them, a single goroutine that
fails to yield will starve the run queue indefinitely.

For #65178.

Change-Id: I10859d088776125a2af8c9cd862b6e071da628b5
Cq-Include-Trybots: luci.golang.try:gotip-js-wasm,gotip-wasip1-wasm_wasmtime,gotip-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/559798
Auto-Submit: Bryan Mills <bcmills@google.com>
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
gopherbot pushed a commit that referenced this issue Feb 28, 2024
… on js and wasip1

For #65883.
Updates #65177.
Updates #65178.
Updates #64321.

Change-Id: I698fd3b688c7dfbde692eb7c29cbdafc89e7ca32
Cq-Include-Trybots: luci.golang.try:go1.22-js-wasm,go1.22-wasip1-wasm_wasmtime,go1.22-wasip1-wasm_wazero
Reviewed-on: https://go-review.googlesource.com/c/go/+/557037
Auto-Submit: Bryan Mills <bcmills@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Damien Neil <dneil@google.com>
(cherry picked from commit f19f31f)
Reviewed-on: https://go-review.googlesource.com/c/go/+/566175
Reviewed-by: Dmitri Shuralyov <dmitshur@google.com>
bcmills (Contributor Author) commented Mar 4, 2024

@prattmic, with https://go.dev/cl/559798 merged, should we close this issue? (Or is there more to do?)

bcmills (Contributor Author) commented Mar 4, 2024

I guess this doesn't address the case where a single long-running goroutine never yields. Not sure whether we want to consider that in-scope for this issue, or treat it as a separate feature request to support goroutine preemption on js/wasm. 🤔

prattmic (Member) commented Mar 7, 2024

I think that should be a separate issue, or at least we should rename this to be more clearly about async preemption.
