Description
findRunnable makes a copy of the slice header of allp because, once findRunnable drops its P, a STW can change allp without synchronization (changing its length and possibly allocating a new backing array) via procresize (on a GOMAXPROCS change).
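As an illustration of why a copied slice header can end up pointing at a stale backing array, here is a minimal sketch in ordinary user-level Go. It is not runtime code: `resize` and `demo` are hypothetical names, and `resize` only loosely mimics a grow-on-demand resize that reallocates when the requested length exceeds the capacity.

```go
package main

import "fmt"

// resize is a hypothetical stand-in for a grow-style resize: it reslices when
// the capacity allows and otherwise allocates a new backing array.
func resize(ps []int, n int) []int {
	if n <= cap(ps) {
		return ps[:n]
	}
	nps := make([]int, n)
	copy(nps, ps)
	return nps
}

// demo copies a slice header, resizes the original past its capacity, and
// reports what each header observes afterwards.
func demo() (snap0, cur0, snapLen, curLen int) {
	allp := []int{10, 11, 12, 13}
	snapshot := allp // copies only the header (pointer, len, cap)

	allp = resize(allp, 8) // exceeds cap 4, so a new backing array is allocated
	allp[0] = 100          // lands in the new array only

	return snapshot[0], allp[0], len(snapshot), len(allp)
}

func main() {
	s0, a0, ls, la := demo()
	fmt.Printf("snapshot[0]=%d allp[0]=%d len(snapshot)=%d len(allp)=%d\n", s0, a0, ls, la)
	// prints: snapshot[0]=10 allp[0]=100 len(snapshot)=4 len(allp)=8
}
```

In this user-level sketch the old array stays alive because the GC scans `snapshot`. The issue described here arises precisely because that scan does not happen for allpSnapshot.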
"Possibly allocating a new backing array" is the problem, since allp is simply a standard heap allocation. allpSnapshot is on the system stack of an M without a P. This means that (a) STW can proceed without stopping the M in findRunnable and (b) the GC will not scan allpSnapshot. Thus, we could have this sequence:
- M1 copies the allp slice header to allpSnapshot.
- M1 drops its P.
- M2 calls runtime.GOMAXPROCS.
- STW need not stop M1.
- procresize reallocates the allp backing array. The old array is now only referenced by M1's allpSnapshot.
- World restarts.
- M2 triggers a GC.
- STW need not stop M1.
- GC does not scan M1's system stack, so it does not find a reference to the old allp array.
- Old allp array is freed.
- World restarts.
- M2 allocates something which happens to reuse the same memory as the old allp array, which zeroes it (and then maybe writes to it).
- M1 reads from allpSnapshot, reading the now-clobbered array.
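The sequence above can be walked through deterministically with a toy model. This is only a sketch under heavy assumptions: `toyHeap`, `alloc`, `collect`, and `simulate` are all hypothetical, "memory" is a slice of cells addressed by index, and the "GC" frees any cell that is not passed in as a root. None of it reflects how the real Go allocator or collector is implemented. It only shows how a reference that is never scanned ends up reading reused, zeroed memory.

```go
package main

import "fmt"

// toyHeap is a hypothetical heap: each cell stands in for one backing array.
type toyHeap struct {
	cells [][]int
	freed []bool
}

// alloc reuses the first freed cell (zeroing it) or appends a new one.
func (h *toyHeap) alloc(n int) int {
	for i, f := range h.freed {
		if f {
			h.cells[i] = make([]int, n) // reuse zeroes the old contents
			h.freed[i] = false
			return i
		}
	}
	h.cells = append(h.cells, make([]int, n))
	h.freed = append(h.freed, false)
	return len(h.cells) - 1
}

// collect frees every cell that is not listed as a root.
func (h *toyHeap) collect(roots ...int) {
	reachable := make(map[int]bool, len(roots))
	for _, r := range roots {
		reachable[r] = true
	}
	for i := range h.cells {
		if !reachable[i] {
			h.freed[i] = true
		}
	}
}

// simulate replays the sequence and reports what M1 reads before and after.
func simulate() (before, after int, reused bool) {
	h := &toyHeap{}
	allp := h.alloc(4)
	for i := range h.cells[allp] {
		h.cells[allp][i] = i + 1 // non-zero stand-ins for P pointers
	}

	snapshot := allp // M1 copies the "slice header", then drops its P
	before = h.cells[snapshot][0]

	old := allp
	allp = h.alloc(8) // procresize reallocates the backing array
	copy(h.cells[allp], h.cells[old])

	h.collect(allp) // GC scans allp but not M1's snapshot

	other := h.alloc(4) // M2's allocation reuses and zeroes the old cell
	after = h.cells[snapshot][0]
	return before, after, other == snapshot
}

func main() {
	before, after, reused := simulate()
	fmt.Printf("before=%d after=%d reused=%t\n", before, after, reused)
	// prints: before=1 after=0 reused=true
}
```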
This is only possible if GOMAXPROCS increases beyond the initial startup value (to trigger reallocation).
This also requires M1 to run really slowly to lose the race. M2 needs to do multiple stop-the-worlds and run an entire GC, all before M1 manages to finish using allpSnapshot. That seems pretty far-fetched, but could be possible if the kernel deschedules M1.
cc @golang/runtime