Description
The JIT will widen stack loads of small-typed locals. When one of those wide loads closely follows a narrower store to the same local, the store cannot be forwarded to the load, which can cause a lengthy store-forwarding stall.
One such example is seen in the System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) benchmark when PGO is enabled (see #84264 (comment)):
G_M000_IG05: ;; offset=0035H
mov word ptr [rsp+48H], 0 // narrow store (struct init)
mov eax, dword ptr [rsp+48H] // wide load
or al, byte ptr [rsp+49H]
jne G_M000_IG14
This transformation comes about in morph, as part of fgMorphCastedBitwiseOp:
fgMorphTree BB06, STMT00014 (before)
[000065] --C-------- * JTRUE void
[000064] --C-------- \--* EQ int
[000161] ----------- +--* EQ int
[000159] ----------- | +--* OR int
[000155] ----------- | | +--* LCL_VAR ubyte (AX) V58 tmp54
[000158] ----------- | | \--* LCL_VAR ubyte (AX) V59 tmp55
[000160] ----------- | \--* CNS_INT int 0
[000063] ----------- \--* CNS_INT int 0
fgMorphTree BB06, STMT00014 (after)
[000065] ----G+----- * JTRUE void
[000161] J---G+-N--- \--* NE int
[000624] ----G+----- +--* CAST int <- ubyte <- int
[000159] ----G------ | \--* OR int
[000155] ----G+----- | +--* LCL_VAR int (AX) V58 tmp54
[000158] ----G+----- | \--* LCL_VAR int (AX) V59 tmp55
[000160] -----+----- \--* CNS_INT int 0
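For readers less familiar with the dump: the rewrite is value-preserving because OR works bit by bit, so truncating after a wide OR yields the same byte as OR-ing the two bytes directly. Below is a minimal C sketch of that equivalence. It is not JIT code; the helper names (`load32_le`, `or_narrow`, `or_widened`) and the byte-array model of the stack slot are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical model of a widened stack load: read 4 bytes starting at p,
   assembled in little-endian order as on x64. */
static uint32_t load32_le(const uint8_t *p)
{
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* "Before" shape: OR the two ubyte locals directly. */
static uint8_t or_narrow(const uint8_t *slot)
{
    return (uint8_t)(slot[0] | slot[1]);
}

/* "After" shape: OR two int-sized loads, then truncate to ubyte.
   The upper bytes of each load are unrelated neighbors; the cast drops them. */
static uint8_t or_widened(const uint8_t *slot)
{
    return (uint8_t)(load32_le(slot) | load32_le(slot + 1));
}
```

Both functions return the same byte for any slot contents, so the widened form is correct. The cost is purely in how the operands are read: each becomes a 4-byte stack load that may overlap a narrower store issued just before it.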
In the case cited above this causes a roughly 3x slowdown in the test, and a point fix that disables this for normalize-on-load locals recovers the missing perf.
This particular pattern only arises with PGO, as most of this method is cold and the struct backing V58/V59 is left exposed by some calls in cold blocks that don't get inlined. But the same thing could happen even without PGO.
With narrow loads, the same block looks like this:
G_M000_IG05: ;; offset=0035H
mov word ptr [rsp+48H], 0
movzx rax, byte ptr [rsp+48H] // narrow load, extended
movzx rcx, byte ptr [rsp+49H]
or eax, ecx
jne G_M000_IG14
I played around with a broader fix, modifying xarch's genCodeForLclVar to do narrower loads for normalize-on-load locals, and that led to quite a few diffs. At least in my limited checking I only saw diffs in Tier0 code. So that would suggest that in optimized code we don't run into this all that often, but it can happen.
cc @dotnet/jit-contrib