JIT: widening stack memory loads can cause store-forward stalls #85957

Open
@AndyAyersMS

The JIT will widen stack loads of small-typed locals. When one of those wide loads appears closely after a small store to the same local, it can cause a lengthy store-forward stall.

One such example shows up in the System.Buffers.Text.Tests.Utf8FormatterTests.FormatterUInt64(value: 0) benchmark when PGO is enabled (see #84264 (comment)):

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0    // narrow store (struct init)
       mov      eax, dword ptr [rsp+48H]  // wide load
       or       al, byte  ptr [rsp+49H]
       jne      G_M000_IG14
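To make the overlap concrete, here is a minimal C++ sketch of the access pattern, not JIT code. The `Slot` struct and the `narrowStore`, `wideLoad`, and `narrowLoad` helpers are hypothetical names invented for illustration. The point it shows is that the 4-byte load needs two bytes the 2-byte store never wrote, so the store buffer entry alone cannot satisfy the load; a byte-sized load sits entirely inside the stored range.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the stack slot at [rsp+48H]: two byte-sized
// fields followed by two bytes that the narrow store does not touch.
struct Slot {
    unsigned char bytes[4];
};

// Narrow store: writes 2 bytes, like `mov word ptr [rsp+48H], 0`.
inline void narrowStore(Slot& s, std::uint16_t value) {
    std::memcpy(s.bytes, &value, sizeof(value));
}

// Wide load: reads 4 bytes, like `mov eax, dword ptr [rsp+48H]`.
// Two of those bytes come from the preceding store, two from older memory.
inline std::uint32_t wideLoad(const Slot& s) {
    std::uint32_t v;
    std::memcpy(&v, s.bytes, sizeof(v));
    return v;
}

// Narrow load, zero-extended, like `movzx eax, byte ptr [rsp+48H]`.
// The single byte read lies fully within the 2 bytes just stored.
inline std::uint32_t narrowLoad(const Slot& s, int index) {
    assert(index >= 0 && index < 4);
    return s.bytes[index];
}
```

This only models which bytes each access touches. A C++ compiler is free to keep everything in registers, so the sketch is not a way to measure the stall itself.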

This transformation comes about in morph, as part of fgMorphCastedBitwiseOp:

fgMorphTree BB06, STMT00014 (before)
               [000065] --C--------                         *  JTRUE     void  
               [000064] --C--------                         \--*  EQ        int   
               [000161] -----------                            +--*  EQ        int   
               [000159] -----------                            |  +--*  OR        int   
               [000155] -----------                            |  |  +--*  LCL_VAR   ubyte (AX) V58 tmp54        
               [000158] -----------                            |  |  \--*  LCL_VAR   ubyte (AX) V59 tmp55        
               [000160] -----------                            |  \--*  CNS_INT   int    0
               [000063] -----------                            \--*  CNS_INT   int    0

fgMorphTree BB06, STMT00014 (after)
               [000065] ----G+-----                         *  JTRUE     void  
               [000161] J---G+-N---                         \--*  NE        int   
               [000624] ----G+-----                            +--*  CAST      int <- ubyte <- int
               [000159] ----G------                            |  \--*  OR        int   
               [000155] ----G+-----                            |     +--*  LCL_VAR   int   (AX) V58 tmp54        
               [000158] ----G+-----                            |     \--*  LCL_VAR   int   (AX) V59 tmp55        
               [000160] -----+-----                            \--*  CNS_INT   int    0
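As I read the dump, the rewrite relies on the identity that OR-ing two values narrowed to ubyte gives the same result as OR-ing them at int width and narrowing once. The C++ sketch below checks that identity; the function names are hypothetical and nothing here is taken from the JIT sources. It says nothing about the cost of the wider memory reads, which is the actual problem in this issue.

```cpp
#include <cstdint>

// Shape before morph: each operand is normalized to ubyte, then OR'd.
constexpr std::uint32_t orOfNarrowed(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint8_t>(a) | static_cast<std::uint8_t>(b);
}

// Shape after morph: operands are read at int width, OR'd, then cast once.
constexpr std::uint32_t narrowedOr(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint8_t>(a | b);
}

// Checks that both shapes agree for every pair of 10-bit inputs, so bits
// above the low byte are exercised as well.
inline bool shapesAgree() {
    for (std::uint32_t a = 0; a < 0x400; ++a) {
        for (std::uint32_t b = 0; b < 0x400; ++b) {
            if (orOfNarrowed(a, b) != narrowedOr(a, b)) {
                return false;
            }
        }
    }
    return true;
}
```

The two shapes compute the same value, which is why the transformation is legal. The difference is that the second shape lets the operands be loaded as full ints from memory.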

In the case cited above this causes a roughly 3x slowdown in the test, and a point fix that disables this for normalize on load locals recovers the missing perf.

This particular pattern only arises with PGO, as most of this method is cold and the struct backing V58/V59 is left exposed by some calls in cold blocks that don't get inlined. But the same thing could happen even without PGO.

G_M000_IG05:                ;; offset=0035H
       mov      word  ptr [rsp+48H], 0
       movzx    rax, byte  ptr [rsp+48H]   // narrow load, extended
       movzx    rcx, byte  ptr [rsp+49H]
       or       eax, ecx
       jne      G_M000_IG14

I played around with a broader fix, modifying xarch's genCodeForLclVar to do narrower loads for normalize-on-load locals, and that led to quite a few diffs. At least in my limited checking, I only saw diffs in Tier0 code. That suggests we don't run into this all that often in optimized code, but it can happen.

cc @dotnet/jit-contrib

Labels

area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
