This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Remove AVX/SSE transition penalties #8588

Merged
merged 9 commits into dotnet:master on Jan 12, 2017

Conversation

litian2025

There are two kinds of transition penalties:
1. Transition from 256-bit AVX code to 128-bit legacy SSE code.
2. Transition from 128-bit legacy SSE code to either 128-bit or 256-bit AVX code. This only happens if there was a preceding AVX-256 -> legacy SSE transition penalty.

The primary goal is to remove the AVX to SSE transition penalty.

Added two emitter flags: contains256bitAVXInstruction indicates that the JIT method contains 256-bit AVX code; containsAVXInstruction indicates that the method contains 128-bit or 256-bit AVX code.

Issue VZEROUPPER in the prolog if the method contains 128-bit or 256-bit AVX code, to avoid the legacy SSE to AVX transition penalty; this can happen in the reverse P/Invoke situation. Issue VZEROUPPER in the epilog if the method contains 256-bit AVX code, to avoid the AVX to legacy SSE transition penalty.

To limit the code size increase, we only issue VZEROUPPER before a PInvoke call to a user-defined function if the JIT method contains 256-bit AVX code, assuming the user-defined function contains legacy SSE code. There is no need to issue VZEROUPPER after the PInvoke call, because the SSE to AVX transition penalty cannot occur once the AVX to SSE transition has been taken care of before the PInvoke call.

We measured a ~1% to 3% performance gain on TechEmpower plaintext and verified that the VTune AVX/SSE events OTHER_ASSISTS.AVX_TO_SSE and OTHER_ASSISTS.SSE_TO_AVX have been reduced to 0.

JITUtils reports no code size increase on Linux for Scenario 1 (running the mscorlib and frameworks diffs using just the assemblies made available by jitutils) and Scenario 2 (running mscorlib, frameworks, and test assets diffs using the resources generated for a CoreCLR test run). Further analysis of the dasm code verified the report: no AVX instructions are present in the dasm code, thus no additional VZEROUPPER instructions are inserted.

Fix #7240

@litian2025
Author

@CarolEidt PTAL

instGen(INS_vzeroupper);
bVzeroupperIssued = true;
}
#endif


I'm wondering why we would have the case where we have no AVX instructions emitted (i.e. we didn't issue the vzeroupper above), but we have an AVX copy instruction. That is, if we are using AVX encodings and have no AVX instructions issued, then how can we be using callee save floating point registers?

Author

If we have an AVX copy instruction such as vmovupd, it is a 128-bit AVX instruction; the vzeroupper instruction needs to be issued before the AVX copy instruction (vmovupd, etc.) to avoid the SSE->AVX transition penalty.

// reset both AVX and 256bit AVX flags for the function before return
getEmitter()->SetContainsAVX(false);
getEmitter()->SetContains256bitAVX(false);
#endif
return;


Rather than duplicate this code, perhaps this could be factored out into a small method, e.g. CodeGen::genVZeroUpperIfNeeded.
Also, why are the flags reset before returning?

Author

Yes, will refactor it into a small method. Those flags do not need to be reset since they have already been reset at Compiler::compSetProcessor().

@@ -2315,6 +2315,8 @@ void emitter::emitIns(instruction ins)
id->idInsFmt(fmt);
id->idCodeSize(sz);

emitAttr attr = id->idOpSize();
CheckAndSetContainsAVX(ins, attr);
dispIns(id);


Adding this code in all these methods seems unnecessary. I believe there are methods that are always called to encode an AVX instruction, and we should attempt to localize this checking there.

Author

Will move these to Lower.

if (call->IsPInvoke() && call->gtCallType == CT_USER_FUNC && compiler->getFloatingPointInstructionSet() == InstructionSet_AVX)
{
if (getEmitter()->Contains256bitAVX())
{


Note that if this method is called in a loop, and there are 256-bit AVX instructions emitted after this call, the vzeroupper won't be emitted.

Member

Another way to detect whether the method has any 256-bit AVX instructions, or any 128/256-bit AVX instructions, is to check whether any of the GenTree nodes are of SIMD vector types during Lower's register specification (lowerxarch.cpp:TreeNodeInfoInit()).

Contains256bitAVXInstr = (getSIMDInstructionSet() == InstructionSet_AVX) and the method has TYP_SIMD32 GenTree nodes, as indicated by Lower's register specification.
ContainsAVXInstr = (getSIMDInstructionSet() == InstructionSet_AVX) and the method has TYP_SIMD* GenTree nodes, as indicated by Lower's register specification.

This info could be used by codegen to emit vzeroupper as required.

Member

Note that having Lower's register specification detect the usage of AVX 128/256-bit instructions also solves the problem that Carol pointed out above, because Lower's register specification runs before the Codegen phase.

The method that can set a flag to indicate whether AVX 128/256-bit instructions will be used is lowerxarch.cpp!Lowering::TreeNodeInfoInitSIMD()

Author

Thanks a lot for the inputs, I will give it a try and update the PR.


@sivarv great suggestions!

@CarolEidt

@sivarv PTAL
cc @dotnet/jit-contrib

//
// Params
// is256bitAVX - Flag to check if the function contains a 256-bit AVX instruction and generate the vzeroupper
//               instruction; otherwise check if the function contains an AVX instruction (either 128-bit or 256-bit).


This parameter name is a little confusing. It might be better to call this check256bitOnly to make it clear that this is called when we only need the vzeroupper if we've used 256-bit AVX.

Author

Good suggestion, will change it accordingly

@@ -10582,6 +10583,18 @@ void CodeGen::genPreserveCalleeSavedFltRegs(unsigned lclFrameSize)
regMaskTP regBit = genRegMask(reg);
if ((regBit & regMask) != 0)
{
#ifdef FEATURE_AVX_SUPPORT
// when we reach here, function does not contains AVX instruction so far, however, since copyIns can


Typo: Should be "does not contain"

Author

Thanks, will change it.

}
else if (simdTree->gtSIMDSize == 16)
{
comp->getEmitter()->SetContainsAVX(true);


Suggestion: this could be factored out as it is always done. Also, if we have any SIMD node, then we will be using AVX-encoded instructions. Note that the gtSIMDSize could be 8 or 12 for a fixed-size vector type. I think that you should unconditionally call SetContainsAVX(true) and then only call SetContains256bitAVX(true) for the 32 byte size.

Author

Thanks a lot for the suggestion, will make the change and update the PR.


move setContainsAVX flags to lower, refactor to a smaller method

refactor, fix typo in comments

fix format error
@litian2025
Author

I updated the PR to address the feedback, rebased on master, resolved conflicts, and squashed the commits into 1. @CarolEidt please let me know what else needs to be done, thanks.

@CarolEidt

Looks good to me. I would like @sivarv to have a look before we merge. Thanks!

@@ -4573,6 +4579,32 @@ void Lowering::SetMulOpCounts(GenTreePtr tree)
}

//------------------------------------------------------------------------------
// SetContainsAVXFlags: default value of isFloatingType is true, we set the
// ContainsAVX flag when floating type value is true, when SIMD vector size is
// 32 bytes, it is 256bit AVX instruction and we set Contains256bitAVX flag too
Member

@sivarv sivarv Jan 9, 2017

I found it a bit difficult to parse this comment block. I would suggest simplifying it.

Author

Will simplify it for something like: "Set ContainsAVX flag when it is floating type, set Contains256bitAVX flag when SIMD vector size is 32 bytes"

// isFloatingType - is floating type
// sizeOfSIMDVector - SIMD Vector size
//
void Lowering::SetContainsAVXFlags(bool isFloatingType, unsigned sizeOfSIMDVector)
Member

Suggestion: We can indicate the default values of params as follows:

void Lowering::SetContainsAVXFlags(bool isFloatingType /* = true */, unsigned sizeOfSIMDVector /* = 0 */)

Author

Fixed

@@ -1954,6 +1957,8 @@ void Lowering::TreeNodeInfoInitBlockStore(GenTreeBlk* blkNode)
// series of 16-byte loads and stores.
blkNode->gtLsraInfo.internalFloatCount = 1;
blkNode->gtLsraInfo.addInternalCandidates(l, l->internalFloatRegCandidates());
// use XMM register for load and store, need set the flag for AVX instruction
Member

// use XMM register for load and store, need set the flag for AVX instruction

I would suggest to rephrase this comment as

"Uses XMM reg for load and store and hence check to see whether AVX instructions are used for codegen"

Author

Will change it

instGen(INS_vzeroupper);
bVzeroupperIssued = true;
}
#endif
Member

@sivarv sivarv Jan 9, 2017

Windows xarch has callee-saved float regs, whereas Unix x64 has no callee-saved float regs.

On Windows, we would be saving callee-saved float regs only if the method has floating point or SIMD codegen. In such a case the lowerxarch.cpp logic will set the required AVX flags on the emitter, so why do we need this logic here?

In other words, what is the case in which AVX flags are not set but method is attempting to save callee saved float regs to stack in prolog?

Author

Fixed, given lower has set the AVX flags already, this can be removed. Also will change genVzeroupperIfNeeded() return value to void since we don't need to check the return value any more.

// isFloatingType - true if it is floating type
// sizeOfSIMDVector - SIMD Vector size
//
void Lowering::SetContainsAVXFlags(bool isFloatingType /* = true */, unsigned sizeOfSIMDVector /* = 0*/)
Member

isFloatingType

Please rename this as isFloatingPointType.
(IsFloatingType sounds a bit strange!)

Author

OK

// if the function contains 256bit AVX instructions, this is to avoid AVX-256 to Legacy SSE
// transition penalty, assuming the user function contains legacy SSE instruction
if (call->IsPInvoke() && call->gtCallType == CT_USER_FUNC &&
compiler->getFloatingPointInstructionSet() == InstructionSet_AVX)
Member

getFloatingPointInstructionSet

To be accurate, shouldn't this be getSIMDInstructionSet() instead?

Member

We don't need to check the instruction set here, because the emitter flag is only set when the instruction set is AVX. Instead it would suffice to assert here, as follows:

if (call->IsPInvoke() && (call->gtCallType == CT_USER_FUNC) && getEmitter()->Contains256bitAVX())
{
    assert(compiler->getFloatingPointInstructionSet() == InstructionSet_AVX);
    instGen(INS_vzeroupper);
}



Author

Fixed

@@ -10583,7 +10583,8 @@ GenTreePtr CodeGen::genMakeConst(const void* cnsAddr, var_types cnsType, GenTree
// funclet frames: this will be FuncletInfo.fiSpDelta.
void CodeGen::genPreserveCalleeSavedFltRegs(unsigned lclFrameSize)
{
regMaskTP regMask = compiler->compCalleeFPRegsSavedMask;
genVzeroupperIfNeeded(false);
Member

genVzeroupperIfNeeded(false);

Question:

AFAIK from Intel manual one will incur transition penalty from legacy SSE (scalar or 128-bit) to AVX (128 or 256-bit) or vice versa.

Will there be a transition penalty on transitioning from legacy SSE (scalar or 128-bit) into scalar AVX or vice versa?

If the answer is No: then we don't need to track whether a method contains AVX instructions; rather, tracking whether it contains 128/256-bit instructions alone would be sufficient. In such a case we are unnecessarily issuing vzeroupper in some cases in the prolog of the method.

If the answer is Yes: Before a user PInvoke call don't we need to issue a vzeroupper (in codegenxarch.cpp)? Right now that logic is issuing a vzeroupper only if the method has 256-bit AVX instructions.

Author

The transition penalty happens on transition from 256-bit AVX to legacy SSE (case 1) and from legacy SSE to 128-bit or 256-bit AVX (case 2); case 2 only happens if there was a preceding AVX-256 -> legacy SSE transition penalty (case 1). So for the first part, the answer is Yes: there is a transition penalty from legacy SSE to AVX (either 128-bit or 256-bit), and since RyuJIT generates 128-bit and/or 256-bit AVX instructions, we insert vzeroupper in the prolog to avoid it. For the second part, the answer is Yes, but the penalty only occurs on 256-bit AVX to legacy SSE; there is no transition penalty from 128-bit AVX to legacy SSE, which is why the logic only issues a vzeroupper if the method has 256-bit AVX instructions.

Member

Consider a managed method that uses scalar floating point operations but no SIMD (128- or 256-bit), running on AVX hardware. The JIT would produce scalar floating point AVX instructions for the method. In the method prolog we generate vzeroupper, because the logic in lowerxarch sets a flag on the emitter indicating that the method uses AVX instructions (but no 256-bit).

One reason I can think of is reverse-P/Invoking such a managed method. The native code could be using legacy SSE (scalar or 128-bit), and calling into the managed method could result in a legacy SSE to scalar AVX transition. Can you please confirm that this will incur a transition penalty? It is not clear to me what the Intel manual says in this regard.



comp->codeGen->getEmitter()->SetContains256bitAVX(true);
}
}
}
Member

To be accurate I would suggest to write it as follows

if (isFloatingPointType)
{
    if (comp->getFloatingPointInstructionSet() == InstructionSet_AVX)
    {
        comp->getEmitter()->SetContainsAVX(true);
    }

    if (sizeOfSIMDVector == 32 && comp->getSIMDInstructionSet() == InstructionSet_AVX)
    {
        comp->getEmitter()->SetContains256bitAVX(true);
    }
}

Author

Changed

{
#ifdef FEATURE_AVX_SUPPORT
if (compiler->getSIMDInstructionSet() == InstructionSet_AVX)
{
Member

Here too we don't need to check for instruction set because lower is making that check before setting flags on emitter. Instead we should assert.

bool emitVzeroUpper;
if (check256bitOnly)
{
    emitVzeroUpper = getEmitter()->Contains256bitAVX();
}
else
{
    emitVzeroUpper = getEmitter()->ContainsAVX();
}

if (emitVzeroUpper)
{
    assert(compiler->getSIMDInstructionSet() == InstructionSet_AVX);
    instGen(INS_vzeroupper);
}

Author

changed accordingly.

@sivarv
Member

sivarv commented Jan 12, 2017

Looks good.

@sivarv sivarv merged commit d187267 into dotnet:master Jan 12, 2017
@litian2025 litian2025 deleted the AVX_SSE branch January 12, 2017 16:16
@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Remove AVX/SSE transition penalties

Commit migrated from dotnet/coreclr@d187267