This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Remove AVX/SSE transition penalties #8588

Merged
merged 9 commits into dotnet:master on Jan 12, 2017

Conversation

litian2025

There are two kinds of transition penalties:
1. Transition from 256-bit AVX code to 128-bit legacy SSE code.
2. Transition from 128-bit legacy SSE code to either 128-bit or 256-bit AVX code. This only happens if there was a preceding AVX-256 -> legacy SSE transition penalty.

The primary goal is to remove the AVX to SSE transition penalty.

Added two emitter flags: contains256bitAVXInstruction indicates that the JIT method contains 256-bit AVX code; containsAVXInstruction indicates that the method contains 128-bit or 256-bit AVX code.

Issue VZEROUPPER in the prolog if the method contains 128-bit or 256-bit AVX code, to avoid the legacy SSE to AVX transition penalty; this can happen in the reverse P/Invoke situation. Issue VZEROUPPER in the epilog if the method contains 256-bit AVX code, to avoid the AVX to legacy SSE transition penalty.

To limit the code size increase, we only issue VZEROUPPER before a PInvoke call to a user-defined function if the JIT method contains 256-bit AVX code, assuming the user-defined function contains legacy SSE code. There is no need to issue VZEROUPPER after the PInvoke call, because the SSE to AVX transition penalty cannot occur once the AVX to SSE transition has been taken care of before the PInvoke call.

We measured a ~1% to 3% performance gain on TechEmpower plaintext and verified that the VTune AVX/SSE events OTHER_ASSISTS.AVX_TO_SSE and OTHER_ASSISTS.SSE_TO_AVX have been reduced to 0.

JITUtils reports no code size increase on Linux for Scenario 1 (running the mscorlib and frameworks diffs using just the assemblies made available by jitutils) and Scenario 2 (running mscorlib, frameworks, and test assets diffs using the resources generated for a CoreCLR test run). Further analysis of the dasm code verified the report: no AVX instructions are present in the dasm code, thus no additional VZEROUPPER instructions are inserted.

Fix #7240

@litian2025
Author

@CarolEidt PTAL

instGen(INS_vzeroupper);
bVzeroupperIssued = true;
}
#endif


I'm wondering why we would have the case where we have no AVX instructions emitted (i.e. we didn't issue the vzeroupper above), but we have an AVX copy instruction. That is, if we are using AVX encodings and have no AVX instructions issued, then how can we be using callee save floating point registers?

Author

If we have an AVX copy instruction such as vmovupd, it is a 128-bit AVX instruction; the vzeroupper instruction needs to be issued before the AVX copy instruction (vmovupd, etc.) to avoid the SSE->AVX transition penalty.

// reset both AVX and 256bit AVX flags for the function before return
getEmitter()->SetContainsAVX(false);
getEmitter()->SetContains256bitAVX(false);
#endif
return;


Rather than duplicate this code, perhaps this could be factored out into a small method, e.g. CodeGen::genVZeroUpperIfNeeded.
Also, why are the flags reset before returning?

Author

Yes, will refactor it into a small method. Those flags do not need to be reset since they have already been reset at Compiler::compSetProcessor().

@@ -2315,6 +2315,8 @@ void emitter::emitIns(instruction ins)
id->idInsFmt(fmt);
id->idCodeSize(sz);

emitAttr attr = id->idOpSize();
CheckAndSetContainsAVX(ins, attr);
dispIns(id);


Adding this code in all these methods seems unnecessary. I believe there are methods that are always called to encode an AVX instruction, and we should attempt to localize this checking there.

Author

Will move these to Lower.

if (call->IsPInvoke() && call->gtCallType == CT_USER_FUNC && compiler->getFloatingPointInstructionSet() == InstructionSet_AVX)
{
if (getEmitter()->Contains256bitAVX())
{


Note that if this method is called in a loop, and there are 256-bit AVX instructions emitted after this call, the vzeroupper won't be emitted.

Member

Another way to detect whether the method has any 256-bit AVX instructions, or any 128/256-bit AVX instructions, is to check whether any of the GenTree nodes are of SIMD vector types during Lower's register specification (lowerxarch.cpp:TreeNodeInfoInit()).

Contains256bitAVXInstr = (getSIMDInstructionSet() == InstructionSet_AVX) and the method has TYP_SIMD32 GenTree nodes, as indicated by Lower's register specification.
ContainsAVXInstr = (getSIMDInstructionSet() == InstructionSet_AVX) and the method has TYP_SIMD* GenTree nodes, as indicated by Lower's register specification.

This info could be used by codegen to emit vzeroupper as required.

Member

Note that having Lower's register specification detect the usage of AVX 128/256-bit instructions also solves the problem that Carol pointed out above, because Lower's register specification runs before the Codegen phase.

The method that can set a flag to indicate whether AVX 128/256-bit instructions will be used is lowerxarch.cpp!Lowering::TreeNodeInfoInitSIMD()

Author

Thanks a lot for the inputs, I will give it a try and update the PR.


@sivarv great suggestions!

@CarolEidt

@sivarv PTAL
cc @dotnet/jit-contrib

//
// Params
// is256bitAVX - Flag to check if the function contains a 256-bit AVX instruction and generate the vzeroupper
//               instruction; otherwise check if the function contains an AVX instruction (either 128-bit or 256-bit).


This parameter name is a little confusing. It might be better to call this check256bitOnly to make it clear that this is called when we only need the vzeroupper if we've used 256-bit AVX.

Author

Good suggestion, will change it accordingly

@@ -10582,6 +10583,18 @@ void CodeGen::genPreserveCalleeSavedFltRegs(unsigned lclFrameSize)
regMaskTP regBit = genRegMask(reg);
if ((regBit & regMask) != 0)
{
#ifdef FEATURE_AVX_SUPPORT
// when we reach here, function does not contains AVX instruction so far, however, since copyIns can


Typo: Should be "does not contain"

Author

Thanks, will change it.

}
else if (simdTree->gtSIMDSize == 16)
{
comp->getEmitter()->SetContainsAVX(true);


Suggestion: this could be factored out as it is always done. Also, if we have any SIMD node, then we will be using AVX-encoded instructions. Note that the gtSIMDSize could be 8 or 12 for a fixed-size vector type. I think that you should unconditionally call SetContainsAVX(true) and then only call SetContains256bitAVX(true) for the 32 byte size.

Author

Thanks a lot for the suggestion, will make the change and update the PR.


move setContainsAVX flags to lower, refactor to a smaller method

refactor, fix typo in comments

fix format error
@litian2025
Author

I updated the PR to address the feedback, rebased on master, resolved conflicts, and squashed the commits into 1. @CarolEidt please let me know what else needs to be done, thanks.

@CarolEidt

Looks good to me. I would like @sivarv to have a look before we merge. Thanks!

@@ -4573,6 +4579,32 @@ void Lowering::SetMulOpCounts(GenTreePtr tree)
}

//------------------------------------------------------------------------------
// SetContainsAVXFlags: default value of isFloatingType is true, we set the
// ContainsAVX flag when floating type value is true, when SIMD vector size is
// 32 bytes, it is 256bit AVX instruction and we set Contains256bitAVX flag too
Member

@sivarv sivarv Jan 9, 2017

I found it a bit difficult to parse this comment block. I would suggest simplifying it.

Author

Will simplify it for something like: "Set ContainsAVX flag when it is floating type, set Contains256bitAVX flag when SIMD vector size is 32 bytes"

// isFloatingType - is floating type
// sizeOfSIMDVector - SIMD Vector size
//
void Lowering::SetContainsAVXFlags(bool isFloatingType, unsigned sizeOfSIMDVector)
Member

Suggestion: We can indicate the default values of params as follows:

void Lowering::SetContainsAVXFlags(bool isFloatingType /* = true */, unsigned sizeOfSIMDVector /* = 0 */)

Author

Fixed

@@ -1954,6 +1957,8 @@ void Lowering::TreeNodeInfoInitBlockStore(GenTreeBlk* blkNode)
// series of 16-byte loads and stores.
blkNode->gtLsraInfo.internalFloatCount = 1;
blkNode->gtLsraInfo.addInternalCandidates(l, l->internalFloatRegCandidates());
// use XMM register for load and store, need set the flag for AVX instruction
Member

// use XMM register for load and store, need set the flag for AVX instruction

I would suggest to rephrase this comment as

"Uses XMM reg for load and store and hence check to see whether AVX instructions are used for codegen"

Author

Will change it

instGen(INS_vzeroupper);
bVzeroupperIssued = true;
}
#endif
Member

@sivarv sivarv Jan 9, 2017

Windows xarch has callee-saved float regs, whereas Unix x64 has no callee-saved float regs.

On Windows, we would be saving callee-saved float regs only if the method has floating point or SIMD codegen. In such a case the lowerxarch.cpp logic will set the required AVX flags on the emitter, so why do we need this logic here?

In other words, what is the case in which AVX flags are not set but method is attempting to save callee saved float regs to stack in prolog?

Author

Fixed, given lower has set the AVX flags already, this can be removed. Also will change genVzeroupperIfNeeded() return value to void since we don't need to check the return value any more.

// isFloatingType - true if it is floating type
// sizeOfSIMDVector - SIMD Vector size
//
void Lowering::SetContainsAVXFlags(bool isFloatingType /* = true */, unsigned sizeOfSIMDVector /* = 0*/)
Member

isFloatingType

Please rename this as isFloatingPointType.
(IsFloatingType sounds a bit strange!)

Author

OK

// if the function contains 256bit AVX instructions, this is to avoid AVX-256 to Legacy SSE
// transition penalty, assuming the user function contains legacy SSE instruction
if (call->IsPInvoke() && call->gtCallType == CT_USER_FUNC &&
compiler->getFloatingPointInstructionSet() == InstructionSet_AVX)
Member

getFloatingPointInstructionSet

To be accurate, shouldn't this be getSIMDInstructionSet() instead?

Member

We don't need to check the instruction set here, because the emitter flag is only set when the instruction set is AVX. Instead it would suffice to assert here, as follows:

if (call->IsPInvoke() && (call->gtCallType == CT_USER_FUNC) && getEmitter()->Contains256bitAVX())
{
    assert(compiler->getFloatingPointInstructionSet() == InstructionSet_AVX);
    instGen(INS_vzeroupper);
}



Author

Fixed

@@ -10583,7 +10583,8 @@ GenTreePtr CodeGen::genMakeConst(const void* cnsAddr, var_types cnsType, GenTree
// funclet frames: this will be FuncletInfo.fiSpDelta.
void CodeGen::genPreserveCalleeSavedFltRegs(unsigned lclFrameSize)
{
regMaskTP regMask = compiler->compCalleeFPRegsSavedMask;
genVzeroupperIfNeeded(false);
Member

genVzeroupperIfNeeded(false);

Question:

AFAIK from Intel manual one will incur transition penalty from legacy SSE (scalar or 128-bit) to AVX (128 or 256-bit) or vice versa.

Will there be a transition penalty on transitioning from legacy SSE (scalar or 128-bit) into scalar AVX or vice versa?

If the answer is No: then we don't need to track whether a method contains AVX instructions; rather, tracking whether it contains 128/256-bit instructions alone would be sufficient. In such a case we are unnecessarily issuing vzeroupper in some cases in the prolog of the method.

If the answer is Yes: Before a user PInvoke call don't we need to issue a vzeroupper (in codegenxarch.cpp)? Right now that logic is issuing a vzeroupper only if the method has 256-bit AVX instructions.

Author

The transition penalty happens on transition from 256-bit AVX to legacy SSE (case 1) and from legacy SSE to 128-bit or 256-bit AVX (case 2); case 2 only happens if there was a preceding AVX-256 -> legacy SSE transition penalty (case 1). So for the first part, the answer is Yes: there is a transition penalty from legacy SSE to AVX (either 128-bit or 256-bit), and since RyuJIT generates 128-bit and/or 256-bit AVX instructions, we insert vzeroupper in the prolog to avoid it. For the second part, the answer is Yes, but the penalty only occurs on 256-bit AVX to legacy SSE; there is no transition penalty from 128-bit AVX to legacy SSE, which is why the logic only issues a vzeroupper if the method has 256-bit AVX instructions.

Member

Consider a managed method that uses scalar floating point operations but no SIMD (128- or 256-bit), running on AVX hardware. The JIT would produce scalar floating point AVX instructions for the method. In the method prolog we generate vzeroupper, because the logic in lowerxarch sets a flag on the emitter indicating that the method uses AVX instructions (but no 256-bit).

One reason I can think of is reverse-P/Invoking such a managed method. The native code could be using legacy SSE (scalar or 128-bit), and calling into the managed method could result in a legacy SSE to scalar AVX transition. Can you please confirm that this will incur a transition penalty? It is not clear to me what the Intel manual says in this regard.



comp->codeGen->getEmitter()->SetContains256bitAVX(true);
}
}
}
Member

To be accurate I would suggest to write it as follows

if (isFloatingPointType)
{
    if (comp->getFloatingPointInstructionSet() == InstructionSet_AVX)
    {
        comp->getEmitter()->SetContainsAVX(true);
    }

    if (sizeOfSIMDVector == 32 && comp->getSIMDInstructionSet() == InstructionSet_AVX)
    {
        comp->getEmitter()->SetContains256bitAVX(true);
    }
}

Author

Changed

{
#ifdef FEATURE_AVX_SUPPORT
if (compiler->getSIMDInstructionSet() == InstructionSet_AVX)
{
Member

Here too we don't need to check for instruction set because lower is making that check before setting flags on emitter. Instead we should assert.

bool emitVzeroUpper;
if (check256bitOnly)
{
    emitVzeroUpper = getEmitter()->Contains256bitAVX();
}
else
{
    emitVzeroUpper = getEmitter()->ContainsAVX();
}

if (emitVzeroUpper)
{
    assert(compiler->getSIMDInstructionSet() == InstructionSet_AVX);
    instGen(INS_vzeroupper);
}

Author

changed accordingly.

@sivarv
Member

sivarv commented Jan 12, 2017

Looks good.

@sivarv sivarv merged commit d187267 into dotnet:master Jan 12, 2017
@litian2025 litian2025 deleted the AVX_SSE branch January 12, 2017 16:16
@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
Remove AVX/SSE transition penalties

Commit migrated from dotnet/coreclr@d187267