@CarolEidt PTAL
instGen(INS_vzeroupper);
bVzeroupperIssued = true;
}
#endif
I'm wondering why we would have the case where we have no AVX instructions emitted (i.e. we didn't issue the vzeroupper above), but we have an AVX copy instruction. That is, if we are using AVX encodings and have no AVX instructions issued, then how can we be using callee save floating point registers?
If we have an AVX copy instruction such as vmovupd, it is a 128-bit AVX instruction; the vzeroupper instruction needs to be issued before the AVX copy instruction (vmovupd etc.) to avoid the SSE->AVX transition penalty.
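For context, a minimal sketch of the guard being discussed, assuming the bVzeroupperIssued flag from the diff above and a copyIns local naming the save instruction; the IsAVXInstruction() check is illustrative and may not match the exact PR code:

#ifdef FEATURE_AVX_SUPPORT
    // If the callee-save copy will be AVX-encoded (e.g. vmovupd), clear the upper YMM
    // state once before it so the 128-bit AVX copy avoids the SSE->AVX penalty.
    if (!bVzeroupperIssued && getEmitter()->IsAVXInstruction(copyIns))
    {
        instGen(INS_vzeroupper);
        bVzeroupperIssued = true;
    }
#endif // FEATURE_AVX_SUPPORT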
// reset both AVX and 256bit AVX flags for the function before return
getEmitter()->SetContainsAVX(false);
getEmitter()->SetContains256bitAVX(false);
#endif
return;
Rather than duplicate this code, perhaps this could be factored out into a small method, e.g. CodeGen::genVZeroUpperIfNeeded.
Also, why are the flags reset before returning?
Yes, will refactor it into a small method. Those flags do not need to be reset since they have already been reset in Compiler::compSetProcessor().
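For reference, a sketch of what such a helper could look like (it mirrors the version this review converges on further below; the check256bitOnly parameter name comes from a later suggestion in this thread):

void CodeGen::genVzeroupperIfNeeded(bool check256bitOnly /* = true */)
{
#ifdef FEATURE_AVX_SUPPORT
    // Emit vzeroupper if the method used 256-bit AVX (epilog/PInvoke case), or any
    // 128/256-bit AVX when check256bitOnly is false (prolog case).
    bool emitVzeroUpper = check256bitOnly ? getEmitter()->Contains256bitAVX()
                                          : getEmitter()->ContainsAVX();
    if (emitVzeroUpper)
    {
        assert(compiler->getSIMDInstructionSet() == InstructionSet_AVX);
        instGen(INS_vzeroupper);
    }
#endif // FEATURE_AVX_SUPPORT
}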
@@ -2315,6 +2315,8 @@ void emitter::emitIns(instruction ins)
id->idInsFmt(fmt);
id->idCodeSize(sz);

emitAttr attr = id->idOpSize();
CheckAndSetContainsAVX(ins, attr);
dispIns(id);
Adding this code in all these methods seems unnecessary. I believe there are methods that are always called to encode an AVX instruction, and we should attempt to localize this checking there.
Move these to Lower.
if (call->IsPInvoke() && call->gtCallType == CT_USER_FUNC && compiler->getFloatingPointInstructionSet() == InstructionSet_AVX)
{
    if (getEmitter()->Contains256bitAVX())
    {
Note that if this method is called in a loop, and there are 256-bit AVX instructions emitted after this call, the vzeroupper won't be emitted.
Another way to detect whether the method has any 256-bit AVX instructions, or any 128/256-bit AVX instructions, is to check whether any GenTree nodes are of SIMD vector types during Lower register specification (lowerxarch.cpp:TreeNodeInfoInit()).
Contains256BitAvxInstr = (getSIMDInstructionSet() == InstructionSet_AVX) and the method has TYP_SIMD32 GenTree nodes, as indicated by Lower register specification.
ContainsAVXInstr = (getSIMDInstructionSet() == InstructionSet_AVX) and the method has TYP_SIMD* GenTree nodes, as indicated by Lower register specification.
This info could be used by codegen to emit vzeroupper as required.
Note that having Lower register specification detect the usage of AVX 128/256-bit instructions also solves the problem that Carol pointed out above, because Lower register specification runs before the Codegen phase.
The method that can set a flag to indicate whether AVX 128/256-bit instructions will be used is lowerxarch.cpp!Lowering::TreeNodeInfoInitSIMD().
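As a rough sketch of that approach (the exact signature of TreeNodeInfoInitSIMD and the SetContainsAVXFlags helper shown here follow the shape this PR later adopts; details are assumptions, not the final diff):

void Lowering::TreeNodeInfoInitSIMD(GenTreePtr tree)
{
    GenTreeSIMD* simdTree = tree->AsSIMD();

    // Record on the emitter that AVX codegen will follow for this node; the helper
    // also sets the 256-bit flag when the vector is 32 bytes wide.
    SetContainsAVXFlags(true, simdTree->gtSIMDSize);

    // ... existing register-requirement setup for the SIMD intrinsic ...
}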
Thanks a lot for the inputs, I will give it a try and update the PR.
@sivarv great suggestions!
@sivarv PTAL
//
// Params
// is256bitAVX - Flag to check if the function contains 256-bit AVX instruction and generate Vzeroupper
// instruction, otherwise check if the function contains AVX instruciton (either 128-bit or 256-bit).
This parameter name is a little confusing. It might be better to call this check256bitOnly
to make it clear that this is called when we only need the vzeroupper if we've used 256-bit AVX.
Good suggestion, will change it accordingly
@@ -10582,6 +10583,18 @@ void CodeGen::genPreserveCalleeSavedFltRegs(unsigned lclFrameSize)
regMaskTP regBit = genRegMask(reg);
if ((regBit & regMask) != 0)
{
#ifdef FEATURE_AVX_SUPPORT
// when we reach here, function does not contains AVX instruction so far, however, since copyIns can
Typo: Should be "does not contain"
Thanks, will change it.
}
else if (simdTree->gtSIMDSize == 16)
{
    comp->getEmitter()->SetContainsAVX(true);
Suggestion: this could be factored out as it is always done. Also, if we have any SIMD node, then we will be using AVX-encoded instructions. Note that the gtSIMDSize could be 8 or 12 for a fixed-size vector type. I think that you should unconditionally call SetContainsAVX(true)
and then only call SetContains256bitAVX(true)
for the 32 byte size.
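In code, the shape being suggested is roughly the following, using the emitter setters added by this PR (a sketch, not the final diff): any SIMD node on AVX hardware uses AVX encodings, so ContainsAVX is set unconditionally and Contains256bitAVX is reserved for 32-byte vectors.

// Any SIMD node implies AVX-encoded instructions on AVX hardware ...
comp->getEmitter()->SetContainsAVX(true);

// ... and only 32-byte vectors imply 256-bit AVX (gtSIMDSize can also be 8 or 12).
if (simdTree->gtSIMDSize == 32)
{
    comp->getEmitter()->SetContains256bitAVX(true);
}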
Thanks a lot for the suggestion, will make the change and update the PR.
There are two kinds of transition penalties: 1. Transition from 256-bit AVX code to 128-bit legacy SSE code. 2. Transition from 128-bit legacy SSE code to either 128- or 256-bit AVX code. This only happens if there was a preceding AVX256->legacy SSE transition penalty. The primary goal is to remove the #1 AVX to SSE transition penalty.

Added two emitter flags: contains256bitAVXInstruction indicates whether the JIT method contains 256-bit AVX code; containsAVXInstruction indicates whether the method contains 128-bit or 256-bit AVX code.

Issue VZEROUPPER in the prolog if the method contains 128-bit or 256-bit AVX code, to avoid the legacy SSE to AVX transition penalty; this could happen in a reverse PInvoke situation. Issue VZEROUPPER in the epilog if the method contains 256-bit AVX code, to avoid the AVX to legacy SSE transition penalty.

To limit the code size increase, we only issue VZEROUPPER before a PInvoke call to a user-defined function if the JIT method contains 256-bit AVX code, assuming the user-defined function contains legacy SSE code. No need to issue VZEROUPPER after the PInvoke call because the #2 SSE to AVX transition penalty won't happen, since the #1 AVX to SSE transition has been taken care of before the PInvoke call.

We measured ~3% to 1% performance gain on TechEmPower plaintext and verified that the VTune AVX/SSE events OTHER_ASSISTS.AVX_TO_SSE and OTHER_ASSISTS.SSE_TO_AVX have been reduced to 0.

Fix #7240

move setContainsAVX flags to lower, refactor to a smaller method

refactor, fix typo in comments

fix format error
I updated the PR to address the feedback, rebased master, resolved conflicts, and squashed the commits into 1. @CarolEidt please let me know what else needs to be done, thanks.
Looks good to me. I would like @sivarv to have a look before we merge. Thanks!
@@ -4573,6 +4579,32 @@ void Lowering::SetMulOpCounts(GenTreePtr tree)
}

//------------------------------------------------------------------------------
// SetContainsAVXFlags: default value of isFloatingType is true, we set the
// ContainsAVX flag when floating type value is true, when SIMD vector size is
// 32 bytes, it is 256bit AVX instruction and we set Contains256bitAVX flag too
I found it a bit difficult to parse this comment block. I would suggest simplifying it.
Will simplify it to something like: "Set ContainsAVX flag when it is floating type, set Contains256bitAVX flag when SIMD vector size is 32 bytes"
// isFloatingType - is floating type
// sizeOfSIMDVector - SIMD Vector size
//
void Lowering::SetContainsAVXFlags(bool isFloatingType, unsigned sizeOfSIMDVector)
Suggestion: We can indicate the default values of params as follows:

void Lowering::SetContainsAVXFlags(bool isFloatingType /* = true */, unsigned sizeOfSIMDVector /* = 0 */)
Fixed
@@ -1954,6 +1957,8 @@ void Lowering::TreeNodeInfoInitBlockStore(GenTreeBlk* blkNode)
// series of 16-byte loads and stores.
blkNode->gtLsraInfo.internalFloatCount = 1;
blkNode->gtLsraInfo.addInternalCandidates(l, l->internalFloatRegCandidates());
// use XMM register for load and store, need set the flag for AVX instruction
// use XMM register for load and store, need set the flag for AVX instruction
I would suggest rephrasing this comment as:
"Uses XMM reg for load and store and hence check to see whether AVX instructions are used for codegen"
Will change it
instGen(INS_vzeroupper);
bVzeroupperIssued = true;
}
#endif
Windows xarch has callee-saved float regs, whereas Unix x64 has no callee-saved float regs.
On Windows, we would be saving callee-saved float regs only if the method has floating point or SIMD codegen. In such a case the lowerxarch.cpp logic will set the required AVX flags on the emitter, so why do we need this logic here?
In other words, what is the case in which the AVX flags are not set but the method is attempting to save callee-saved float regs to the stack in the prolog?
Fixed; given that Lower has already set the AVX flags, this can be removed. Also, I will change the genVzeroupperIfNeeded() return value to void since we don't need to check the return value any more.
// isFloatingType - true if it is floating type
// sizeOfSIMDVector - SIMD Vector size
//
void Lowering::SetContainsAVXFlags(bool isFloatingType /* = true */, unsigned sizeOfSIMDVector /* = 0*/)
isFloatingType
Please rename this as isFloatingPointType.
(IsFloatingType sounds a bit strange!)
OK
// if the function contains 256bit AVX instructions, this is to avoid AVX-256 to Legacy SSE
// transition penalty, assuming the user function contains legacy SSE instruction
if (call->IsPInvoke() && call->gtCallType == CT_USER_FUNC &&
    compiler->getFloatingPointInstructionSet() == InstructionSet_AVX)
getFloatingPointInstructionSet
To be accurate, shouldn't this be getSIMDInstructionSet() instead?
We don't need to check the instruction set here because the emitter flag is set based on the instruction set. Instead it would suffice to assert here, as follows:
if (call->IsPInvoke() && (call->gtCallType == CT_USER_FUNC) && getEmitter()->Contains256bitAVX())
{
    assert(compiler->getFloatingPointInstructionSet() == InstructionSet_AVX);
    instGen(INS_vzeroupper);
}
Fixed
@@ -10583,7 +10583,8 @@ GenTreePtr CodeGen::genMakeConst(const void* cnsAddr, var_types cnsType, GenTree
// funclet frames: this will be FuncletInfo.fiSpDelta.
void CodeGen::genPreserveCalleeSavedFltRegs(unsigned lclFrameSize)
{
regMaskTP regMask = compiler->compCalleeFPRegsSavedMask;
genVzeroupperIfNeeded(false);
genVzeroupperIfNeeded(false);
Question:
AFAIK from the Intel manual, one will incur a transition penalty from legacy SSE (scalar or 128-bit) to AVX (128- or 256-bit), or vice versa.
Will there be a transition penalty on transitioning from legacy SSE (scalar or 128-bit) into scalar AVX, or vice versa?
If the answer is No: then we don't need to track whether a method contains AVX instructions; rather, whether it contains 128/256-bit instructions alone would be sufficient. In such a case we are unnecessarily issuing vzeroupper in some cases in the prolog of the method.
If the answer is Yes: before a user PInvoke call, don't we need to issue a vzeroupper (in codegenxarch.cpp)? Right now that logic is issuing a vzeroupper only if the method has 256-bit AVX instructions.
The transition penalty happens on the transition from 256-bit AVX to legacy SSE (case 1) and from legacy SSE to 128-bit or 256-bit AVX (case 2); case 2 only happens if there was a preceding AVX-256 -> legacy SSE transition (case 1). So for the first part, the answer is Yes: there is a transition penalty from legacy SSE to AVX (either 128-bit or 256-bit). RyuJIT generates 128-bit and/or 256-bit AVX instructions, so we insert vzeroupper in the prolog to avoid that. For the second part, the answer is Yes, but the penalty only occurs on 256-bit AVX to legacy SSE; there is no transition penalty from 128-bit AVX to legacy SSE, which is why the logic only issues a vzeroupper if the method has 256-bit AVX instructions.
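Put together, the call-site logic this thread converges on is essentially the snippet sivarv suggested above: only case 1 (AVX-256 -> legacy SSE) needs to be prevented at the PInvoke boundary, so the vzeroupper is gated on the 256-bit flag alone (a sketch, assuming the emitter flags from this PR):

// Before a PInvoke call to a user-defined function, clear the upper YMM state only
// when this method actually emitted 256-bit AVX code; 128-bit AVX alone is not penalized.
if (call->IsPInvoke() && (call->gtCallType == CT_USER_FUNC) && getEmitter()->Contains256bitAVX())
{
    assert(compiler->getFloatingPointInstructionSet() == InstructionSet_AVX);
    instGen(INS_vzeroupper);
}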
Consider a managed method that is using scalar floating point operations but no SIMD (128- or 256-bit) and running on AVX hardware. The JIT would produce scalar floating point AVX instructions for the method. In the method prolog we are generating vzeroupper, because the logic in lowerxarch would set a flag on the emitter to indicate that the method uses AVX instructions (but no 256-bit).
One reason I could think of is reverse PInvoking such a managed method. The native code could be using legacy SSE (scalar or 128-bit), and calling into the managed method could result in a legacy SSE to scalar AVX transition. Can you please confirm that this will incur a transition penalty? It is not clear to me what the Intel manual says in this regard.
comp->codeGen->getEmitter()->SetContains256bitAVX(true);
}
}
}
To be accurate, I would suggest writing it as follows:
if (isFloatingPointType)
{
    if (comp->getFloatingPointInstructionSet() == InstructionSet_AVX)
    {
        comp->getEmitter()->SetContainsAVX(true);
    }
    if (sizeOfSIMDVector == 32 && comp->getSIMDInstructionSet() == InstructionSet_AVX)
    {
        comp->codeGen->getEmitter()->SetContains256bitAVX(true);
    }
}
Changed
{
#ifdef FEATURE_AVX_SUPPORT
if (compiler->getSIMDInstructionSet() == InstructionSet_AVX)
{
Here too we don't need to check for the instruction set, because Lower is making that check before setting flags on the emitter. Instead we should assert:
bool emitVzeroUpper;
if (check256bitOnly)
{
    emitVzeroUpper = getEmitter()->Contains256bitAVX();
}
else
{
    emitVzeroUpper = getEmitter()->ContainsAVX();
}

if (emitVzeroUpper)
{
    assert(compiler->getSIMDInstructionSet() == InstructionSet_AVX);
    instGen(INS_vzeroupper);
}
changed accordingly.
Looks good.
Remove AVX/SSE transition penalties
Commit migrated from dotnet/coreclr@d187267
There are two kinds of transition penalties:
1. Transition from 256-bit AVX code to 128-bit legacy SSE code.
2. Transition from 128-bit legacy SSE code to either 128- or 256-bit AVX code. This only happens if there was a preceding AVX256->legacy SSE transition penalty.
The primary goal is to remove the AVX to SSE transition penalty.
Added two emitter flags: contains256bitAVXInstruction indicates whether the JIT method contains 256-bit AVX code; containsAVXInstruction indicates whether the method contains 128-bit or 256-bit AVX code.
Issue VZEROUPPER in the prolog if the method contains 128-bit or 256-bit AVX code, to avoid the legacy SSE to AVX transition penalty; this could happen in a reverse PInvoke situation. Issue VZEROUPPER in the epilog
if the method contains 256-bit AVX code, to avoid the AVX to legacy SSE transition penalty.
To limit the code size increase, we only issue VZEROUPPER before a PInvoke call to a user-defined function if the JIT method contains 256-bit AVX code, assuming the user-defined function contains legacy
SSE code. No need to issue VZEROUPPER after the PInvoke call because the SSE to AVX transition penalty won't happen, since the AVX to SSE transition has been taken care of before the PInvoke call.
We measured ~3% to 1% performance gain on TechEmPower plaintext and verified that the VTune AVX/SSE events OTHER_ASSISTS.AVX_TO_SSE and OTHER_ASSISTS.SSE_TO_AVX have been reduced to 0.
JITUtils reports no code size increase on Linux for Scenario 1 (running the mscorlib and frameworks diffs using just the assemblies made available by jitutils) and Scenario 2 (running mscorlib, frameworks, and test asset diffs using the resources generated for a CoreCLR test run). Further analysis of the dasm code appears to verify the report, since no AVX instructions are present in the dasm code, and thus no additional VZEROUPPER instructions are inserted.
Fix #7240
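As a condensed sketch of where the vzeroupper instructions end up, assuming the genVzeroupperIfNeeded helper and emitter flags discussed in the review above (the epilog function name and exact placement are assumptions, not a quote of the merged diff):

// Prolog: any 128/256-bit AVX in the method => clear the upper YMM state up front,
// which covers the reverse PInvoke (legacy SSE caller) case.
void CodeGen::genPreserveCalleeSavedFltRegs(unsigned lclFrameSize)
{
    genVzeroupperIfNeeded(false);
    // ... save callee-saved float registers ...
}

// Epilog: only a method that emitted 256-bit AVX needs vzeroupper before returning
// to a (possibly legacy SSE) caller.
void CodeGen::genRestoreCalleeSavedFltRegs(unsigned lclFrameSize)
{
    // ... restore callee-saved float registers ...
    genVzeroupperIfNeeded(true);
}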