-
Notifications
You must be signed in to change notification settings - Fork 5k
JIT: Refine assert for RMW check in crc32 codegen #108750
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
The crc32 HW intrinsic is a 2 operand node that produces a register; however, the underlying instruction is an RMW instruction that destructively modifies `op1`. For that reason the codegen needs to issue a move of op1 into the target register before `crc32` is issued. There is an assert that this `mov` does not overwrite op2 in the process. When LSRA builds uses for the operands it needs to mark op2 as delay freed to ensure op2 and the target do not end up in the same register. However, due to optimizations LSRA may skip this marking when op1 is a last use. Thus we can end up hitting the assert. The most logical solution seems to be to actually ensure that op2 is always delay freed, such that it never shares the register with the target. However, the handling for other RMW nodes (like `GT_SUB`) does not have similar logic to this; instead it relies on the situation to not happen except for one particular case where it is ok: when `op1` and `op2` are actually the same local. This PR enhances the assert to match the checks that happen for other RMW nodes.
This matches the logic in runtime/src/coreclr/jit/codegenxarch.cpp Lines 1150 to 1167 in e32148a
To be honest I do not see the constraint properly encoded in LSRA for the common RMW nodes or for crc32, which is a bit scary. In other words, if we have IR like /- op1 LCL_VAR V01 last use
/- op2 LCL_VAR V02
GT_SUB then I think /- op1 LCL_VAR V01 last use rdx
/- op2 LCL_VAR V02 rcx
GT_SUB rcx however, this assignment would hit the assert in |
if (op2->isUsedFromReg() && (op2->GetRegNum() == targetReg) && (op1Reg != targetReg)) | ||
{ | ||
// The move we are issuing is going to overwrite 'op2' with | ||
// 'op1'. This is still ok if 'op1' and 'op2' are the same | ||
// local; normally they would be in the same register in that | ||
// case, but LSRA may have inserted a copy for one of the | ||
// operands (either due to stress or for any other reason). | ||
// This check mirrors what happens in genCodeForBinary. | ||
GenTree* op1Skip = op1->gtSkipReloadOrCopy(); | ||
GenTree* op2Skip = op2->gtSkipReloadOrCopy(); | ||
assert(op1Skip->OperIsLocalRead() && GenTree::Compare(op1Skip, op2Skip) && | ||
"'mov' to set up crc32 RMW is going to overwrite op2!"); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fairly verbose check that seemingly we'd need to be doing everywhere we currently have assert(!op2->isUsedFromReg() || (op2->GetRegNum() != targetReg) || (op1Reg == targetReg));
or similar (for both x86/x64 and Arm64).
Should it be extracted to a helper method?
-- This is seemingly also an issue everywhere we're doing such a check (which I think is before most RMW moves
). Is there something unique to crc32
that's causing it to trigger here and not in the other places?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a fairly verbose check that seemingly we'd need to be doing everywhere we currently have assert(!op2->isUsedFromReg() || (op2->GetRegNum() != targetReg) || (op1Reg == targetReg)); or similar (for both x86/x64 and Arm64).
Should it be extracted to a helper method?
-- This is seemingly also an issue everywhere we're doing such a check (which I think is before most RMW moves). Is there something unique to crc32 that's causing it to trigger here and not in the other places?
Can you point to some of them? If they use this pattern and LSRA builds the uses in the same way, then yeah they can likely hit this problem.
I think a preferable fix would be what I pointed out: skip the LSRA optimization that currently happens inside AddDelayFreeUses
, since that optimization really is not correct (in the sense that it does not properly model the constraints that we have here). We will probably need to keep it for the normal RMW ops due to diffs, but for other cases like crc32 the optimization does not seem important enough to warrant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have various asserts in the emitIns_SIMD_*
helpers: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/emitxarch.cpp#L8837. I believe we also have similar being done for Arm64 and in the various *codegen*.cpp
files (codegenxarch
and hwintrinsiccodegenarm64
for example).
I don't think we want to unnecessarily pessimize the codegen here by forcing them to never match either. We really just want a basic check that the RMW handling (i.e. the mov tgt, op1
) isn't overwriting one of the other input operands (in other words, that they were correctly marked delay free).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I extracted a function to check the "same local" case. It wasn't straightforward for me to find the other places that need this check, so let's just cross that bridge if we see them in the future. I think that's ok since the assert is harmless in this case.
#endif | ||
|
||
assert(!op2->isUsedFromReg() || (op2->GetRegNum() != targetReg) || (op1Reg == targetReg) || | ||
genIsSameLocalVar(op1, op2)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something that's been nagging the back of my mind...
Do we know why LSRA is assigning a single local var to two registers simultaneously? If it's the same local and its the same instruction/use, then it really should've ensured that op1Reg == op2Reg
, no?
That is, this seems like a general deficiency or issue with assignments leading to poorer codegen and higher register pressure.
The crc32 HW intrinsic is a 2 operand node that produces a register; however, the underlying instruction is an RMW instruction that destructively modifies
op1
. For that reason the codegen needs to issue a move of op1 into the target register beforecrc32
is issued.There is an assert that this
mov
does not overwrite op2 in the process. When LSRA builds uses for the operands it needs to mark op2 as delay freed to ensure op2 and the target do not end up in the same register. However, due to optimizations LSRA may skip this marking when op1 is a last use. Thus we can end up hitting the assert.The most logical solution seems to be to actually ensure that op2 is always delay freed, such that it never shares the register with the target. However, the handling for other RMW nodes (like
GT_SUB
) does not have similar logic to this; instead it relies on the situation to not happen except for one particular case where it is ok: whenop1
andop2
are actually the same local. This PR enhances the assert to match the checks that happen for other RMW nodes.Fix #108601