Skip to content

JIT: Refine assert for RMW check in crc32 codegen #108750

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jakobbotsch
Copy link
Member

The crc32 HW intrinsic is a 2 operand node that produces a register; however, the underlying instruction is an RMW instruction that destructively modifies op1. For that reason the codegen needs to issue a move of op1 into the target register before crc32 is issued.

There is an assert that this mov does not overwrite op2 in the process. When LSRA builds uses for the operands it needs to mark op2 as delay freed to ensure op2 and the target do not end up in the same register. However, due to optimizations LSRA may skip this marking when op1 is a last use. Thus we can end up hitting the assert.

The most logical solution seems to be to actually ensure that op2 is always delay freed, such that it never shares the register with the target. However, the handling for other RMW nodes (like GT_SUB) does not have similar logic to this; instead it relies on the situation to not happen except for one particular case where it is ok: when op1 and op2 are actually the same local. This PR enhances the assert to match the checks that happen for other RMW nodes.

Fix #108601

The crc32 HW intrinsic is a 2 operand node that produces a register;
however, the underlying instruction is an RMW instruction that
destructively modifies `op1`. For that reason the codegen needs to issue
a move of op1 into the target register before `crc32` is issued.

There is an assert that this `mov` does not overwrite op2 in the
process. When LSRA builds uses for the operands it needs to mark op2 as
delay freed to ensure op2 and the target do not end up in the same
register. However, due to optimizations LSRA may skip this marking when
op1 is a last use. Thus we can end up hitting the assert.

The most logical solution seems to be to actually ensure that op2 is
always delay freed, such that it never shares the register with the
target. However, the handling for other RMW nodes (like `GT_SUB`) does
not have similar logic to this; instead it relies on the situation to
not happen except for one particular case where it is ok: when `op1` and
`op2` are actually the same local. This PR enhances the assert to match
the checks that happen for other RMW nodes.
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 10, 2024
@jakobbotsch
Copy link
Member Author

This matches the logic in genCodeForBinary from here:

#ifdef DEBUG
unsigned lclNum1 = (unsigned)-1;
unsigned lclNum2 = (unsigned)-2;
GenTree* op1Skip = op1->gtSkipReloadOrCopy();
GenTree* op2Skip = op2->gtSkipReloadOrCopy();
if (op1Skip->OperIsLocalRead())
{
lclNum1 = op1Skip->AsLclVarCommon()->GetLclNum();
}
if (op2Skip->OperIsLocalRead())
{
lclNum2 = op2Skip->AsLclVarCommon()->GetLclNum();
}
assert(GenTree::OperIsCommutative(oper) || (lclNum1 == lclNum2));
#endif

To be honest I do not see the constraint properly encoded in LSRA for the common RMW nodes or for crc32, which is a bit scary. In other words, if we have IR like

/- op1 LCL_VAR V01 last use
/- op2 LCL_VAR V02
GT_SUB

then I think op2 will have a use created that is not marked delay freed. That means the following register assignment is perfectly allowable:

/- op1 LCL_VAR V01 last use rdx
/- op2 LCL_VAR V02 rcx
GT_SUB rcx

however, this assignment would hit the assert in genCodeForBinary. The only reason this does not happen seems to be due to preferencing making it so that LSRA ends up not picking this assignment.

Comment on lines 2244 to 2256
if (op2->isUsedFromReg() && (op2->GetRegNum() == targetReg) && (op1Reg != targetReg))
{
// The move we are issuing is going to overwrite 'op2' with
// 'op1'. This is still ok if 'op1' and 'op2' are the same
// local; normally they would be in the same register in that
// case, but LSRA may have inserted a copy for one of the
// operands (either due to stress or for any other reason).
// This check mirrors what happens in genCodeForBinary.
GenTree* op1Skip = op1->gtSkipReloadOrCopy();
GenTree* op2Skip = op2->gtSkipReloadOrCopy();
assert(op1Skip->OperIsLocalRead() && GenTree::Compare(op1Skip, op2Skip) &&
"'mov' to set up crc32 RMW is going to overwrite op2!");
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fairly verbose check that seemingly we'd need to be doing everywhere we currently have assert(!op2->isUsedFromReg() || (op2->GetRegNum() != targetReg) || (op1Reg == targetReg)); or similar (for both x86/x64 and Arm64).

Should it be extracted to a helper method?

-- This is seemingly also an issue everywhere we're doing such a check (which I think is before most RMW moves). Is there something unique to crc32 that's causing it to trigger here and not in the other places?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a fairly verbose check that seemingly we'd need to be doing everywhere we currently have assert(!op2->isUsedFromReg() || (op2->GetRegNum() != targetReg) || (op1Reg == targetReg)); or similar (for both x86/x64 and Arm64).

Should it be extracted to a helper method?

-- This is seemingly also an issue everywhere we're doing such a check (which I think is before most RMW moves). Is there something unique to crc32 that's causing it to trigger here and not in the other places?

Can you point to some of them? If they use this pattern and LSRA builds the uses in the same way, then yeah they can likely hit this problem.

I think a preferable fix would be what I pointed out: skip the LSRA optimization that currently happens inside AddDelayFreeUses, since that optimization really is not correct (in the sense that it does not properly model the constraints that we have here). We will probably need to keep it for the normal RMW ops due to diffs, but for other cases like crc32 the optimization does not seem important enough to warrant.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have various asserts in the emitIns_SIMD_* helpers: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/emitxarch.cpp#L8837. I believe we also have similar being done for Arm64 and in the various *codegen*.cpp files (codegenxarch and hwintrinsiccodegenarm64 for example).

I don't think we want to unnecessarily pessimize the codegen here by forcing them to never match either. We really just want a basic check that the RMW handling (i.e. the mov tgt, op1) isn't overwriting one of the other input operands (in other words, that they were correctly marked delay free).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extracted a function to check the "same local" case. It wasn't straightforward for me to find the other places that need this check, so let's just cross that bridge if we see them in the future. I think that's ok since the assert is harmless in this case.

#endif

assert(!op2->isUsedFromReg() || (op2->GetRegNum() != targetReg) || (op1Reg == targetReg) ||
genIsSameLocalVar(op1, op2));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Something that's been nagging the back of my mind...

Do we know why LSRA is assigning a single local var to two registers simultaneously? If it's the same local and its the same instruction/use, then it really should've ensured that op1Reg == op2Reg, no?

That is, this seems like a general deficiency or issue with assignments leading to poorer codegen and higher register pressure.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
2 participants