JIT: Shrink data section for const vector loads #114040
base: main
Conversation
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
@tannergooding I've updated this based on the approach we discussed on Discord, including using broadcast loads where possible.
Latest diffs look good. It's overall a small code size regression, but generally we're trading +1 byte of code for -8 or more bytes of data.
src/coreclr/jit/gentree.cpp
```diff
@@ -28266,9 +28266,6 @@ bool GenTreeHWIntrinsic::OperIsMemoryLoad(GenTree** pAddr) const
             case NI_AVX2_ConvertToVector256Int16:
             case NI_AVX2_ConvertToVector256Int32:
             case NI_AVX2_ConvertToVector256Int64:
-            case NI_AVX2_BroadcastVector128ToVector256:
-            case NI_AVX512F_BroadcastVector128ToVector512:
-            case NI_AVX512F_BroadcastVector256ToVector512:
                 if (GetAuxiliaryJitType() == CORINFO_TYPE_PTR)
```
Most of these deletions are reverting questionable changes from #92017, which attempted to solve the same problem, but only for uncontained CNS_VEC directly consumed by STOREIND. Those cases are handled better by this PR.
```diff
@@ -1412,7 +1412,7 @@ void EvaluateWithElementFloating(var_types simdBaseType, TSimd* result, const TS
         case TYP_DOUBLE:
         {
-            result->f64[arg1] = static_cast<float>(arg2);
+            result->f64[arg1] = arg2;
```
This and the change in valuenum.cpp are unrelated things I noticed while reading how the SIMD constant folding works.

Yep, that's pretty typical. I'd actually like if we had a way for SPMI to track the diff of the method local data section as well to help account for such changes. Possibly @EgorBo or @jakobbotsch have ideas on if that's feasible.
```cpp
// because all scalar and vector loads from memory zero the upper.
assert(emitComp->IsBaselineVector512IsaSupportedDebugOnly());
dataSize = 32;
ins      = INS_vbroadcastf32x8;
```
nit: It's not as important nowadays and I wouldn't block the PR on it. But it might be beneficial to take a `simdBaseType` and use it to pick a more "optimal" instruction where possible.

This "can" impact alignment, atomicity, general readability of the disasm output, potentially even performance, etc. So trying to match the broadcast used to the user type is beneficial where possible. Of course we may not always have a type and `float` is the common fallback, but doing a best case of picking `f32x8` vs `f64x4` vs `i32x8` vs `i64x4` and so on is beneficial.
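The suggestion above could be sketched roughly like this. This is a hypothetical mapping made up for illustration, not the JIT's actual emitter tables; the enum and function names are assumptions:

```cpp
#include <cassert>
#include <cstring>

// Illustrative element-type enum; the real JIT uses var_types/CorInfoType.
enum BaseType { BT_FLOAT, BT_DOUBLE, BT_INT, BT_LONG };

// Pick a 256-bit-halves broadcast mnemonic that matches the user's element
// type, falling back to the float form when no type is known.
static const char* BroadcastFor256BitHalves(BaseType t)
{
    switch (t)
    {
        case BT_DOUBLE: return "vbroadcastf64x4";
        case BT_INT:    return "vbroadcasti32x8";
        case BT_LONG:   return "vbroadcasti64x4";
        case BT_FLOAT:
        default:        return "vbroadcastf32x8"; // float is the common fallback
    }
}
```

All four mnemonics load the same 32 bytes and duplicate them into both halves of a 512-bit register; the choice only changes how the disassembly reads and which execution domain the decoder assumes.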
This is what we chatted about yesterday. This method is only called from `genSetRegToConst`, which generates code for uncontained `CNS_VEC`, because its consumer needs it in a register. `CNS_VEC` doesn't have a `simdBaseType` and apparently doesn't have room to add one.

All we know at this callsite is what the constant is and what register to put it in. The best we can do is produce a register with the correct content, hopefully in the most efficient way possible.
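With no base type available, the choice at this callsite reduces to the constant's bytes and the target register width. A hypothetical sketch of that decision (the struct, function name, and instruction table are assumptions for illustration, not the real `genSetRegToConst` code):

```cpp
#include <cassert>
#include <cstring>

struct ConstLoadPlan
{
    unsigned    dataSize; // bytes emitted to the data section
    const char* ins;      // instruction used to materialize the register
};

// Given the target register size and the smallest repeating chunk of the
// constant's bytes, pick a data-section size and a load instruction.
static ConstLoadPlan PlanConstVecLoad(unsigned regSizeBytes, unsigned repeatChunk)
{
    // No smaller repeating pattern: load the full constant.
    if (repeatChunk >= regSizeBytes)
    {
        return { regSizeBytes, "vmovups" };
    }
    switch (repeatChunk)
    {
        case 4:  return { 4,  "vbroadcastss" };
        case 8:  return { 8,  "vbroadcastsd" };
        case 16: return { 16, "vbroadcastf32x4" };
        default: return { regSizeBytes, "vmovups" };
    }
}
```

For example, a 32-byte register whose constant repeats every 4 bytes would emit only 4 bytes of data and materialize the register with `vbroadcastss`, trading a slightly longer instruction for a much smaller data section.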
You should still be able to get the user of the node to see what the likely type is, and default to `float` if a user doesn't exist or if it isn't a hwintrinsic node.
True, I could make a semi-informed guess, but in this particular case it's a purely aesthetic concern. Even going back to old CPU architectures that had a bypass delay when switching between integer and float domains, load and store instructions have always been type-agnostic in practice. And element size doesn't matter here because we're not using embedded masking; otherwise the constant would be contained in another instruction and not getting its own codegen.

So I think in this case, guessing the type and changing the instruction mnemonic is unnecessary complexity.
CC @dotnet/jit-contrib, @EgorBo, @jakobbotsch for secondary review

Ping @dotnet/jit-contrib, @EgorBo, @jakobbotsch for secondary review
/azp run runtime-coreclr outerloop
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
Noticed this while looking at the codegen for floating->integral casts, where we use `vfixupimms[sd]`, which is a SIMD scalar instruction that takes a control table as a vector. The table value for that instruction is almost always const, and it's the operand that supports a memory load, so we end up emitting a full SIMD16 vector to the data section for what ends up being a DWORD or QWORD reloc. This change identifies vector constants contained by instructions that read only a scalar and shrinks the emitted data section value down to the scalar size.

We were also emitting full vectors to the data section for uncontained CNS_VEC to be loaded to a register for use. #92017 had previously addressed compressing large vectors down to a SIMD16 broadcast when they were immediately consumed by STOREIND. This PR compresses down as far as a 4-byte scalar, and does it for all uncontained CNS_VEC loads.
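The broadcast compression described above can be illustrated with a small standalone sketch (a hypothetical helper, not the actual JIT implementation): find the smallest power-of-two chunk that repeats across the constant's bytes, then emit only that chunk to the data section and materialize the register with a broadcast.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: given the raw bytes of a vector constant, find the
// smallest power-of-two chunk size (starting at the 4-byte scalar floor the
// PR describes) that repeats across the whole value. A constant that repeats
// every 4 bytes can be stored as a 4-byte data-section entry and loaded with
// a broadcast instead of emitting the full vector.
static unsigned MinRepeatingChunk(const uint8_t* data, unsigned size)
{
    for (unsigned chunk = 4; chunk < size; chunk *= 2)
    {
        bool repeats = true;
        for (unsigned i = chunk; i < size; i += chunk)
        {
            if (memcmp(data, data + i, chunk) != 0)
            {
                repeats = false;
                break;
            }
        }
        if (repeats)
        {
            return chunk; // smallest repeating unit found
        }
    }
    return size; // no smaller repeating pattern; emit the full constant
}
```

Under this scheme a 64-byte all-same-element constant shrinks to a single 4- or 8-byte data-section entry, which is where the "+1 byte of code for -8 or more bytes of data" trade-off mentioned earlier comes from.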
Diffs