
JIT: Shrink data section for const vector loads #114040


Open

wants to merge 14 commits into base: main

Conversation

saucecontrol
Member

@saucecontrol commented Mar 29, 2025

Noticed this while looking at the codegen for floating->integral casts, where we use vfixupimms[sd], a SIMD scalar instruction that takes a control table as a vector operand. The table value for that instruction is almost always const, and it's the operand that supports a memory load, so we end up emitting a full SIMD16 vector to the data section for what ends up being a DWORD or QWORD reloc. This change identifies vector constants contained by instructions that read only a scalar and shrinks the emitted data section value down to the scalar size.

We were also emitting full vectors to the data section for uncontained CNS_VEC nodes that are loaded into a register for use. #92017 had previously addressed compressing large vectors down to a SIMD16 broadcast when they were immediately consumed by a STOREIND. This PR compresses down as far as a 4-byte scalar, and does so for all uncontained CNS_VEC loads.
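The compression idea above can be sketched roughly as follows. This is an illustrative C++ sketch, not the JIT's actual implementation; the helper name `MinimalConstSize` and the fixed SIMD16 size are assumptions for the example. If the vector constant is just a smaller chunk repeated, only that chunk needs to land in the data section, and a broadcast load can rebuild the full vector in the register.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: find the smallest chunk (4 or 8 bytes) that, when
// broadcast, reproduces a full 16-byte vector constant. The real JIT logic
// lives in the emitter/codegen and differs in detail.
static unsigned MinimalConstSize(const uint8_t (&vec)[16])
{
    for (unsigned chunk = 4; chunk < 16; chunk *= 2)
    {
        bool repeats = true;
        for (unsigned i = chunk; (i < 16) && repeats; i += chunk)
        {
            repeats = (memcmp(vec, vec + i, chunk) == 0);
        }
        if (repeats)
        {
            return chunk; // emit only 'chunk' bytes + a broadcast load
        }
    }
    return 16; // no repetition: the full vector is required
}
```

For example, a constant whose four DWORD lanes are identical would shrink from 16 bytes of data to 4, at the cost of the slightly longer broadcast instruction encoding.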

Diffs

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 29, 2025
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 29, 2025
Contributor

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Member Author

@saucecontrol left a comment

@tannergooding I've updated this based on the approach we discussed on Discord, including using broadcast loads where possible.

Latest diffs look good. It's overall a small code size regression, but generally we're trading +1 byte of code for -8 or more bytes of data.

@@ -28266,9 +28266,6 @@ bool GenTreeHWIntrinsic::OperIsMemoryLoad(GenTree** pAddr) const
             case NI_AVX2_ConvertToVector256Int16:
             case NI_AVX2_ConvertToVector256Int32:
             case NI_AVX2_ConvertToVector256Int64:
-            case NI_AVX2_BroadcastVector128ToVector256:
-            case NI_AVX512F_BroadcastVector128ToVector512:
-            case NI_AVX512F_BroadcastVector256ToVector512:
                 if (GetAuxiliaryJitType() == CORINFO_TYPE_PTR)
Member Author

Most of these deletions are reverting questionable changes from #92017, which attempted to solve the same problem, but only for uncontained CNS_VEC directly consumed by STOREIND. Those cases are handled better by this PR.

@@ -1412,7 +1412,7 @@ void EvaluateWithElementFloating(var_types simdBaseType, TSimd* result, const TS

             case TYP_DOUBLE:
             {
-                result->f64[arg1] = static_cast<float>(arg2);
+                result->f64[arg1] = arg2;
Member Author
This, and the change in valuenum.cpp, are unrelated things I noticed while reading how the SIMD constant folding works.

@saucecontrol saucecontrol changed the title JIT: Shrink data for scalar instructions with contained const vectors JIT: Shrink data section for const vector loads Apr 2, 2025
@tannergooding
Member

It's overall a small code size regression, but generally we're trading +1 byte of code for -8 or more bytes of data.

Yep, that's pretty typical. I'd actually like it if we had a way for SPMI to track the diff of the method-local data section as well, to help account for such changes. Possibly @EgorBo or @jakobbotsch have ideas on whether that's feasible.

// because all scalar and vector loads from memory zero the upper.
assert(emitComp->IsBaselineVector512IsaSupportedDebugOnly());
dataSize = 32;
ins = INS_vbroadcastf32x8;
Member

nit: It's not as important nowadays and I wouldn't block the PR on it. But it might be beneficial to take a simdBaseType and use it to pick a more "optimal" instruction where possible.

This "can" impact alignment, atomicity, general readability of the disasm output, potentially even performance, etc. So trying to match the broadcast used to the user type is beneficial where possible. Of course we may not always have a type and float is the common fallback, but doing a best case of picking f32x8 vs f64x4 vs i32x8 vs i64x4 and so on is beneficial.
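The suggestion above could look something like this sketch. The enum and function here are hypothetical stand-ins (the real JIT works with var_types and INS_* instruction enums, not strings), shown only to illustrate matching the broadcast mnemonic to the SIMD base type:

```cpp
#include <cstring>

// Hypothetical sketch: choose a 32-byte broadcast mnemonic based on the
// SIMD base type, falling back to the float form when no type is known.
enum class SimdBaseType { Unknown, Float, Double, Int32, Int64 };

const char* PickBroadcast32(SimdBaseType t)
{
    switch (t)
    {
        case SimdBaseType::Double: return "vbroadcastf64x4";
        case SimdBaseType::Int32:  return "vbroadcasti32x8";
        case SimdBaseType::Int64:  return "vbroadcasti64x4";
        default:                   return "vbroadcastf32x8"; // float/unknown fallback
    }
}
```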

Member Author

This is what we chatted about yesterday. This method is only called from genSetRegToConst, which generates code for uncontained CNS_VEC, because its consumer needs it in a register. CNS_VEC doesn't have a simdBaseType and apparently doesn't have room to add one.

All we know at this callsite is what the constant is and what register to put it in. The best we can do is produce a register with the correct content, hopefully in the most efficient way possible.

Member

You should still be able to get the user of the node to see what the likely type is and default to float if a user doesn’t exist or if it isn’t a hwintrinsic node.

Member Author

True, I could make a semi-informed guess, but in this particular case it's a purely aesthetic concern. Even going back to old CPU architectures that had a bypass delay when switching between the integer and float domains, load and store instructions have always been type-agnostic in practice. And element size doesn't matter here because we're not using embedded masking; otherwise the constant would be contained in another instruction and not getting its own codegen.

So I think in this case, guessing the type and changing the instruction mnemonic is unnecessary complexity.

Member

@tannergooding left a comment

CC. @dotnet/jit-contrib, @EgorBo, @jakobbotsch for secondary review

@tannergooding
Member

Ping @dotnet/jit-contrib, @EgorBo, @jakobbotsch for secondary review

@BruceForstall
Member

/azp run runtime-coreclr outerloop


Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
