Skip to content

Conversation

@tannergooding
Copy link
Member

This updates how Vector64/128/256.Load and Vector64/128/256.Store are handled.

In particular, Load is imported as a GT_IND and Store is imported as a GT_ASG. This allows existing optimizations that are present for the more common IR to kick in.

The primary "negative" of this PR is that in disassembly. In particular, GT_HWINTRINSIC tracks a base type where-as GT_IND and GT_ASG do not. This means that where a Load/Store for Vector128<int> might use a movdqu, the corresponding code for GT_IND/ASG always uses movups. This should have no impact since loads/stores are handled by their own pipeline that counts as neither "integer" nor "floating-point" based, so there is no additional latency for its subsequent use.

This same optimization will not be possible for more complex types of loads/stores, such as aligned, scalar, or non-temporal, without additional work to track the relevant information.

@ghost ghost assigned tannergooding Jan 10, 2023
@ghost ghost added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Jan 10, 2023
@ghost
Copy link

ghost commented Jan 10, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

This updates how Vector64/128/256.Load and Vector64/128/256.Store are handled.

In particular, Load is imported as a GT_IND and Store is imported as a GT_ASG. This allows existing optimizations that are present for the more common IR to kick in.

The primary "negative" of this PR is that in disassembly. In particular, GT_HWINTRINSIC tracks a base type where-as GT_IND and GT_ASG do not. This means that where a Load/Store for Vector128<int> might use a movdqu, the corresponding code for GT_IND/ASG always uses movups. This should have no impact since loads/stores are handled by their own pipeline that counts as neither "integer" nor "floating-point" based, so there is no additional latency for its subsequent use.

This same optimization will not be possible for more complex types of loads/stores, such as aligned, scalar, or non-temporal, without additional work to track the relevant information.

Author: tannergooding
Assignees: tannergooding
Labels:

area-CodeGen-coreclr

Milestone: -

@tannergooding
Copy link
Member Author

It's worth noting that @EgorBo and @SingleAccretion have both suggested this change previously (with Egor doing a minimal prototype as well). At the time, I was unsure if such a change would have been profitable given the planned work around AVX-512, VectorMask, and other scenarios where the simdBaseType became relevant again. After further consideration and investigation, I believe that the general transition here is going to be profitable, even if we ultimately have some other cases that do not get to participate in all the same optimizations as IND/ASG presently have.

@tannergooding tannergooding marked this pull request as ready for review January 10, 2023 17:38
@tannergooding
Copy link
Member Author

CC. @dotnet/jit-contrib

Results in code simplification, 50k/30k smaller codegen for minopts on Arm64/x64, 90k/3k smaller codegen for fullopts on Arm64/x64, and a -0.02% to -0.05% TP improvement

@tannergooding
Copy link
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Copy link
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Copy link
Member Author

tannergooding commented Jan 12, 2023

No changes, just merged in main in a hopes that it resolves the unrelated Arm64 CI failures.

All jitstress tests passed, so not going to rerun them.

Copy link
Member

@EgorBo EgorBo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice clean up, looking forward to seeing perf impact

@tannergooding tannergooding merged commit a2029fe into dotnet:main Jan 13, 2023
@tannergooding tannergooding deleted the better-simdld/st branch January 13, 2023 15:15
@ghost ghost locked as resolved and limited conversation to collaborators Feb 12, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants