-
Notifications
You must be signed in to change notification settings - Fork 5.2k
Ensure Vector<T>.op_Multiply is handled as an intrinsic in appropriate cases #49503
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
CC. @echesakovMSFT |
|
CC. @dotnet/jit-contrib |
| GenTree** broadcastOp = nullptr; | ||
|
|
||
| if (varTypeIsArithmetic(op1->TypeGet())) | ||
| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Arm64 you don't neet broadcast operation when either op1 or op2 are floating-point.
Since the operands will be in SIMD registers already you should use MultiplyByElement (where elementIndex is 0) instead
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also don't think you need to broadcast an integer value to the whole SIMD register.
You can use ins Vd.T[0], Xd followed by mul Vd, Vn, Vd.T[0] instead.
In this case, you can share some of the implementation for floating-point and integer types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should need a CreateScalarUnsafe in this case instead then. I'll update to do the right thing.
- Notably this wasn't possible before the ARM HWIntrinsics were brought online, which is likely why Vector wasn't doing that already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Notably, MultiplyByScalar and MultiplyBySelectedScalar don't work for byte/sbyte and so those still need a broadcast.
| case TYP_INT: | ||
| case TYP_UINT: | ||
| { | ||
| hwIntrinsic = NI_AVX2_MultiplyLow; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouildn't this be guarded with compOpportunisticallyDependsOn(InstructionSet_AVX2)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, its already asserted at the top of the method (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/simdashwintrinsic.cpp#L392) as well as properly guarded further up in the stack.
We should never encounter SimdAsHWIntrinsicClassId::VectorT256 if AVX2 is not supported.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why at least AVX2 is required for VectorT256? Shouldn't AVX be sufficient?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Vector<T> supports all 10 primitive types and can only properly accelerate the integer types with AVX2 so it was decided it is only 32-bytes if AVX2 is supported and is 16-bytes otherwise, for the best overall perf/experience.
47e1038 to
a79a1cb
Compare
|
@echesakovMSFT, any other feedback here or is it good to merge? |
echesakov
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good.
Thank you
This resolves #49071 and resolves #30923 by ensuring
Vector<T> * Tis treated as an intrinsic so it generates the ideal codegen and not a for loop:.NET 5
Disassembly .NET 5.0.4 (5.0.421.11614), X64 RyuJIT
.NET 6
Disassembly .NET 6.0.0 (42.42.42.42424), X64 RyuJIT
Notably it is still slightly slower. This appears to be largely due to additional
movapsinserted before thebroadcastinstructions:This appears to be because we insert
Avx2.BroadcastScalarToVector256(Vector128.CreateScalarUnsafe(value))during lowering. The register allocator then assigns a different register forCreateScalarUnsafeeven though its marked astgtPrefUse(this appears to be related to this happening inside a loop).I'm investigating further to see if there is a trivial fix here (other than expanding
Vector256_Createearlier, such as in the importer).