Improve Math.BigMul performance on x64 #117261
Pull Request Overview
This PR backports the Math.BigMul hardware intrinsic support on x64 without the mulx instruction extension. It adds new BigMul overloads in the X86Base APIs, hooks them into Math.BigMul, and extends the JIT to lower, schedule, and codegen these multi-register intrinsics.
- Introduces `BigMul` methods for 32-, 64-, and pointer-size integers in `X86Base` and their platform-not-supported stubs.
- Updates `Math.BigMul` to prefer the new `X86Base.X64.BigMul` path on non-MONO x64, falling back as before.
- Enhances JIT (linear scan, lowering, import, list, codegen, tree layout) to recognize and generate `BigMul` intrinsics.
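For readers unfamiliar with the operation, the following is a minimal sketch of what a 64x64 to 128-bit unsigned multiply computes, written as a portable software version built from 32-bit limbs. It is not the runtime's implementation; the name `big_mul_u64` and its signature are assumptions chosen to mirror the shape of `Math.BigMul(ulong, ulong, out ulong)` (high half returned, low half written through an out parameter).

```cpp
#include <cstdint>

// Hypothetical helper (not the runtime's code): full 64x64 -> 128-bit unsigned
// multiply. Returns the high 64 bits and stores the low 64 bits in *low.
inline uint64_t big_mul_u64(uint64_t a, uint64_t b, uint64_t* low)
{
    // Split both inputs into 32-bit halves held in 64-bit variables.
    const uint64_t a_lo = static_cast<uint32_t>(a);
    const uint64_t a_hi = a >> 32;
    const uint64_t b_lo = static_cast<uint32_t>(b);
    const uint64_t b_hi = b >> 32;

    // Four partial products; none of them can overflow 64 bits.
    const uint64_t p0 = a_lo * b_lo; // contributes to bits 0..63
    const uint64_t p1 = a_hi * b_lo; // contributes to bits 32..95
    const uint64_t p2 = a_lo * b_hi; // contributes to bits 32..95
    const uint64_t p3 = a_hi * b_hi; // contributes to bits 64..127

    // Sum the column at bit 32: three values below 2^32, so no overflow.
    const uint64_t mid = (p0 >> 32) + static_cast<uint32_t>(p1) + static_cast<uint32_t>(p2);

    *low = (mid << 32) | static_cast<uint32_t>(p0);
    return p3 + (p1 >> 32) + (p2 >> 32) + (mid >> 32);
}
```

On x64 a single one-operand `mul` produces the same two halves in RDX:RAX, which is why routing the operation through an intrinsic can replace this kind of multi-step fallback.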
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.cs | Added BigMul intrinsics for various operand widths |
| src/System.Private.CoreLib/src/System/Runtime/Intrinsics/X86/X86Base.PlatformNotSupported.cs | Added BigMul stubs throwing on unsupported platforms |
| src/System.Private.CoreLib/src/System/Math.cs | Routed Math.BigMul to use the new intrinsics on x64 |
| src/coreclr/jit/lsraxarch.cpp | Updated register allocator for BigMul multi-reg defs |
| src/coreclr/jit/lowerxarch.cpp | Enabled containment checks for BigMul |
| src/coreclr/jit/hwintrinsicxarch.cpp | Imported BigMul as a multi-register HW intrinsic |
| src/coreclr/jit/hwintrinsiclistxarch.h | Listed BigMul in the x86 and x64 HW intrinsic tables |
| src/coreclr/jit/hwintrinsiccodegenxarch.cpp | Emitted MUL/IMUL sequence for BigMul |
| src/coreclr/jit/hwintrinsic.h | Updated multi-reg return count for BigMul |
| src/coreclr/jit/gentree.cpp | Defined struct layout for BigMul return |
Comments suppressed due to low confidence (3)
src/libraries/System.Private.CoreLib/src/System/Math.cs:205
- Consider adding unit tests that validate the new `Math.BigMul(ulong, ulong, out ulong)` path on x64 and ensure correct behavior when `X86Base.X64.IsSupported` is true/false.
#if !MONO // X64.BigMul is not yet implemented in MONO
@EgorBo, PTAL.
if (rmOp->isUsedFromReg() && rmOp->GetRegNum() == REG_EAX)
{
    std::swap(rmOp, regOp);
Could you please explain this swap to me?
If op2 (rmOp) is already present in RAX, we use op2 as the implicit operand.
Otherwise we would overwrite that value.
I should probably add a comment similar to the one I added for codeGenMulHi:
runtime/src/coreclr/jit/codegenxarch.cpp
Lines 868 to 871 in 51a4123
// If op2 is already present in RAX use that as implicit operand
if (rmOp->isUsedFromReg() && (rmOp->GetRegNum() == REG_RAX))
{
    std::swap(regOp, rmOp);
Do you want me to add the comment to this PR or to #115966?
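The reasoning behind the swap can be illustrated with a small toy model. This is only a sketch under stated assumptions, not JIT code: `Reg`, `Operand`, `MulPlan`, and `plan_mul` are hypothetical names, and the model captures just one fact about the one-operand `mul` instruction, namely that one source is implicitly RAX. Because multiplication is commutative, either operand may take the implicit role.

```cpp
#include <cstdint>
#include <utility>

// Toy model (hypothetical, not JIT types): a handful of named registers.
enum Reg : int { NONE = -1, RAX = 0, RCX = 1, RDX = 2 };

struct Operand
{
    Reg reg; // register holding the value, or NONE when it lives in memory
    bool inReg() const { return reg != NONE; }
};

struct MulPlan
{
    Operand implicitOp; // must be in RAX when MUL executes
    Operand rmOp;       // explicit r/m operand encoded in the instruction
    bool needsMovToRax; // true when a mov into RAX must be emitted first
};

inline MulPlan plan_mul(Operand op1, Operand op2)
{
    Operand regOp = op1;
    Operand rmOp  = op2;

    // If op2 already sits in RAX, use it as the implicit operand. Without the
    // swap, moving op1 into RAX would clobber op2 before the multiply reads it.
    if (rmOp.inReg() && rmOp.reg == RAX)
    {
        std::swap(regOp, rmOp);
    }

    const bool needsMov = !(regOp.inReg() && regOp.reg == RAX);
    return MulPlan{regOp, rmOp, needsMov};
}
```

In this model, calling `plan_mul` with op1 in RCX and op2 in RAX yields a plan that needs no move at all, whereas skipping the swap would require copying RCX into RAX and losing op2.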
I'm fine with either option, I think the PR is not too big to split it?
> I think the PR is not too big to split it?
I've updated both PRs.
The two reasons for opening this as a separate PR were:
- To show that it works for the non-AVX2 path and passes all tests.
- For the mulx path I am a bit unsure about the best way to handle the allocation of the RDX register. The MULX code produced worse code than MUL for at least one method (you can expand the comment to see the asm code). Either approach is better than what's in main, but I wanted to avoid adding a "mulx" optimization that gives worse code on average.
So this PR is the same as #115966 but without the BMI2 path? Does it have any potential performance impact on BMI2 CPUs or diffs? Could you please resolve the conflicts so we can run the diffs (or close one of the PRs)?
> So this PR is the same as #115966 but without the BMI2 path?
Yes, that is correct.
> Does it have any potential performance impact on BMI2 CPUs or diffs?
It might. Expand "generated code" in the edx ret comment for an example diff and some thoughts.
Rerunning failed test.
This is #115966 but without the mulx support.
See the old PR for benchmark results and generated code.
The reason for opening a separate PR is
Feel free to close it if you prefer to work with the original PR