-
Notifications
You must be signed in to change notification settings - Fork 2.6k
Extend the DIV/MOD dividend into RDX:RAX only if needed #1241
Conversation
@mikedn, Thanks for signing the contribution license agreement so quickly! Actual humans will now validate the agreement and then evaluate the PR. |
@mikedn, thanks for the contribution. Someone will review the code soon. One thing to mention, whenever changes are made that impact the code-sequences generated by the JIT, we typically do 2 things: run it over our extensive regression test base, and examine asm (assembly) diffs and benchmark impact. At the moment the capability for non-MS devs to do this is not available. We are planning to expose this capability in the future, but until then I'll probably have someone (probably me) run this change internally for you just to make sure things look good. So this is just a heads up might be a delay of a few days while this gets done. |
Sounds like a lot of trouble for such a small change, especially that this may require additional work. Feel free to close this until the diff capability becomes available. I'd love to see such assembly diffs myself. |
@mikedn Mike, apologies for the delay. There are 396 functions with diffs in our regression base, all are good improvements: 2206 bytes of code size reduction. One nice improvement on hot path in mscorlib::System.Buffer:Memmove(long,long,long) This change looks good to me. And agree about the remaining portions of code that are still using RAX:RDX when in fact any GPR would work. We would be happy to continue to accept PRs in this area. Perhaps we can rig up a crude way for you to inspect mscorlib/corefx asm diffs for small changes using COMPLUS_JitDisasm so you're not blocked on our turn-around time. |
@cmckinsey Thank you for review! The RAX:RDX issue is more or less caused by LSRA's getKillSetForNode special casing GT_MOD & co. It appears that this information could have been passed from lowering to LSRA via TreeNodeInfo but maybe it was decided that the additional space that would be required in TreeNodeInfo is not worth it. If that's indeed the case then the only remaining option is to move this optimization from codegen to lowering so the register allocation doesn't see the GT_MOD at all. Unfortunately that may be a bit problematic in the case of signed integers - it uses CDQ and this isn't directly representable in IR. Some input regarding what's the best approach (changing getKillSetForNode or moving the optimization) is needed here. There are a few other issues related to the constant divisors that are subject to this optimization. One funny case is that this optimization works for negative divisors but fgMorphModByConst steals it because it checks isPow2(cons) instead of isPow2(abs(cons)) like lowering does. These are probably minor issue as the kind of constants that have such problems are less likely to show up as divisors. |
@CarolEidt for feedback on the LSRA interaction with GT_MOD. Regarding the missing cases, I think there is a lot of opportunity for improvement handling DIV/MOD by constants, including MUL by constants. Agreed negative constants may not be as frequent. But we're considering whether to make deeper investments here, perhaps as aggressive as VC/LLVM compilers. Just speaking aloud, but you should still commit your PR. |
I rebased the original commit and added 2 more commits which show/fix some of the DIV/MOD issues. The first of the added commits is a trivial change in The second commit moves the DIV/MOD optimization from codegen to lowering for unsigned integers. This is trivial to do as it simply requires replacing GT_UDIV/GT_UMOD with GT_RSZ/GT_AND and adjusting the constant accordingly. Ideally the same thing should be done for signed integers but I've yet to figure out how, I need to make a copy of the dividend and that seems to require a separate statement and local variable. If this can be done then LSRA doesn't need to be changed. |
@dotnet-bot, test this please. The directed graph test was failing only in windows_debug job (Jenkins URL):
|
The new changes to move this to lower LGTM. @sivarv and @CarolEidt to take a look and provide Mike some guidance on the interaction issue with LSRA. |
@cmckinsey Thanks. Note that as things are now LSRA doesn't need to be modified. Since the unsigned case is handled in lowering only the signed case passes through LSRA and it needs Also, I think it the code would simpler if the optimization could be done in one place. Unfortunately it seems that the signed case is difficult to handle in lower. Maybe it's just me being used to linear IR but this tree IR seems a bit counterproductive. |
@sivarv can you tell mikedn how to make a copy of the dividend in lower so we can have signed/unsigned div/mod expansion at the same place in the phase ordering. |
@mikedn Lowering::TreeNodeInfoInit() - this routine is meant for specifying register requirements and not supposed to make IR modifications. Lowering::LowerNode() is the place where we should place all IR lowering transformations. For example, in this case Lowering::LowerNode() could add a case for unsigned DIV/MOD and lower the node appropriately. And in TreeNodeInfoInit() we just need to assert that we don't get to see an unsigned div/mod which is a power of 2. Look at some of the routines that LowerNode() in turn is calling for examples of how to make IR modifications in Lower. In most cases we might not be required to create new nodes and can just do in-place manipulation of gentree nodes. Where in-place manipulation not possible, we create new gentree nodes, update tree node links and update linear order links (e.g. fgInsertLinearNodeBefore). Rationalize.cpp is another file where you will find some examples of IR manipulation. |
Thanks for assistance. I've move both signed and unsigned div/mod pow2 optimizations in LowerNode. I also added some tests for the various div/mod by const cases as the existing tests don't seem to cover this area well. I'm not 100% sure that it's a good idea to implement the signed case in lowering because the generated code is slightly larger - partly due to the use of a shift instead of |
@@ -499,6 +499,16 @@ void Lowering::LowerNode(GenTreePtr* ppTree, Compiler::fgWalkData* data) | |||
LowerAdd(ppTree, data); | |||
break; | |||
|
|||
case GT_UDIV: | |||
case GT_UMOD: | |||
LowerUDivUMod(*ppTree); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about LowerUnsignedDivOrMod()
@cmckinsey @sivarv I haven't looked through diffs, internal test runs, etc. Any concerns here that need to be addressed before merging? Presumably all the changes should be squashed. |
The changes look very good. I would like to see the commits squashed, but otherwise I think we should get this merged. I don't think anyone has taken these changes and done an internal test run, but I believe these changes are disjoint enough from desktop/coreclr differences that it should not be an issue. |
Later today I'll squash this down to 2 commits, one for the tests and one for the actual change (it would be nice to run the tests before and after the actual change). I'll also update commit message asm examples as some have become stale. |
Optimizing GT_DIV/GT_UDIV/GT_MOD/GT_UMOD by power of 2 in codegen is problematic because the xarch DIV instruction has special register requirements. By the time codegen decides to perform the optimization the rax and rdx registers have been already allocated by LSRA even though they're not always needed (as it happens in the case of unsigned division where CDQ isn't used). Since the JIT can't represent a CDQ instruction in its IR an arithmetic shift (GT_RSH) has been instead to extract the dividend sign. xarch's SAR is larger than CDQ but it has the advantage that it doesn't require specific registers. Also, arithmetic shifts are available on architectures other than xarch. Example: method "static int foo(int x) => x / 8;" is now compiled to mov eax, ecx mov edx, eax sar edx, 31 and edx, 7 add edx, eax mov eax, edx sar eax, 3 instead of mov eax, ecx cdq and edx, 7 add eax, edx sar eax, 3 As a side-effect of this change the optimization now also works when the divisor is too large to be contained. Previously this wasn't possible because the divisor constant needed to be modified during codegen but the constant was already loaded into a register. Example: method "static ulong foo(ulong x) => x / 4294967296;" is now compiled to mov rax, rcx shr rax, 32 whereas before a DIV instruction was used. This change also fixes an issue in fgShouldUseMagicNumberDivide. The optimization that is done in lower can handle negative power of 2 divisors but fgShouldUseMagicNumberDivide handled those cases because it didn't check the absolute value of the divisor. Example: method "static int foo(int x) => return x / -2;" is now compiled to mov eax, ecx mov edx, eax shr edx, 31 add edx, eax sar edx, 1 mov eax, edx neg eax instead of mov eax, 0x7FFFFFFF imul edx:eax, ecx mov eax, edx sub eax, ecx mov edx, eax shr edx, 31 add eax, edx
Squashed and updated commit comment to include various generated assembly examples. Also removed a TODO from lowerarm64.cpp as it no longer applies:
|
Fun.
|
@dotnet-bot test OSX x64 Checked Build and Test |
👍 |
On 1st Mar'16, @pgavlin had run asm diffs on desktop with @mikedn 's changes and we have found the following code size regressions: Division by const +2:
Similarly, for computing N % Pow(2)
As you can see from above that these code changes generate one instruction more than what RyuJIT previously generates. @mikedn - Just wondering whether you have reworked your fix since then. If the code size regression exists for the above cases, we might want to track with a Git issue. |
Nope. I mentioned this before code review, the code is slightly larger because it uses a move and an arithmetic shift instead of |
@mikedn Unfortunately, I discovered that this change led to a JIT assert in some internal testing (namely, ngen of System.Windows.Forms.dll with an x64 RyuJIT). I've reverted this change (on the TFS side of the TFS<->GitHub mirror; it will propagate over to GitHub soon). Someone internal to Microsoft will need to generate a repro case for you, or else take your change and debug the problem. The assert:
The tree:
During codegen:
|
I'll try to get the relevant piece of code from System.Windows.Forms and compile it using coreclr. Anyway, that assert seems familiar, I may have an idea about what's going on. |
The assert can be reproduced using the following program (distilled from CheckedListBox:OnDrawItem): class Program
{
int i;
[MethodImpl(MethodImplOptions.NoInlining)]
int Test(int h)
{
int x = h + 4;
int f = (x - i) / 2;
if (i > h)
return 0;
return f;
}
static void Main()
{
new Program().Test(42);
}
} CSE introduces an embedded statement for
|
…Forms.dll ==================== 007551: Merge pull request dotnet#1241 from mikedn/modopt Extend the DIV/MOD dividend into RDX:RAX only if needed ==================== Assert failure(PID 33656 [0x00008378], Thread: 17792 [0x4580]): Assertion failed 'addr->OperIsAddrMode() || (addr->IsCnsIntOrI() && addr->isContained()) || !addr->isContained()' in 'System.Windows.Forms.CheckedListBox:OnDrawItem(ref):this' (IL size 1216) File: e:\dd\projectk\src\ndp\clr\src\jit\emitxarch.cpp Line: 2698 Image: C:\Windows\Microsoft.NET\Framework64\v4.0.rb1605209\mscorsvw.exe The tree: ***** BB41, stmt 82 (embedded) ( 6, 8) [003723] ------------ * stmtExpr void (embedded) (IL 0x109... ???) N1045 ( 3, 2) [000115] ------------ | /--* lclVar ref V00 this u:2 REG rcx $80 N1047 ( 1, 4) [002642] ------------ | +--* const long 344 field offset Fseq[idealCheckSize] REG NA $10b N1049 ( 4, 6) [002643] -------N---- | /--* + byref REG NA $356 N1051 ( 6, 8) [000116] ----GO------ | /--* indir int REG rcx <l:$685, c:$2ef> N1053 ( 6, 8) [003669] DA--GO------ \--* st.lclVar int V172 cse1 rcx REG rcx RV During codegen: Generating BB41, stmt 71 Holding variables: [rbx rsi rdi r12-r15] Generating: N1043 ( 3, 2) [000114] ------------ * lclVar int V05 loc3 u:3 r12 (last use) REG r12 RV $31a Generating: N1045 ( 3, 2) [000115] ------------ * lclVar ref V00 this u:2 REG rcx $80 IN00db: mov rcx, gword ptr [V00 rbp+10H] GC regs: 00000040 {rsi} => 00000042 {rcx rsi} Generating: N1047 ( 1, 4) [002642] ------------ * const long 344 field offset Fseq[idealCheckSize] REG NA $10b Generating: N1049 ( 4, 6) [002643] -------N---- * + byref REG NA $356 Generating: N1051 ( 6, 8) [000116] ----GO------ * indir int REG rcx <l:$685, c:$2ef> ... assert ... (This is rollback dotnet#2: the TFS/GitHub mirror unfortunately undid rollback CS#1605814 with CS#1605840. This change should avoid that problem.) [tfs-changeset: 1605917]
So, what do we do with this? Can this PR be reopened or should I create a new PR with the proper fix? At a minimum the associated issue should be reopened. |
I reopened #1207. I would probably just open a new PR and mention this one for continuity. |
@cmckinsey, are there still plans on making those diff tools available for community? |
@jasonwilliams200OK, @mikedn, I've put some tools out at https://github.com/dotnet/jitutils that allow diffing of jit dasm to be done across multiple platforms. Take a look. Fair warning these are new so there could be some issues - bring me in early if something goes wrong. |
Extend the DIV/MOD dividend into RDX:RAX only if needed Commit migrated from dotnet/coreclr@0075514
…ystem.Windows.Forms.dll ==================== 007551: Merge pull request dotnet/coreclr#1241 from mikedn/modopt Extend the DIV/MOD dividend into RDX:RAX only if needed ==================== Assert failure(PID 33656 [0x00008378], Thread: 17792 [0x4580]): Assertion failed 'addr->OperIsAddrMode() || (addr->IsCnsIntOrI() && addr->isContained()) || !addr->isContained()' in 'System.Windows.Forms.CheckedListBox:OnDrawItem(ref):this' (IL size 1216) File: e:\dd\projectk\src\ndp\clr\src\jit\emitxarch.cpp Line: 2698 Image: C:\Windows\Microsoft.NET\Framework64\v4.0.rb1605209\mscorsvw.exe The tree: ***** BB41, stmt 82 (embedded) ( 6, 8) [003723] ------------ * stmtExpr void (embedded) (IL 0x109... ???) N1045 ( 3, 2) [000115] ------------ | /--* lclVar ref V00 this u:2 REG rcx $80 N1047 ( 1, 4) [002642] ------------ | +--* const long 344 field offset Fseq[idealCheckSize] REG NA $10b N1049 ( 4, 6) [002643] -------N---- | /--* + byref REG NA $356 N1051 ( 6, 8) [000116] ----GO------ | /--* indir int REG rcx <l:$685, c:$2ef> N1053 ( 6, 8) [003669] DA--GO------ \--* st.lclVar int V172 cse1 rcx REG rcx RV During codegen: Generating BB41, stmt 71 Holding variables: [rbx rsi rdi r12-r15] Generating: N1043 ( 3, 2) [000114] ------------ * lclVar int V05 loc3 u:3 r12 (last use) REG r12 RV $31a Generating: N1045 ( 3, 2) [000115] ------------ * lclVar ref V00 this u:2 REG rcx $80 IN00db: mov rcx, gword ptr [V00 rbp+10H] GC regs: 00000040 {rsi} => 00000042 {rcx rsi} Generating: N1047 ( 1, 4) [002642] ------------ * const long 344 field offset Fseq[idealCheckSize] REG NA $10b Generating: N1049 ( 4, 6) [002643] -------N---- * + byref REG NA $356 Generating: N1051 ( 6, 8) [000116] ----GO------ * indir int REG rcx <l:$685, c:$2ef> ... assert ... (This is rollback dotnet/coreclr#2: the TFS/GitHub mirror unfortunately undid rollback CSdotnet/coreclr#1605814 with CSdotnet/coreclr#1605840. This change should avoid that problem.) [tfs-changeset: 1605917] Commit migrated from dotnet/coreclr@ce8e7e3
When the divisor is a power of 2 GT_UMOD/GT_UDIV are emitted as INS_shr/INS_and. This codepath doesn't require the dividend to be in RDX:RAX.
This includes the special case where the divisor is 1 and the dividend can be moved direclty to the target register without going through RAX. This case doesn't appear to be ever hit, probably it is handled earlier.
Fixes #1207
There are quite a few other issues in this area (including the fact that RAX and RDX are reserved anyway) but fixing those would probably require a discussion first. For now a small fix. I can do more if the JIT team agrees.