Adding additional 8 eGPR. #113988
Conversation
Force-pushed from 657a419 to 50153a7
Force-pushed from 5e5548c to 85c9ef8
Force-pushed from 3a45505 to 963c341
This reverts commit cc6d454.
Copilot wasn't able to review any files in this pull request.
Files not reviewed (11)
- src/coreclr/jit/compiler.cpp: Language not supported
- src/coreclr/jit/emit.h: Language not supported
- src/coreclr/jit/emitxarch.cpp: Language not supported
- src/coreclr/jit/emitxarch.h: Language not supported
- src/coreclr/jit/lsra.cpp: Language not supported
- src/coreclr/jit/lsra.h: Language not supported
- src/coreclr/jit/lsrabuild.cpp: Language not supported
- src/coreclr/jit/register.h: Language not supported
- src/coreclr/jit/target.h: Language not supported
- src/coreclr/jit/targetamd64.h: Language not supported
- src/coreclr/jit/unwindamd64.cpp: Language not supported
@BruceForstall @jakobbotsch @kunalspathak This PR is ready for review

@DeepakRajendrakumaran - can you double check why the TP of arm64 got affected?

This commit affects all targets but should be negligible for non-x64 - commit. This is to skip over eGPRs for non-APX x64 machines when iterating over `RegisterRecord`s. This can be extended to arm64 if there are any registers that need to be skipped over, with or without features, for arm64; something similar will need to be added - Change for x86

@kunalspathak This is how tpdiff looks with just the eGPRs and without the logic to skip regs
Some questions
src/coreclr/jit/target.h
Outdated
@@ -239,6 +239,10 @@ typedef uint64_t regMaskSmall;
#define HAS_MORE_THAN_64_REGISTERS 1
#endif // TARGET_ARM64

#ifdef TARGET_AMD64
Merge with ARM64 case above?
Done
@@ -791,8 +791,10 @@ class emitter
// Note that we use the _idReg1 and _idReg2 fields to hold
// the live gcrefReg mask for the call instructions on x86/x64
//
#if !defined(TARGET_AMD64)
Why did you move this only for x64?
The "Space taken up to here" comment, below, is now wrong and needs to be updated, as do the "Space taken up to here" comments that follow.
> Why did you move this only for x64?

It's a bit of a weird problem. `REGNUM_BITS` changing from 6 to 7 results in `_idReg1` and `_idReg2` taking up an extra bit each. This throws off the `instrDesc` size for x64 and increases it significantly due to padding. My solution was to move the position of `_idReg1` and `_idReg2` for x64 to one which did not cause the addition of as many padding bits.
I wonder why it matters where `_idReg1`/`_idReg2` are placed to determine padding. All the types here have a base type of `unsigned`, which means that bitfields of them should all be bit-packable. It might make sense for performance reasons to avoid splitting `_idReg1`/`_idReg2` across 32-bit (or maybe even byte) boundaries, but that's not why you did it.
If I leave it as it was, `_idReg1` cannot stay within a 32-bit boundary since it starts at Byte 3, Bit 2 (it was not a problem earlier when it was 6 bits). So, it gets moved to Byte 4, Bit 0.

Edit - that's my interpretation.
Oh, I see: a bit field can't extend beyond the limit of the base type (here, `unsigned`).

btw, is the tool you're using to show field offsets public?
I have updated the "Space taken up to here" comments
src/coreclr/jit/lsrabuild.cpp
Outdated
@@ -1936,6 +1936,10 @@ void LinearScan::buildPhysRegRecords()
RegRecord* curr = &physRegs[reg];
curr->init(reg);
}
#if defined(TARGET_AMD64)
Doesn't this fall out naturally? Or are there cases where `REG_NEXT(get_REG_INT_LAST()) != REG_FP_FIRST`?

Should this be all-platform (no ifdef)?
This is an x64-specific issue. There are 2 cases:

- x64 with APX ON: `get_REG_INT_LAST()` = `REG_R31`, and so `REG_NEXT(get_REG_INT_LAST()) == REG_FP_FIRST`
- x64 with APX OFF: `get_REG_INT_LAST()` = `REG_R15`, and so `REG_NEXT(get_REG_INT_LAST()) != REG_FP_FIRST`

This is because `REG_NEXT(get_REG_INT_LAST())` = `REG_R16`. `REG_R16` is defined but not valid for x64 targets with no APX support.
src/coreclr/jit/emitxarch.h
Outdated
// TODO-apx: the definition here is incorrect, we will need to revisit this after we extend the register definition.
// for now, we can simply use REX2 as REX.
Is this comment still needed?
Have removed this comment
This LGTM. Would like to hear @kunalspathak feedback on the LSRA changes.
Added a few questions/comments
src/coreclr/jit/lsra.h
Outdated
@@ -544,6 +545,7 @@ class RegRecord : public Referenceable
Interval* previousInterval;

regNumber regNum;
regNumber nextRegNum; // the next active register.
This is adding an additional 288 bytes of memory per method compilation.
Did you take a stab at #112959 (comment) and see if that is cheap?
I did not try the `regIndices` method. I forgot about it. Is this what you had in mind?
Assuming you fix the `ACTUAL_REG_COUNT` in the `else` portion of `isApxSupported`, yes.
> ACTUAL_REG_COUNT

I didn't change `ACTUAL_REG_COUNT` because I don't think it can be used as the condition as presently used. The loops look like the following:

```cpp
for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = REG_NEXT(reg))
{
    ..............
}
```

`AVAILABLE_REG_COUNT` can be either of the following:

- `ACTUAL_REG_COUNT`, or
- `ACTUAL_REG_COUNT - (CNT_HIGHFLOAT + CNT_MASK_REGS)`

I ended up not changing `ACTUAL_REG_COUNT` but rather modifying the loop:

```cpp
int index = 1;
for (regNumber reg = (regNumber)regIndices[0]; reg < AVAILABLE_REG_COUNT; reg = (regNumber)regIndices[index++])
{
    ........................
}
```

This allows me to keep using `AVAILABLE_REG_COUNT`. The alternative would be to update `ACTUAL_REG_COUNT` and use it directly in the for loops.
Can you update it to be something like the following? That way you are not affecting non-xarch platforms. Also, you don't need a separate `activeRegsCount` in that case.

```cpp
#ifdef TARGET_XARCH
#define NEXT_REGISTER(i) i++, reg = regIndices[i]
#else
#define NEXT_REGISTER(reg) REG_NEXT(reg)
#endif

for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = NEXT_REGISTER(reg)) { ... }
```
That doesn't work as is. We will need to track an index and a reg. The reason is that on x64, if I understand what you propose correctly, the following will happen:

- when `reg == REG_R15` (value = 15), `reg = NEXT_REGISTER(reg)` will result in `reg == REG_XMM0` (value = 32)
- the next time we do `NEXT_REGISTER(reg)`, this will result in us looking up `regIndices[32]`, which will then point to `REG_XMM17`

Would something like the following make sense?

```cpp
// REGISTER_LOOKUP : a lookup table to match a register index to active registers.
#ifdef TARGET_AMD64
#define REGISTER_LOOKUP(regIndex) regIndices[(unsigned)regIndex]
#else
#define REGISTER_LOOKUP(regIndex) regIndex
#endif

regNumber regIndex = REG_FIRST;
for (regNumber reg = REGISTER_LOOKUP(regIndex); reg < AVAILABLE_REG_COUNT;
     regIndex = REG_NEXT(regIndex), reg = REGISTER_LOOKUP(regIndex))
{
    ...................
}
```

I imagine for non-x64 this will get optimized to

```cpp
regNumber regIndex = REG_FIRST;
for (regNumber reg = regIndex; reg < AVAILABLE_REG_COUNT; regIndex = REG_NEXT(regIndex), reg = regIndex)
{
    ...................
}
```

and then

```cpp
for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = REG_NEXT(reg))
{
    ...................
}
```
Ah yes, that's right. Maybe you can do something like this then?

```cpp
#ifdef TARGET_XARCH
#define NEXT_REGISTER(reg, index) index++, reg = regIndices[index]
#else
#define NEXT_REGISTER(reg, index) REG_NEXT(reg)
#endif

int index = 0;
for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = NEXT_REGISTER(reg, index)) { ... }
```
If I do that, `reg = NEXT_REGISTER(reg, index)` will resolve to the following for x64:

```cpp
reg = index++, reg = regIndices[index]
```

Which is why I did the following:

Definition and usage
@kunalspathak I have made the requested changes
/azp run runtime-coreclr jitstressregs, runtime-coreclr libraries-jitstressregs

Azure Pipelines successfully started running 2 pipeline(s).
@DeepakRajendrakumaran - I know you have done a lot of work to improve TP. I am wondering, did you perform an analysis of the remaining regressions, and can you share where the cost is coming from?
/azp run runtime-coreclr libraries-jitstress2-jitstressregs

Azure Pipelines successfully started running 1 pipeline(s).
I ran this. The tpdiff:
- Overall (+2.12%)
- MinOpts (+4.46%)
- FullOpts (+1.58%)

I did try some changes with separating out registers to free in
Overview
This PR does the following 2 things:

- Adds the additional 8 eGPRs (REG_R16-REG_R31) for APX.
- Skips unused registers when iterating: on x64 without APX, for instance, REG_R16-REG_R31 are unused, yet we currently iterate through and process them. This change allows us to just skip these.
TPDIFF impact
From CI for this PR, after the last few PRs to improve tpdiff (see here for numbers before):

windows x64
- Overall (+1.67% to +2.54%)
- MinOpts (+3.14% to +5.71%)
- FullOpts (+1.63% to +2.18%)
Alternate test numbers from running superpmi directly and using the 'Total Time':

% Regression = ((Diff Mean - Base Mean) / Base Mean) * 100