
Adding additional 8 eGPR. #113988


Open · wants to merge 15 commits into main

Conversation

@DeepakRajendrakumaran (Contributor) commented Mar 27, 2025

Overview

This PR does the following two things:

  1. Adds an additional 8 eGPRs, taking the total number of x86 GPRs to 32 and the total number of registers to 72.
  2. Adds a way to skip unused registers when iterating through RegRecords.
    On x64 without APX, for instance, REG_R16-REG_R31 are unused. Currently we iterate through and process them; this change allows us to simply skip them.

TPDIFF impact

From CI for this PR, after the last few PRs to improve tpdiff (see here for the numbers before):

windows x64

Overall (+1.67% to +2.54%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +2.07%
benchmarks.run.windows.x64.checked.mch +1.67%
benchmarks.run_pgo.windows.x64.checked.mch +1.83%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch +1.67%
coreclr_tests.run.windows.x64.checked.mch +2.54%
libraries.crossgen2.windows.x64.checked.mch +2.18%
libraries.pmi.windows.x64.checked.mch +1.82%
libraries_tests.run.windows.x64.Release.mch +2.20%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +1.72%
realworld.run.windows.x64.checked.mch +1.71%
smoke_tests.nativeaot.windows.x64.checked.mch +1.75%
MinOpts (+3.14% to +5.71%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +4.34%
benchmarks.run.windows.x64.checked.mch +4.60%
benchmarks.run_pgo.windows.x64.checked.mch +4.28%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch +4.57%
coreclr_tests.run.windows.x64.checked.mch +3.49%
libraries.crossgen2.windows.x64.checked.mch +4.13%
libraries.pmi.windows.x64.checked.mch +3.14%
libraries_tests.run.windows.x64.Release.mch +4.40%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +3.71%
realworld.run.windows.x64.checked.mch +5.71%
smoke_tests.nativeaot.windows.x64.checked.mch +5.01%
FullOpts (+1.63% to +2.18%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +1.68%
benchmarks.run.windows.x64.checked.mch +1.67%
benchmarks.run_pgo.windows.x64.checked.mch +1.66%
benchmarks.run_pgo_optrepeat.windows.x64.checked.mch +1.67%
coreclr_tests.run.windows.x64.checked.mch +1.84%
libraries.crossgen2.windows.x64.checked.mch +2.18%
libraries.pmi.windows.x64.checked.mch +1.82%
libraries_tests.run.windows.x64.Release.mch +1.63%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +1.68%
realworld.run.windows.x64.checked.mch +1.70%
smoke_tests.nativeaot.windows.x64.checked.mch +1.75%

Alternate test numbers from running superpmi directly and using the 'Total Time':

% Regression = ((Diff Mean - Base Mean) / Base Mean) * 100

[image: superpmi 'Total Time' comparison]

@DeepakRajendrakumaran changed the title from "Draft : Reduce TP regression in allocateRegistersMinimal()" to "Adding additional 8 eGPR." Apr 21, 2025
@DeepakRajendrakumaran DeepakRajendrakumaran marked this pull request as ready for review April 21, 2025 14:47
@Copilot Copilot AI review requested due to automatic review settings April 21, 2025 14:47
@Copilot (Copilot AI) left a comment

Copilot wasn't able to review any files in this pull request.

Files not reviewed (11)
  • src/coreclr/jit/compiler.cpp: Language not supported
  • src/coreclr/jit/emit.h: Language not supported
  • src/coreclr/jit/emitxarch.cpp: Language not supported
  • src/coreclr/jit/emitxarch.h: Language not supported
  • src/coreclr/jit/lsra.cpp: Language not supported
  • src/coreclr/jit/lsra.h: Language not supported
  • src/coreclr/jit/lsrabuild.cpp: Language not supported
  • src/coreclr/jit/register.h: Language not supported
  • src/coreclr/jit/target.h: Language not supported
  • src/coreclr/jit/targetamd64.h: Language not supported
  • src/coreclr/jit/unwindamd64.cpp: Language not supported

@DeepakRajendrakumaran (Contributor Author)

@BruceForstall @jakobbotsch @kunalspathak This PR is ready for review

@BruceForstall (Member)

Diffs

@kunalspathak (Member)

Diffs

@DeepakRajendrakumaran - can you double check why TP of arm64 got affected?

[image: arm64 TP diff]

@DeepakRajendrakumaran (Contributor Author) commented Apr 21, 2025

Diffs

@DeepakRajendrakumaran - can you double check why TP of arm64 got affected?

[image: arm64 TP diff]

This commit affects all targets, but the impact should be negligible for non-x64 targets - commit

It skips over the eGPRs for non-APX x64 machines when iterating over RegRecords. This can be extended to arm64 if there are registers that need to be skipped, with or without particular features; something similar will need to be added there - Change for x86

@DeepakRajendrakumaran (Contributor Author)

@kunalspathak This is how tpdiff looks with just the eGPRs and without the logic to skip registers

@BruceForstall (Member) left a comment

Some questions

@@ -239,6 +239,10 @@ typedef uint64_t regMaskSmall;
#define HAS_MORE_THAN_64_REGISTERS 1
#endif // TARGET_ARM64

#ifdef TARGET_AMD64
Member:

Merge with ARM64 case above?

Contributor Author:

Done

@@ -791,8 +791,10 @@ class emitter
// Note that we use the _idReg1 and _idReg2 fields to hold
// the live gcrefReg mask for the call instructions on x86/x64
//
#if !defined(TARGET_AMD64)
Member:

Why did you move this only for x64?

Member:

The "Space taken up to here" comment, below, is now wrong and needs to be updated, as do the "Space taken up to here" comments that follow.

Contributor Author:

Why did you move this only for x64?

It's a bit of a weird problem. REGNUM_BITS changing from 6 to 7 means _idReg1 and _idReg2 each take up an extra bit. This throws off the instrDesc size for x64, increasing it significantly due to padding. My solution was to move _idReg1 and _idReg2 to a position in the x64 layout that did not introduce as many padding bits.

Member:

I wonder why it matters where _idReg1/_idReg2 are placed to determine padding. All the fields here have a base type of unsigned, which means their bitfields should all be bit-packable. It might make sense for performance reasons to avoid splitting _idReg1/_idReg2 across 32-bit (or maybe even byte) boundaries, but that's not why you did it.

Contributor Author:

For reference:

  • In main with no changes: [image: instrDesc memory layout]

  • If I just add the new registers with no change in field order for x64: [image: instrDesc memory layout]

  • With the change you highlighted in the PR: [image: instrDesc memory layout]

@DeepakRajendrakumaran (Contributor Author), Apr 21, 2025:

If I leave it as it was, _idReg1 cannot stay within a 32-bit boundary, since it starts at Byte 3, Bit 2 (this was not a problem earlier, when it was 6 bits). So it gets moved to Byte 4, Bit 0.

Edit - that's my interpretation

Member:

Oh, I see: a bitfield can't extend beyond the limit of the base type (here, unsigned).

btw, is the tool you're using to show field offsets public?

@DeepakRajendrakumaran (Contributor Author), Apr 21, 2025:

The screenshot I used here is from Visual Studio. Click on "Memory Layout" from below:

[image: Visual Studio Memory Layout option]

The memory viewer in Visual Studio Code isn't as intuitive.

Contributor Author:

I have updated the "Space taken up to here" comments

@@ -1936,6 +1936,10 @@ void LinearScan::buildPhysRegRecords()
RegRecord* curr = &physRegs[reg];
curr->init(reg);
}
#if defined(TARGET_AMD64)
Member:

Doesn't this fall out naturally? Or are there cases where REG_NEXT(get_REG_INT_LAST()) != REG_FP_FIRST?

Should this be all-platform (no ifdef)?

@DeepakRajendrakumaran (Contributor Author), Apr 21, 2025:

This is an x64-specific issue. There are two cases:

  1. x64 with APX ON
    get_REG_INT_LAST() = REG_R31, so REG_NEXT(get_REG_INT_LAST()) = REG_FP_FIRST
  2. x64 with APX OFF
    get_REG_INT_LAST() = REG_R15, so REG_NEXT(get_REG_INT_LAST()) != REG_FP_FIRST.
    This is because REG_NEXT(get_REG_INT_LAST()) = REG_R16, and REG_R16 is defined but not valid on x64 targets without APX support.

Comment on lines 44 to 45
// TODO-apx: the definition here is incorrect, we will need to revisit this after we extend the register definition.
// for now, we can simply use REX2 as REX.
Member:

Is this comment still needed?

Contributor Author:

I have removed this comment.

@BruceForstall (Member) left a comment

This LGTM. Would like to hear @kunalspathak feedback on the LSRA changes.

@kunalspathak (Member) left a comment

Added a few questions/comments.

@@ -544,6 +545,7 @@ class RegRecord : public Referenceable
Interval* previousInterval;

regNumber regNum;
regNumber nextRegNum; // the next active register.
Member:

This adds an additional 288 bytes of memory per method compilation.

Member:

Did you take a stab at #112959 (comment) and see if that is cheap?

Contributor Author:

I did not try the regIndices method. I forgot about it. Is this what you had in mind?

Member:

Assuming you fix the ACTUAL_REG_COUNT in the else portion of isApxSupported, yes.

Contributor Author:

ACTUAL_REG_COUNT

I didn't change ACTUAL_REG_COUNT, because I don't think it can be used as the loop condition as presently written.

The loops look like the following

for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = REG_NEXT(reg))
{
..............
}

AVAILABLE_REG_COUNT can be either of following

  • ACTUAL_REG_COUNT

or

  • ACTUAL_REG_COUNT - (CNT_HIGHFLOAT + CNT_MASK_REGS)

I ended up not changing ACTUAL_REG_COUNT but rather modifying the loop

int index = 1;
for (regNumber reg = (regNumber)regIndices[0]; reg < AVAILABLE_REG_COUNT; reg = (regNumber)regIndices[index++])
{
    ........................
}

This allows me to keep using AVAILABLE_REG_COUNT

Alternative would be to update ACTUAL_REG_COUNT and use it directly in the for loops

Member:

Can you update it to be something like the following, so that you are not affecting non-xarch platforms? Also, you don't need a separate activeRegsCount in that case.

#ifdef TARGET_XARCH
#define NEXT_REGISTER(i) i++, reg = regIndices[i]
#else
#define NEXT_REGISTER(reg) REG_NEXT(reg)
#endif

for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = NEXT_REGISTER(reg)) { ... }

Contributor Author:

That doesn't work as is. We need to track both an index and a reg. The reason is that on x64, if I understand your proposal correctly, the following happens:

when reg = REG_R15 (value = 15),

reg = NEXT_REGISTER(reg) results in reg = REG_XMM0 (value = 32).

The next time we do NEXT_REGISTER(reg), we end up looking up regIndices[32], which points to REG_XMM17.

Would something like the following make sense??

// REGISTER_LOOKUP: a lookup table mapping a register index to an active register.
#ifdef TARGET_AMD64
#define REGISTER_LOOKUP(regIndex) regIndices[(unsigned)regIndex]
#else
#define REGISTER_LOOKUP(regIndex) regIndex
#endif
regNumber regIndex = REG_FIRST;
for (regNumber reg = REGISTER_LOOKUP(regIndex); reg < AVAILABLE_REG_COUNT; regIndex = REG_NEXT(regIndex), reg = REGISTER_LOOKUP(regIndex))
{
...................
}

I imagine for non-x64 targets, this will get optimized to

regNumber regIndex = REG_FIRST;
for (regNumber reg = regIndex ; reg < AVAILABLE_REG_COUNT; regIndex = REG_NEXT(regIndex), reg = regIndex)
{
...................
}

and then


for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg= REG_NEXT(reg))
{
...................
}

Member:

Ah yes, that's right. Maybe you can do something like this then?

#ifdef TARGET_XARCH
#define NEXT_REGISTER(reg, index) index++, reg = regIndices[index]
#else
#define NEXT_REGISTER(reg, index) REG_NEXT(reg)
#endif

int index = 0;

for (regNumber reg = REG_FIRST; reg < AVAILABLE_REG_COUNT; reg = NEXT_REGISTER(reg, index)) { ... }

Contributor Author:

If I do that, reg = NEXT_REGISTER(reg, index) will resolve to the following for x64:

reg = index++, reg = regIndices[index]

Which is why I did the following

Definition and usage

Contributor Author:

@kunalspathak I have made the requested changes

@BruceForstall BruceForstall added the apx Related to the Intel Advanced Performance Extensions (APX) label Apr 28, 2025
@kunalspathak (Member)

/azp run runtime-coreclr jitstressregs, runtime-coreclr libraries-jitstressregs


Azure Pipelines successfully started running 2 pipeline(s).

@kunalspathak (Member)

Just to note down the latest TP numbers:

[image: latest TP numbers]

@kunalspathak (Member)

@DeepakRajendrakumaran - I know you have done a lot of work to improve TP. I am wondering: did you perform an analysis of the remaining regressions, and can you share where the cost is coming from?

@kunalspathak (Member)

/azp run runtime-coreclr libraries-jitstress2-jitstressregs


Azure Pipelines successfully started running 1 pipeline(s).

@DeepakRajendrakumaran (Contributor Author) commented Apr 30, 2025

@DeepakRajendrakumaran - I know you have done lot of work to improve TP. I am wondering did you perform analysis of the remaining regressions and can share where the cost is coming from?

I ran this for libraries_tests.run as a sample. The tpdiff for libraries_tests.run locally is as follows:

Overall (+2.12%)
Collection PDIFF
libraries_tests.run.windows.x64.Release.mch +2.12%
MinOpts (+4.46%)
Collection PDIFF
libraries_tests.run.windows.x64.Release.mch +4.46%
FullOpts (+1.58%)
Collection PDIFF
libraries_tests.run.windows.x64.Release.mch +1.58%

[image: breakdown of where the remaining TP cost comes from]

I did try some changes separating out the registers to free in allocateRegisters() and allocateRegistersMinimal(). Handling delay registers made it a little tricky. It could be something someone circles back to later.

Labels
apx Related to the Intel Advanced Performance Extensions (APX) area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI community-contribution Indicates that the PR has been added by a community member
3 participants