Improve LSRA to handle more than 64 registers for x64 #112959

Open
@DeepakRajendrakumaran

Overview

One of the main features of Intel Advanced Performance Extensions (APX) is the addition of 16 general-purpose registers (GPR) R16-R31, referred to as Extended GPRs (EGPR). This is expected to reduce register spills and further optimize register allocation and by extension code generation for 64-bit x86 for all Intel CPUs (client and server).

Current Design

Enabling this in the .NET JIT comes with some challenges. Before APX, x64 had 56 available registers (16 GPR, 32 SIMD, 8 Mask), and the register mask is represented as a struct with a single uint64. Once we add the 16 additional EGPRs, this number goes up to 72 (32 GPR, 32 SIMD, 8 Mask), which no longer fits in 64 bits.

The only target in .NET that currently supports more than 64 registers is ARM64, where the register mask is represented as a struct with two uint64 fields. This comes at a throughput (TP) cost. To avoid incurring that TP overhead on x64, the initial compromise was to add just 8 EGPRs (R16 - R23). As such, there are currently 64 available registers on x64 (24 GPR, 32 SIMD, 8 Mask). Eventually we want to add the remaining 8 EGPRs as well.
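The register counts above can be checked with a few compile-time assertions. This is only an illustrative sketch: `totalRegs`, `maskWords`, `kSimdRegs` and `kMaskRegs` are hypothetical names, not JIT identifiers, and the counts are the ones quoted in the text.

```cpp
// Register counts quoted in the issue text; all names here are illustrative.
constexpr unsigned kSimdRegs = 32;
constexpr unsigned kMaskRegs = 8;

// Total allocatable registers for a given number of general-purpose registers.
constexpr unsigned totalRegs(unsigned gprCount)
{
    return gprCount + kSimdRegs + kMaskRegs;
}

// Number of 64-bit words needed if every register gets its own bit.
constexpr unsigned maskWords(unsigned regCount)
{
    return (regCount + 63) / 64;
}

static_assert(totalRegs(16) == 56, "x64 before APX");
static_assert(totalRegs(24) == 64, "x64 with R16-R23 only");
static_assert(totalRegs(32) == 72, "x64 with all 16 EGPRs");
static_assert(maskWords(totalRegs(24)) == 1, "64 registers still fit in one uint64");
static_assert(maskWords(totalRegs(32)) == 2, "72 registers need a second word");
```

With 24 GPRs the mask is exactly full, which is why adding even one more register forces the second word.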

The goal here is to optimize the register allocator so that the TP impact of handling more than 64 registers in the register mask is mitigated when the remaining 8 EGPRs are added.

TP Analysis

Superpmi tpdiff

The current mechanism for handling more than 64 registers in .NET is to define HAS_MORE_THAN_64_REGISTERS in the JIT (example showing this in ARM64). I was able to estimate the TP regression that regMaskTP would incur on x64 by adding the following:

#ifdef TARGET_AMD64
#define HAS_MORE_THAN_64_REGISTERS 1
#endif // TARGET_AMD64

With this I see the following results:

Overall (+2.35% to +3.91%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +3.31%
benchmarks.run.windows.x64.checked.mch +2.52%
benchmarks.run_pgo.windows.x64.checked.mch +3.05%
benchmarks.run_tiered.windows.x64.checked.mch +3.87%
coreclr_tests.run.windows.x64.checked.mch +3.91%
libraries.crossgen2.windows.x64.checked.mch +2.83%
libraries.pmi.windows.x64.checked.mch +2.78%
libraries_tests.run.windows.x64.Release.mch +3.71%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +2.82%
realworld.run.windows.x64.checked.mch +2.72%
smoke_tests.nativeaot.windows.x64.checked.mch +2.35%
MinOpts (+4.88% to +9.02%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +6.73%
benchmarks.run.windows.x64.checked.mch +6.17%
benchmarks.run_pgo.windows.x64.checked.mch +6.65%
benchmarks.run_tiered.windows.x64.checked.mch +6.30%
coreclr_tests.run.windows.x64.checked.mch +5.21%
libraries.crossgen2.windows.x64.checked.mch +4.88%
libraries.pmi.windows.x64.checked.mch +5.96%
libraries_tests.run.windows.x64.Release.mch +7.10%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +5.96%
realworld.run.windows.x64.checked.mch +9.02%
smoke_tests.nativeaot.windows.x64.checked.mch +5.95%
FullOpts (+2.25% to +2.97%)
Collection PDIFF
aspnet.run.windows.x64.checked.mch +2.66%
benchmarks.run.windows.x64.checked.mch +2.52%
benchmarks.run_pgo.windows.x64.checked.mch +2.44%
benchmarks.run_tiered.windows.x64.checked.mch +2.25%
coreclr_tests.run.windows.x64.checked.mch +2.97%
libraries.crossgen2.windows.x64.checked.mch +2.83%
libraries.pmi.windows.x64.checked.mch +2.78%
libraries_tests.run.windows.x64.Release.mch +2.59%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch +2.75%
realworld.run.windows.x64.checked.mch +2.70%
smoke_tests.nativeaot.windows.x64.checked.mch +2.35%
Details

All contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 161,064,735,080 166,397,720,350 +3.31%
benchmarks.run.windows.x64.checked.mch 41,308,501,433 42,350,805,373 +2.52%
benchmarks.run_pgo.windows.x64.checked.mch 98,381,975,916 101,387,423,580 +3.05%
benchmarks.run_tiered.windows.x64.checked.mch 26,004,261,679 27,011,264,368 +3.87%
coreclr_tests.run.windows.x64.checked.mch 817,326,244,833 849,254,654,218 +3.91%
libraries.crossgen2.windows.x64.checked.mch 137,441,277,438 141,329,167,083 +2.83%
libraries.pmi.windows.x64.checked.mch 248,718,642,279 255,626,045,706 +2.78%
libraries_tests.run.windows.x64.Release.mch 812,717,459,383 842,840,166,352 +3.71%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 584,407,937,427 600,860,238,055 +2.82%
realworld.run.windows.x64.checked.mch 51,685,592,215 53,090,744,991 +2.72%
smoke_tests.nativeaot.windows.x64.checked.mch 17,431,036,470 17,840,104,275 +2.35%

MinOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 25,843,634,446 27,583,358,307 +6.73%
benchmarks.run.windows.x64.checked.mch 590,254 626,661 +6.17%
benchmarks.run_pgo.windows.x64.checked.mch 14,277,307,515 15,226,767,812 +6.65%
benchmarks.run_tiered.windows.x64.checked.mch 10,389,908,913 11,044,988,695 +6.30%
coreclr_tests.run.windows.x64.checked.mch 341,564,902,293 359,346,127,943 +5.21%
libraries.crossgen2.windows.x64.checked.mch 2,188,415 2,295,171 +4.88%
libraries.pmi.windows.x64.checked.mch 127,473,141 135,070,097 +5.96%
libraries_tests.run.windows.x64.Release.mch 200,860,048,909 215,123,894,476 +7.10%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 12,038,310,614 12,755,413,292 +5.96%
realworld.run.windows.x64.checked.mch 171,219,702 186,657,985 +9.02%
smoke_tests.nativeaot.windows.x64.checked.mch 1,328,254 1,407,320 +5.95%

FullOpts contexts:

Collection Base # instructions Diff # instructions PDIFF
aspnet.run.windows.x64.checked.mch 135,221,100,634 138,814,362,043 +2.66%
benchmarks.run.windows.x64.checked.mch 41,307,911,179 42,350,178,712 +2.52%
benchmarks.run_pgo.windows.x64.checked.mch 84,104,668,401 86,160,655,768 +2.44%
benchmarks.run_tiered.windows.x64.checked.mch 15,614,352,766 15,966,275,673 +2.25%
coreclr_tests.run.windows.x64.checked.mch 475,761,342,540 489,908,526,275 +2.97%
libraries.crossgen2.windows.x64.checked.mch 137,439,089,023 141,326,871,912 +2.83%
libraries.pmi.windows.x64.checked.mch 248,591,169,138 255,490,975,609 +2.78%
libraries_tests.run.windows.x64.Release.mch 611,857,410,474 627,716,271,876 +2.59%
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch 572,369,626,813 588,104,824,763 +2.75%
realworld.run.windows.x64.checked.mch 51,514,372,513 52,904,087,006 +2.70%
smoke_tests.nativeaot.windows.x64.checked.mch 17,429,708,216 17,838,696,955 +2.35%

Regression Analysis

As a preliminary analysis to identify the biggest culprits, I collected TP data for libraries_tests.run.windows.x64.checked.mch using pin.exe. The methods showing the largest regressions, and their contribution percentages, are listed below:

libraries_tests.run.windows.x64

Method InsCountDiff InsPercentageDiff ContributionPercentage
processKills@LinearScan 5849967774 114.36% 17.20%
processBlockStartLocations@LinearScan 3759506797 50.55% 11.05%
allocateRegistersMinimal@LinearScan 3524107314 33.71% 10.36%
allocateRegisters@LinearScan 2453448194 23.65% 7.21%
freeRegisters@LinearScan 1676484336 62.85% 4.93%
genConsumeReg@CodeGen 1643309544 43.33% 4.83%
mergeRegisterPreferences@Interval 1450113208 2682.40% 4.26%
gcMarkRegPtrVal@ 1257681068 167.54% 3.70%
select@RegisterSelection@LinearScan 1061801798 10.19% 3.12%
genCodeForBBlist@CodeGen 819452435 12.69% 2.41%
assignPhysReg@LinearScan 758774971 41.88% 2.23%
buildKillPositionsForNode@LinearScan 715683825 69.42% 2.10%
emitGCregDeadUpd@emitter 631123594 105.72% 1.86%
updateAssignedInterval@LinearScan 553580125 24.24% 1.63%

InsCountDiff = DiffInsCount - BaseInsCount
InsPercentageDiff = ((DiffInsCount - BaseInsCount) / BaseInsCount) * 100
ContributionPercentage = (InsCountDiff * 100) / TotalAbsInsCountDiff
where TotalAbsInsCountDiff = cumulative instruction count for diff

The main reason for the regression is that the cost of operations on regMaskTP increases.

Consider checking whether a regMaskTP is empty. This is a relatively simple operation:

    bool IsEmpty() const
    {
#ifdef HAS_MORE_THAN_64_REGISTERS
        return (low | high) == RBM_NONE;
#else
        return low == RBM_NONE;
#endif
    }

With HAS_MORE_THAN_64_REGISTERS defined, this check has to combine two words instead of testing one, so its cost goes up. The increase can be larger for other methods that also introduce a branch, for example:

// ----------------------------------------------------------
//  AddRegNumInMask: Adds `reg` to the mask.
//
void regMaskTP::AddRegNumInMask(regNumber reg)
{
    SingleTypeRegSet value = genSingleTypeRegMask(reg);
#ifdef HAS_MORE_THAN_64_REGISTERS
    if (reg < 64)
    {
        low |= value;
    }
    else
    {
        high |= value;
    }
#else
    low |= value;
#endif
}
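To make the extra work concrete, here is a self-contained toy version of a two-word mask. It is a hypothetical stand-in, not the JIT's regMaskTP: `ToyRegMask` and its members are invented for illustration, and register numbers are plain unsigned integers rather than `regNumber`.

```cpp
#include <cstdint>

// Toy two-word register mask (illustrative only; not the JIT's regMaskTP).
struct ToyRegMask
{
    std::uint64_t low  = 0; // bits for registers 0..63
    std::uint64_t high = 0; // bits for registers 64..127

    // Two loads and an OR, versus a single compare in the one-word case.
    bool IsEmpty() const
    {
        return (low | high) == 0;
    }

    // The word to update depends on the register number, which adds a branch.
    // Assumes reg < 128.
    void AddReg(unsigned reg)
    {
        const std::uint64_t bit = std::uint64_t{1} << (reg & 63);
        if (reg < 64)
        {
            low |= bit;
        }
        else
        {
            high |= bit;
        }
    }

    // Assumes reg < 128.
    bool Contains(unsigned reg) const
    {
        const std::uint64_t bit = std::uint64_t{1} << (reg & 63);
        return ((reg < 64 ? low : high) & bit) != 0;
    }
};
```

Adding register 70 to an empty `ToyRegMask` sets bit 6 of `high` and leaves `low` untouched. Each such operation is cheap on its own, but these helpers are called very frequently during allocation, which is consistent with the per-method regressions listed earlier.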

Goal

We have two data points for the TP regression caused by adding registers:

  • Adding predicate registers for ARM64

While it's hard to know the exact total TP impact of supporting more than 64 registers on ARM64, since this was merged via a number of commits, we do have some numbers based on the PR where this is enabled (link). The TP hit can be estimated at between 2.77% and 4.46% overall, which is in line with what we are seeing for x64.

  • Adding high SIMD registers for Avx512

The TP regression incurred when we added the Avx512 SIMD registers (link) was around 0.5%.

Adding the eGPRs is going to be more expensive than adding the Avx512 SIMD registers, because it exceeds the 64-register limit. Limiting the overall TP regression to under 2%, and running alternate benchmarks to show that regressions are limited (similar to the benchmarking done during Kunal's experiments here), should allow us to add the additional registers.

Tasks

LSRA Improvements

Possible long-term tasks

  • Segregate the usage of GPR/float/predicate registers

Metadata

Labels

apx: Related to the Intel Advanced Performance Extensions (APX)
area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
