Description
Overview
One of the main features of Intel Advanced Performance Extensions (APX) is the addition of 16 general-purpose registers (GPR) R16-R31, referred to as Extended GPRs (EGPR). This is expected to reduce register spills and further optimize register allocation and by extension code generation for 64-bit x86 for all Intel CPUs (client and server).
Current Design
Enabling this in the .NET JIT comes with some challenges. Before APX on x64, there were 56 available registers (16 GPR, 32 SIMD, 8 Mask). The register mask is represented as a struct with a single uint64. Once we add the additional 16 EGPRs, this number will go up to 72 (32 GPR, 32 SIMD, 8 Mask), which no longer fits in a single uint64.
The only target in .NET that currently supports more than 64 registers is ARM64. There, the register mask is represented as a struct with 2 uint64s to facilitate this, which comes at a throughput (TP) cost. The initial compromise was to add just 8 EGPRs (R16 - R23) so that we do not incur the TP overhead. As such, there are currently 64 available registers on x64 (24 GPR, 32 SIMD, 8 Mask). Eventually we want to add the remaining 8 EGPRs as well.
The goal here is to optimize the register allocator so that the TP impact of handling more than 64 registers in the register mask is mitigated when the remaining 8 EGPRs are added.
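The register-count arithmetic above can be checked with a small sketch. The helper name `WordsNeeded` is hypothetical and is not part of the JIT; it only assumes one mask bit per register.

```cpp
#include <cstdint>

// Hypothetical helper for illustration only; not a JIT API.
// Returns how many 64-bit words are needed to hold one bit per register.
constexpr unsigned WordsNeeded(unsigned gpr, unsigned simd, unsigned mask)
{
    return (gpr + simd + mask + 63) / 64;
}

// Register counts quoted in the text above.
static_assert(WordsNeeded(16, 32, 8) == 1, "pre-APX x64: 56 registers fit in one uint64");
static_assert(WordsNeeded(24, 32, 8) == 1, "8 EGPRs: exactly 64 registers still fit");
static_assert(WordsNeeded(32, 32, 8) == 2, "16 EGPRs: 72 registers need a second uint64");
```

The 24-GPR configuration uses all 64 bits, so any further register pushes the mask into a second word.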
TP Analysis
Superpmi tpdiff
The current mechanism for handling more than 64 registers in .NET is to define HAS_MORE_THAN_64_REGISTERS in the JIT (example showing this in ARM64). I'm able to estimate the TP regression on x64 incurred due to regMaskTP by adding the following:
#ifdef TARGET_AMD64
#define HAS_MORE_THAN_64_REGISTERS 1
#endif // TARGET_AMD64
With this I see the following results:
Overall (+2.35% to +3.91%)
Collection | PDIFF |
---|---|
aspnet.run.windows.x64.checked.mch | +3.31% |
benchmarks.run.windows.x64.checked.mch | +2.52% |
benchmarks.run_pgo.windows.x64.checked.mch | +3.05% |
benchmarks.run_tiered.windows.x64.checked.mch | +3.87% |
coreclr_tests.run.windows.x64.checked.mch | +3.91% |
libraries.crossgen2.windows.x64.checked.mch | +2.83% |
libraries.pmi.windows.x64.checked.mch | +2.78% |
libraries_tests.run.windows.x64.Release.mch | +3.71% |
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +2.82% |
realworld.run.windows.x64.checked.mch | +2.72% |
smoke_tests.nativeaot.windows.x64.checked.mch | +2.35% |
MinOpts (+4.88% to +9.02%)
Collection | PDIFF |
---|---|
aspnet.run.windows.x64.checked.mch | +6.73% |
benchmarks.run.windows.x64.checked.mch | +6.17% |
benchmarks.run_pgo.windows.x64.checked.mch | +6.65% |
benchmarks.run_tiered.windows.x64.checked.mch | +6.30% |
coreclr_tests.run.windows.x64.checked.mch | +5.21% |
libraries.crossgen2.windows.x64.checked.mch | +4.88% |
libraries.pmi.windows.x64.checked.mch | +5.96% |
libraries_tests.run.windows.x64.Release.mch | +7.10% |
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +5.96% |
realworld.run.windows.x64.checked.mch | +9.02% |
smoke_tests.nativeaot.windows.x64.checked.mch | +5.95% |
FullOpts (+2.25% to +2.97%)
Collection | PDIFF |
---|---|
aspnet.run.windows.x64.checked.mch | +2.66% |
benchmarks.run.windows.x64.checked.mch | +2.52% |
benchmarks.run_pgo.windows.x64.checked.mch | +2.44% |
benchmarks.run_tiered.windows.x64.checked.mch | +2.25% |
coreclr_tests.run.windows.x64.checked.mch | +2.97% |
libraries.crossgen2.windows.x64.checked.mch | +2.83% |
libraries.pmi.windows.x64.checked.mch | +2.78% |
libraries_tests.run.windows.x64.Release.mch | +2.59% |
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | +2.75% |
realworld.run.windows.x64.checked.mch | +2.70% |
smoke_tests.nativeaot.windows.x64.checked.mch | +2.35% |
Details
All contexts:
Collection | Base # instructions | Diff # instructions | PDIFF |
---|---|---|---|
aspnet.run.windows.x64.checked.mch | 161,064,735,080 | 166,397,720,350 | +3.31% |
benchmarks.run.windows.x64.checked.mch | 41,308,501,433 | 42,350,805,373 | +2.52% |
benchmarks.run_pgo.windows.x64.checked.mch | 98,381,975,916 | 101,387,423,580 | +3.05% |
benchmarks.run_tiered.windows.x64.checked.mch | 26,004,261,679 | 27,011,264,368 | +3.87% |
coreclr_tests.run.windows.x64.checked.mch | 817,326,244,833 | 849,254,654,218 | +3.91% |
libraries.crossgen2.windows.x64.checked.mch | 137,441,277,438 | 141,329,167,083 | +2.83% |
libraries.pmi.windows.x64.checked.mch | 248,718,642,279 | 255,626,045,706 | +2.78% |
libraries_tests.run.windows.x64.Release.mch | 812,717,459,383 | 842,840,166,352 | +3.71% |
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 584,407,937,427 | 600,860,238,055 | +2.82% |
realworld.run.windows.x64.checked.mch | 51,685,592,215 | 53,090,744,991 | +2.72% |
smoke_tests.nativeaot.windows.x64.checked.mch | 17,431,036,470 | 17,840,104,275 | +2.35% |
MinOpts contexts:
Collection | Base # instructions | Diff # instructions | PDIFF |
---|---|---|---|
aspnet.run.windows.x64.checked.mch | 25,843,634,446 | 27,583,358,307 | +6.73% |
benchmarks.run.windows.x64.checked.mch | 590,254 | 626,661 | +6.17% |
benchmarks.run_pgo.windows.x64.checked.mch | 14,277,307,515 | 15,226,767,812 | +6.65% |
benchmarks.run_tiered.windows.x64.checked.mch | 10,389,908,913 | 11,044,988,695 | +6.30% |
coreclr_tests.run.windows.x64.checked.mch | 341,564,902,293 | 359,346,127,943 | +5.21% |
libraries.crossgen2.windows.x64.checked.mch | 2,188,415 | 2,295,171 | +4.88% |
libraries.pmi.windows.x64.checked.mch | 127,473,141 | 135,070,097 | +5.96% |
libraries_tests.run.windows.x64.Release.mch | 200,860,048,909 | 215,123,894,476 | +7.10% |
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 12,038,310,614 | 12,755,413,292 | +5.96% |
realworld.run.windows.x64.checked.mch | 171,219,702 | 186,657,985 | +9.02% |
smoke_tests.nativeaot.windows.x64.checked.mch | 1,328,254 | 1,407,320 | +5.95% |
FullOpts contexts:
Collection | Base # instructions | Diff # instructions | PDIFF |
---|---|---|---|
aspnet.run.windows.x64.checked.mch | 135,221,100,634 | 138,814,362,043 | +2.66% |
benchmarks.run.windows.x64.checked.mch | 41,307,911,179 | 42,350,178,712 | +2.52% |
benchmarks.run_pgo.windows.x64.checked.mch | 84,104,668,401 | 86,160,655,768 | +2.44% |
benchmarks.run_tiered.windows.x64.checked.mch | 15,614,352,766 | 15,966,275,673 | +2.25% |
coreclr_tests.run.windows.x64.checked.mch | 475,761,342,540 | 489,908,526,275 | +2.97% |
libraries.crossgen2.windows.x64.checked.mch | 137,439,089,023 | 141,326,871,912 | +2.83% |
libraries.pmi.windows.x64.checked.mch | 248,591,169,138 | 255,490,975,609 | +2.78% |
libraries_tests.run.windows.x64.Release.mch | 611,857,410,474 | 627,716,271,876 | +2.59% |
libraries_tests_no_tiered_compilation.run.windows.x64.Release.mch | 572,369,626,813 | 588,104,824,763 | +2.75% |
realworld.run.windows.x64.checked.mch | 51,514,372,513 | 52,904,087,006 | +2.70% |
smoke_tests.nativeaot.windows.x64.checked.mch | 17,429,708,216 | 17,838,696,955 | +2.35% |
Regression Analysis
As a preliminary analysis to identify the biggest culprits, I collected TP data for libraries_tests.run.windows.x64.checked.mch using pin.exe. The methods showing a regression, and their contribution %, are listed below.
libraries_tests.run.windows.x64
Method | InsCountDiff | InsPercentageDiff | ContributionPercentage |
---|---|---|---|
processKills@LinearScan | 5849967774 | 114.36% | 17.20% |
processBlockStartLocations@LinearScan | 3759506797 | 50.55% | 11.05% |
allocateRegistersMinimal@LinearScan | 3524107314 | 33.71% | 10.36% |
allocateRegisters@LinearScan | 2453448194 | 23.65% | 7.21% |
freeRegisters@LinearScan | 1676484336 | 62.85% | 4.93% |
genConsumeReg@CodeGen | 1643309544 | 43.33% | 4.83% |
mergeRegisterPreferences@Interval | 1450113208 | 2682.40% | 4.26% |
gcMarkRegPtrVal@ | 1257681068 | 167.54% | 3.70% |
select@RegisterSelection@LinearScan | 1061801798 | 10.19% | 3.12% |
genCodeForBBlist@CodeGen | 819452435 | 12.69% | 2.41% |
assignPhysReg@LinearScan | 758774971 | 41.88% | 2.23% |
buildKillPositionsForNode@LinearScan | 715683825 | 69.42% | 2.10% |
emitGCregDeadUpd@emitter | 631123594 | 105.72% | 1.86% |
updateAssignedInterval@LinearScan | 553580125 | 24.24% | 1.63% |
InsCountDiff = DiffInsCount - BaseInsCount
InsPercentageDiff = ((DiffInsCount - BaseInsCount) / BaseInsCount) * 100
ContributionPercentage = (InsCountDiff * 100) / TotalAbsInsCountDiff
where TotalAbsInsCountDiff = cumulative absolute instruction count difference across all methods
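As a worked illustration of the formulas above, here is a minimal sketch. The function names mirror the column names and are hypothetical helpers, not part of SuperPMI or the JIT.

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; names mirror the formulas above.
inline int64_t InsCountDiff(int64_t baseInsCount, int64_t diffInsCount)
{
    return diffInsCount - baseInsCount;
}

// Percentage change in instruction count relative to the base.
inline double InsPercentageDiff(int64_t baseInsCount, int64_t diffInsCount)
{
    return static_cast<double>(InsCountDiff(baseInsCount, diffInsCount)) * 100.0 /
           static_cast<double>(baseInsCount);
}

// Share of the total absolute instruction-count difference
// that is attributable to one method.
inline double ContributionPercentage(int64_t insCountDiff, int64_t totalAbsInsCountDiff)
{
    return static_cast<double>(insCountDiff) * 100.0 /
           static_cast<double>(totalAbsInsCountDiff);
}
```

Applied to the aspnet row of the "All contexts" table (base 161,064,735,080, diff 166,397,720,350), InsPercentageDiff evaluates to roughly 3.31, which agrees with the PDIFF column.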
The main reason for the regression is that the cost of operations on regMaskTP increases.
Consider checking if regMaskTP is empty. This is a relatively simple operation and is as follows:
bool IsEmpty() const
{
#ifdef HAS_MORE_THAN_64_REGISTERS
    return (low | high) == RBM_NONE;
#else
    return low == RBM_NONE;
#endif
}
With HAS_MORE_THAN_64_REGISTERS defined, the cost of this goes up. The increase can be larger for other methods that also introduce branches:
// ----------------------------------------------------------
// AddRegNumInMask: Adds `reg` to the mask.
//
void regMaskTP::AddRegNumInMask(regNumber reg)
{
    SingleTypeRegSet value = genSingleTypeRegMask(reg);
#ifdef HAS_MORE_THAN_64_REGISTERS
    if (reg < 64)
    {
        low |= value;
    }
    else
    {
        high |= value;
    }
#else
    low |= value;
#endif
}
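To make the extra work concrete, here is a self-contained sketch of a two-word mask. `RegMask2` and its bit layout (register `r` maps to bit `r % 64` of word `r / 64`) are assumptions for illustration and do not reflect the actual regMaskTP layout. The branch-free `AddReg` is only one possible way to avoid the comparison; the source does not propose it.

```cpp
#include <cstdint>

// Hypothetical two-word register mask; NOT the JIT's regMaskTP.
// Assumption: register number r (0..127) maps to bit (r % 64) of word (r / 64).
struct RegMask2
{
    uint64_t words[2] = {0, 0};

    // Needs one more load and an OR compared with a single-word mask.
    bool IsEmpty() const
    {
        return (words[0] | words[1]) == 0;
    }

    // Branching form, mirroring the shape of the snippet above.
    void AddRegBranchy(unsigned reg)
    {
        const uint64_t bit = uint64_t{1} << (reg & 63);
        if (reg < 64)
        {
            words[0] |= bit;
        }
        else
        {
            words[1] |= bit;
        }
    }

    // Branch-free form: select the word by index instead of by comparison.
    void AddReg(unsigned reg)
    {
        words[reg >> 6] |= uint64_t{1} << (reg & 63);
    }

    bool IsRegSet(unsigned reg) const
    {
        return ((words[reg >> 6] >> (reg & 63)) & 1) != 0;
    }
};
```

Both `AddRegBranchy` and `AddReg` produce the same mask; the difference is whether the word selection costs a compare-and-branch or an index computation. Which is cheaper in the JIT would have to be measured.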
Goal
We have 2 data points for the TP regression due to the addition of registers:
- Adding predicate registers for ARM64
While it's hard to know the exact total TP impact of adding more than 64 registers on ARM64, since this was merged via a number of commits, we do have some numbers based on the PR where this is enabled (link). The TP hit can be estimated to be between 2.77% and 4.46% overall. This is in line with what we are seeing for x64.
- Adding high SIMD registers for Avx512
The TP regression incurred when we added the Avx512 SIMD registers (link) was around 0.5%.
Adding EGPRs is going to be more expensive than adding the Avx512 SIMD registers, because it exceeds the 64 register limit. Limiting the overall TP regression to under 2%, and running alternate benchmarks to prove limited regressions (similar to the benchmarking done during Kunal's experiments here), should allow us to add the additional registers.
Tasks
LSRA Improvements
Possible long term tasks
- Segregate gpr/float/predicate register usage
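One possible reading of this task, sketched under loud assumptions: keep a separate mask per register class so that each operation touches a single machine word. `RegClass` and `SegregatedRegMask` are hypothetical names and this layout is not taken from the JIT or from any existing proposal.

```cpp
#include <cstdint>

// Hypothetical register classes and per-class masks; an assumption about
// what segregating register usage could look like, not the JIT's design.
enum class RegClass : unsigned
{
    Gpr       = 0,
    Float     = 1,
    Predicate = 2
};

struct SegregatedRegMask
{
    // With APX, 32 GPRs, 32 SIMD registers and 8 mask registers
    // each fit in a 32-bit word of their own.
    uint32_t perClass[3] = {0, 0, 0};

    // indexInClass is the register's position within its class (0..31).
    void Add(RegClass cls, unsigned indexInClass)
    {
        perClass[static_cast<unsigned>(cls)] |= uint32_t{1} << indexInClass;
    }

    bool IsEmpty(RegClass cls) const
    {
        return perClass[static_cast<unsigned>(cls)] == 0;
    }
};
```

Under this assumption, code that already knows which register class it is working with never has to combine or branch over multiple words.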