Skip to content

Make Number.FormatBigInteger JIT-firendly #113165

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

kzrnm
Copy link
Contributor

@kzrnm kzrnm commented Mar 5, 2025

Fixes #111708

After MemoryMarshal was moved from System.Memory to System.Runtime in #106360, crossgen is properly applied to Number.FormatBigInteger. However, the loop logic in FormatBigInteger is significantly slow in tier-0. For more information #111708 (comment)

Discussions about applying AO have taken place in issues like #71261, #79645, #109087 (comment), etc., but these were mainly about frequently called methods like SpanHelpers. Given this, I don't think applying AO to BigInteger would cause any startup performance issues.

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Mar 5, 2025
@EgorBo
Copy link
Member

EgorBo commented Mar 5, 2025

Let's first check if that was the issue:

@EgorBo
Copy link
Member

EgorBo commented Mar 5, 2025

@EgorBot -amd -arm -windows_intel

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Numerics;

BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class Tests
{
    private char[] _dest = new char[1_000_000];

    private BigInteger _value =
        BigInteger.Pow(BigInteger.Parse(string.Concat(Enumerable.Repeat("1234567890", 30))), 1_000);

    [Benchmark(Baseline = true)]
    public string Stringify() => _value.ToString();

    [Benchmark]
    public bool TryFormat() => _value.TryFormat(_dest, out _);
}

@EgorBo
Copy link
Member

EgorBo commented Mar 5, 2025

@kzrnm While my bot is likely broken for Linux atm (fixing it now), the Windows-x64 job couldn't confirm your theory - EgorBot/runtime-utils#310 (comment) the numbers for before/after are the same.

@kzrnm
Copy link
Contributor Author

kzrnm commented Mar 5, 2025

@EgorBo

When running build.cmd/build.sh, crossgen is properly applied to only System.Private.CoreLib.In other words, libraries generate ReadyToRun DLLs through PublishReadyToRun, so we need to check the assemblies produced by publish.

https://github.com/dotnet/runtime/blob/7dd74fc403dda02fdccc541b306b8267c9a22aa9/src/coreclr/crossgen-corelib.proj

Since I wasn't sure how dotnet/runtime handles publishing, I ran crossgen2.exe directly and used the ReadyToRun assemblies for verification.

The System.Runtime.Numerics.dll output in artifacts is around 130 KB, whereas the one in the actual runtime is 350 KB.

Get-ChildItem $root\runtime\artifacts\bin\System.Runtime.Numerics\Release\net10.0\System.Runtime.Numerics.dll, 'C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.2\System.Runtime.Numerics.dll'

    Directory: D:\repository\dotnet\runtime\artifacts\bin\System.Runtime.Numerics\Release\net10.0

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---          2025/03/06     8:06         134656 System.Runtime.Numerics.dll

    Directory: C:\Program Files\dotnet\shared\Microsoft.NETCore.App\9.0.2

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---          2025/01/16    21:51         354600 System.Runtime.Numerics.dll

When inspecting the latter with ILSpy, it displays "This assembly contains unmanaged code."

@EgorBo
Copy link
Member

EgorBo commented Mar 6, 2025

Ah, I didn't realize BigInteger is not in corelib (although, there is an internal copy there).

Anyway, it doesn't sound like a correct solution to me. AO bypasses all Tier1 optimizations including Dynamic PGO. Today the only method in BCL that has it is AwaitUnsafeOnCompleted that is often on hot path and allocates a lot in tier0 (and we would love to remove AO on it one day - requires some work in the JIT)

@kzrnm
Copy link
Contributor Author

kzrnm commented Mar 6, 2025

@EgorBo I don't think the fact that AO isn't used elsewhere is a valid reason not to use it for BigInteger.

For example, SpanHelpers can be useful for lightweight applications, making R2R a suitable choice. However, BigInteger is purely a numerical computation type, and users are likely to prioritize faster calculations over a slight improvement in application startup time.

Additionally, the regression is occurring because Dynamic PGO and OSR are not being applied, causing execution to remain in Tier-0. While this might be an issue caused by JIT, considering the use case of BigInteger, I believe applying AO as a workaround is a reasonable way to fix the regression. It should not negatively impact users who do not use BigInteger, as no other methods are expected to call BigInteger.ToString().


there is an internal copy there

Are you referring to Number.BigInteger? It appears to be a limited-purpose type used for parsing floating-point number and does not implement ToString().

https://github.com/dotnet/runtime/blob/850c0ab4519b904a28f2d67abdaba1ac78c955ff/src/libraries/System.Private.CoreLib/src/System/Number.BigInteger.cs

@EgorBo
Copy link
Member

EgorBo commented Mar 6, 2025

I don't think the fact that AO isn't used elsewhere is a valid reason not to use it for BigInteger.

AO is a workaround, there must be a valid, well explained reason for it to apply. E.g in case of AwaitUnsafeOnCompleted it's blocked by this issue #90965 and removing it caused a massive impact on TE benchmarks' working set (see #90899)

Do we have an issue explaining the exact cause here?

Additionally, the regression is occurring because Dynamic PGO and OSR are not being applied

It should be eventually applied if it's hot enough - if it does not, it's a serious bug we need to investigate. If you need performance right at the start, you should consider disabling TieredCompilation completely or switch to NativeAOT. Generally, AO is expected to decrease the steady state performance - various "is type initialized" checks are typically not elided and no PGO is applied.

@kzrnm kzrnm force-pushed the BigIntegerRegression111708 branch from c769ae3 to 96b47bc Compare March 6, 2025 14:16
@kzrnm
Copy link
Contributor Author

kzrnm commented Mar 6, 2025

Further testing has yielded even better results. By extracting the inner loop into a separate method, the code achieves higher performance without needing AO.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25125.5
  [Host]   : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
  ShortRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

Method Toolchain N Mean Ratio Allocated Alloc Ratio
DecimalString \main\corerun.exe 100 154.5 ns 1.00 - NA
DecimalString \pr\corerun.exe 100 165.2 ns 1.07 - NA
DecimalString \withAO\corerun.exe 100 154.7 ns 1.00 - NA
DecimalString \main\corerun.exe 1000 10,173.4 ns 1.00 - NA
DecimalString \pr\corerun.exe 1000 9,039.6 ns 0.89 - NA
DecimalString \withAO\corerun.exe 1000 9,936.7 ns 0.98 - NA
DecimalString \main\corerun.exe 10000 1,037,805.7 ns 1.00 - NA
DecimalString \pr\corerun.exe 10000 907,048.1 ns 0.87 - NA
DecimalString \withAO\corerun.exe 10000 1,045,343.4 ns 1.01 - NA
DecimalString \main\corerun.exe 100000 209,918,922.2 ns 1.00 - NA
DecimalString \pr\corerun.exe 100000 91,287,872.2 ns 0.43 - NA
DecimalString \withAO\corerun.exe 100000 104,468,880.0 ns 0.50 - NA
DecimalString \main\corerun.exe 1000000 20,839,716,466.7 ns 1.00 - NA
DecimalString \pr\corerun.exe 1000000 9,223,145,333.3 ns 0.44 - NA
DecimalString \withAO\corerun.exe 1000000 10,461,054,700.0 ns 0.50 - NA
Benchmark code
using BenchmarkDotNet.Attributes;
using System.Numerics;

[DisassemblyDiagnoser]
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class ToStringTest
{
    char[] _dest = new char[1000010];
    [Params([
        100,
        1000,
        10000,
        100000,
        1000000,
    ])]
    public int N;

    BigInteger b;

    [GlobalSetup]
    public void Setup()
    {
        b = BigInteger.Parse(new string('9', N));
    }

    [Benchmark] public void DecimalString() => b.TryFormat(_dest, out _);
}

@EgorBo
Copy link
Member

EgorBo commented Mar 6, 2025

Job=ShortRun IterationCount=3 LaunchCount=1 WarmupCount=3

Maybe that is the reason? the fact that you're measuring a short run and it doesn't properly promote to the final tier?
What are the results with default BDN parameters or long run?

@kzrnm
Copy link
Contributor Author

kzrnm commented Mar 6, 2025

Hmm, when measured with the Default job, the main version with 100,000 digits became as fast as the AO version. However, with 1,000,000 digits, the execution time was too long for proper warm-up to take place.

Moreover, suggesting that "for use cases where a single execution of a massive computation takes several seconds, TiredCompilation should be disabled" is an extremely niche piece of knowledge. It is desirable for the program to perform efficiently even when processing a large input only once.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3194)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25125.5
  [Host]     : .NET 10.0.0 (10.0.25.12411), X64 RyuJIT AVX2
  Job-XVYGXJ : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-TFIFOI : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2
  Job-MKOOHF : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2


Method Toolchain N Mean Ratio Code Size Allocated Alloc Ratio
DecimalString \main\corerun.exe 100000 104.64 ms 1.00 5,262 B - NA
DecimalString \pr\corerun.exe 100000 91.53 ms 0.87 5,224 B - NA
DecimalString \withAO\corerun.exe 100000 104.82 ms 1.00 3,821 B - NA
DecimalString \main\corerun.exe 1000000 20,975.64 ms 1.00 274 B - NA
DecimalString \pr\corerun.exe 1000000 9,192.49 ms 0.44 274 B - NA
DecimalString \withAO\corerun.exe 1000000 10,493.09 ms 0.50 274 B - NA

@kzrnm kzrnm changed the title Add AggressiveOptimization to Number.FormatBigInteger Make Number.FormatBigInteger JIT-firendly Mar 6, 2025
@kzrnm
Copy link
Contributor Author

kzrnm commented Apr 12, 2025

I noticed that Tier1 performance has regressed. It looks like a different approach might be needed, so I'll go ahead and close this for now.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3775)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25167.3
  [Host]   : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2
  ShortRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

Method Toolchain N Mean Ratio Code Size Allocated Alloc Ratio
DecimalString \main\corerun.exe 100000 296.9 ms 1.00 274 B 536 B 1.00
DecimalString \pr\corerun.exe 100000 119.9 ms 0.40 309 B 214 B 0.40
DecimalString2 \main\corerun.exe 100000 162.2 ms 1.00 569 B 268 B 1.00
DecimalString2 \pr\corerun.exe 100000 137.1 ms 0.85 569 B 268 B 1.00
using BenchmarkDotNet.Attributes;
using System.Numerics;

[DisassemblyDiagnoser]
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class ToStringTest
{
    char[] _dest = new char[1000010];
    [Params([
        // 100,
        // 1000,
        // 10000,
        100000,
        // 1000000,
    ])]
    public int N;

    BigInteger b, s;

    [GlobalSetup]
    public void Setup()
    {
        s = BigInteger.Parse(new string('9', 1 << 8));
        b = BigInteger.Parse(new string('9', N));
    }

    [Benchmark] public void DecimalString() => b.TryFormat(_dest, out _);
    [Benchmark]
    public void DecimalString2()
    {
        for (int i = 0; i < 10000; i++)
        {
            s.TryFormat(_dest, out _);
        }
        b.TryFormat(_dest, out _);
    }
}

@kzrnm kzrnm closed this Apr 12, 2025
@kzrnm kzrnm reopened this Apr 12, 2025
@kzrnm
Copy link
Contributor Author

kzrnm commented Apr 12, 2025

@EgorBo
It seems that method calls were the cause of the slowdown.
By hard-coding the logic instead of using method calls, I could improve Tier0 performance without negatively impacting Tier1.


BenchmarkDotNet v0.14.0, Windows 11 (10.0.26100.3775)
13th Gen Intel Core i5-13500, 1 CPU, 20 logical and 14 physical cores
.NET SDK 10.0.100-preview.3.25167.3
  [Host]   : .NET 9.0.1 (9.0.124.61010), X64 RyuJIT AVX2
  ShortRun : .NET 10.0.0 (42.42.42.42424), X64 RyuJIT AVX2

Job=ShortRun  IterationCount=3  LaunchCount=1  
WarmupCount=3  

Method Toolchain N Mean Ratio Code Size Allocated Alloc Ratio
DecimalStringTier0 \main\corerun.exe 100000 277.4 ms 1.00 274 B 536 B 1.00
DecimalStringTier0 \pr\corerun.exe 100000 162.3 ms 0.59 309 B 196 B 0.37
DecimalStringTier1 \main\corerun.exe 100000 155.2 ms 1.00 569 B 268 B 1.00
DecimalStringTier1 \pr\corerun.exe 100000 148.9 ms 0.96 569 B 357 B 1.33
using BenchmarkDotNet.Attributes;
using System.Numerics;

[DisassemblyDiagnoser]
[MemoryDiagnoser(false)]
[HideColumns("Job", "Error", "StdDev", "Median", "RatioSD")]
public class ToStringTest
{
    char[] _dest = new char[1000010];
    [Params([
        100000,
    ])]
    public int N;

    BigInteger b, s;

    [GlobalSetup]
    public void Setup()
    {
        s = BigInteger.Parse(new string('9', 1 << 8));
        b = BigInteger.Parse(new string('9', N));
    }

    [Benchmark] public void DecimalStringTier0() => b.TryFormat(_dest, out _);
    [Benchmark]
    public void DecimalStringTier1()
    {
        for (int i = 0; i < 10000; i++)
        {
            s.TryFormat(_dest, out _);
        }
        b.TryFormat(_dest, out _);
    }
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-System.Numerics community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Perf] BigInteger formatting performance regression in .NET 9
3 participants