Improve System.Decimal performance for x64 platform #7778

Closed
@Daniel-Svensson

Description

The current implementation forwards calls to the Windows implementation of "decimal"
(VarDecAdd, VarDecSub, VarDecMul) or to the port of these functions found under palrt (as far as I understand the code).

These methods are optimised for 32-bit platforms; by using 64-bit instructions it is possible to improve performance significantly.
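To illustrate why wider instructions help, here is a minimal sketch of adding two 96-bit mantissas, once limb by limb in 32-bit pieces and once with a single 64-bit addition for the low part. The `Mantissa96` struct and both function names are assumptions made up for this example; this is not the code from coreclr or from the proof of concept.

```cpp
#include <cstdint>

// Hypothetical 96-bit mantissa stored as three 32-bit limbs, least significant first.
struct Mantissa96 {
    std::uint32_t lo, mid, hi;
};

// 32-bit style: three additions, each feeding its carry into the next limb.
// Returns true if the sum does not fit in 96 bits.
bool Add96_32bit(const Mantissa96& a, const Mantissa96& b, Mantissa96& out) {
    std::uint64_t s = std::uint64_t{a.lo} + b.lo;
    out.lo = static_cast<std::uint32_t>(s);
    s = std::uint64_t{a.mid} + b.mid + (s >> 32);
    out.mid = static_cast<std::uint32_t>(s);
    s = std::uint64_t{a.hi} + b.hi + (s >> 32);
    out.hi = static_cast<std::uint32_t>(s);
    return (s >> 32) != 0;
}

// 64-bit style: the two low limbs are added in one 64-bit operation,
// so only one carry has to be propagated into the high limb.
bool Add96_64bit(const Mantissa96& a, const Mantissa96& b, Mantissa96& out) {
    const std::uint64_t aLow = (std::uint64_t{a.mid} << 32) | a.lo;
    const std::uint64_t bLow = (std::uint64_t{b.mid} << 32) | b.lo;
    const std::uint64_t low = aLow + bLow;             // wraps modulo 2^64
    const std::uint64_t carry = (low < aLow) ? 1 : 0;  // a wrapped sum means carry out
    const std::uint64_t high = std::uint64_t{a.hi} + b.hi + carry;
    out.lo = static_cast<std::uint32_t>(low);
    out.mid = static_cast<std::uint32_t>(low >> 32);
    out.hi = static_cast<std::uint32_t>(high);
    return (high >> 32) != 0;
}
```

Both functions compute the same result; the second simply needs fewer dependent steps on a 64-bit CPU.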

I have written a small proof of concept to illustrate the gains. I am looking forward to trying to integrate the code with coreclr, but I have some questions and want feedback on how best to integrate it before submitting a PR with those methods.

Questions:

  • Can these improvements make their way to desktop CLR?
  • Which methods are most common?
    I would assume that, apart from the basic arithmetic operations, conversions to/from text as well as double/int are quite common.
  • Which methods have the most to gain from this?
    I have focused on +, -, / and * for now, but there might be other low-hanging fruit.
  • How best to integrate it into the coreclr code?
    I am thinking of keeping an _x64 suffix on these methods (VarDecAdd_x64)
    and only including the code on x64 platforms.
    This would be coupled with a macro that redefines VarDecAdd as VarDecAdd_x64 to route all calls to the x64 implementation.
  • I am not really proficient with cmake, so if I run into problems with the integration it would be great if someone were willing to help.
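The macro-based routing mentioned in the list could look roughly like the sketch below. The functions here are empty stand-ins that only report which implementation was reached (the real VarDecAdd takes DECIMAL pointers and returns an HRESULT), and the `Impl` enum, `CallSite` and `DECIMAL_X64_ROUTED` names are assumptions invented for this example.

```cpp
// Marker used only in this sketch to show which implementation was called.
enum class Impl { Generic, X64 };

// Stand-ins for the existing implementation and the proposed x64 variant.
inline Impl VarDecAdd()     { return Impl::Generic; }
inline Impl VarDecAdd_x64() { return Impl::X64; }

// On x64 builds, redirect the original name to the suffixed implementation.
// The macro must come after the declarations so they are not renamed as well.
#if defined(_M_X64) || defined(__x86_64__)
#define VarDecAdd VarDecAdd_x64
#define DECIMAL_X64_ROUTED 1
#else
#define DECIMAL_X64_ROUTED 0
#endif

// An existing call site stays unchanged and is routed by the preprocessor.
inline Impl CallSite() { return VarDecAdd(); }
```

One trade-off of this approach is that the redirect is invisible at the call site, so the header defining the macro has to be included consistently wherever the functions are called.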

Proof of Concept

I have created x64-aware methods for the arithmetic operations Add, Sub, Mul and Div
based on the current code in coreclr. There are no real changes to the algorithms or other logic,
apart from changing 32-bit arithmetic to 64-bit and adapting to the consequences of that,
and using some intrinsics for bit search and carry propagation.
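As a rough idea of what intrinsics for bit search and carry propagation can look like, here is a small sketch with portable fallbacks. The wrapper names `HighestSetBit` and `AddCarry64` are assumptions for this example, and which intrinsics the proof of concept actually uses is not stated here; `_BitScanReverse64`, `__builtin_clzll` and `_addcarry_u64` are simply the usual candidates on MSVC and GCC/Clang.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

// Index (0..63) of the highest set bit. The value must be non-zero.
inline unsigned HighestSetBit(std::uint64_t value) {
    assert(value != 0);
#if defined(_MSC_VER) && defined(_M_X64)
    unsigned long index;
    _BitScanReverse64(&index, value);
    return static_cast<unsigned>(index);
#elif defined(__GNUC__)
    return 63u - static_cast<unsigned>(__builtin_clzll(value));
#else
    unsigned index = 0;
    while (value >>= 1) ++index;  // portable fallback: shift until empty
    return index;
#endif
}

// Computes sum = a + b + carryIn and returns the carry out (0 or 1).
// carryIn is expected to be 0 or 1.
inline unsigned char AddCarry64(unsigned char carryIn, std::uint64_t a,
                                std::uint64_t b, std::uint64_t& sum) {
#if defined(__x86_64__) || defined(_M_X64)
    unsigned long long result = 0;
    const unsigned char carryOut = _addcarry_u64(carryIn, a, b, &result);
    sum = result;
    return carryOut;
#else
    const std::uint64_t partial = a + b;  // may wrap modulo 2^64
    sum = partial + carryIn;              // may wrap once more
    return static_cast<unsigned char>((partial < a) | (sum < partial));
#endif
}
```

Chaining `AddCarry64` calls, with each returned carry passed to the next call, propagates the carry across a multi-word value.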

This is a summary of the measurements from the example project, which can be found at:
https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting

Measurement

https://github.com/Daniel-Svensson/ClrExperiments/blob/master/ClrDecimal/coreclrtesting/main.cpp

In short, the program generates a number of "semi-random" inputs where different numbers of bits are
set. Then it calls the method under test for all combinations of the inputs.
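The general shape of such a test driver might be sketched as follows. This is not the code from main.cpp: the `Bits96` struct and the `LowBitsSet`, `MakeInputs` and `RunAllPairs` helpers are hypothetical, the inputs are plain 00...111 patterns rather than semi-random ones, and the timing itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 96-bit value, least significant limb first.
struct Bits96 {
    std::uint32_t lo, mid, hi;
};

// Builds a 00...111 pattern with the lowest `bits` bits set (0..96).
inline Bits96 LowBitsSet(unsigned bits) {
    assert(bits <= 96);
    auto limb = [](unsigned n) -> std::uint32_t {
        if (n >= 32) return 0xFFFFFFFFu;
        return n == 0 ? 0u : (1u << n) - 1u;
    };
    Bits96 v;
    v.lo = limb(bits);
    v.mid = limb(bits > 32 ? bits - 32 : 0);
    v.hi = limb(bits > 64 ? bits - 64 : 0);
    return v;
}

// One input per bit count 0, step, 2*step, ... up to 96.
inline std::vector<Bits96> MakeInputs(unsigned step) {
    assert(step > 0);
    std::vector<Bits96> inputs;
    for (unsigned bits = 0; bits <= 96; bits += step)
        inputs.push_back(LowBitsSet(bits));
    return inputs;
}

// Calls `op` for every ordered pair of inputs and returns the number of calls.
// A real benchmark would start and stop a timer around these loops.
template <typename Op>
std::size_t RunAllPairs(const std::vector<Bits96>& inputs, Op op) {
    std::size_t calls = 0;
    for (const Bits96& a : inputs) {
        for (const Bits96& b : inputs) {
            op(a, b);
            ++calls;
        }
    }
    return calls;
}
```

Running every pair means the number of calls grows with the square of the input count, which is worth keeping in mind when choosing how many patterns to generate.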

Results:

See the results folder (https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting/results)
for complete output results.
I have tried to summarize the results for both the Core i5 2500K and the i7 6700K below.
The i5 results were produced with a few minor changes but otherwise the same code;
the biggest performance spread is in some of the division tests.

The results below are against the oleauto32 implementation. Compared against the
implementations in palrt, the results are very similar, since the palrt timings are within a few
percent of those for oleauto32.

Multiply

  • Measurements removed, see post below for updated values

Add / Sub

  • Measurements removed, see post below for updated values

Div

Speedup range: 10-270%
For mixed input (all 00...111 bitpatterns, with all scales and signs): ~100%

Measurement                      Speedup
32 x 32 bit                      >50%
32 x 32 bit with scale           ~37%
64 x 64 bit no scale             109-118%
64 x 64 bit varying scale        94-
96 x 96 bit                      >102%

Labels: area-System.Runtime, enhancement (product code improvement that does NOT require public API changes/additions), tenet-performance (performance related issue)
