Improve System.Decimal performance for x64 platform #7778

The current implementation forwards calls to the Windows implementation of "decimal"
(VarDecAdd, VarDecSub, VarDecMul) or to the port of these functions found under palrt (as far as I understand the code).

These methods are optimised for 32-bit platforms; by using 64-bit instructions it is possible to improve performance significantly.

I have written a small proof of concept to illustrate the gains. I am looking forward to trying to integrate the code with coreclr, but I have some questions and want feedback on how best to integrate it before submitting a PR with those methods.

Questions:
* Can these improvements make their way to the desktop CLR?
* Which methods are most common?
  I would assume that, apart from the basic arithmetic operations, conversions to/from text as well as double/int are quite common.
* Which methods have the most to gain from this?
  I have focused on +, -, / and * for now, but there might be some other low-hanging fruit.
* How best to integrate it into the coreclr code?
  I am thinking of keeping an _x64 suffix on these methods (VarDecAdd_x64)
  and only including the code for x64 platforms.
  This would be coupled with a macro to redefine VarDecAdd as VarDecAdd_x64 to route all calls to the x64 implementation.
* I am not really proficient with cmake, so if I run into problems with the integration it would be great if someone was willing to help.
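To make the macro routing idea concrete, here is a minimal sketch. It is not the coreclr code: `DecimalStub`, `Impl`, `kIsX64Build` and `AddThroughPublicName` are hypothetical names made up for illustration, and both functions are stubs that only report which implementation ran. The only parts taken from the proposal are the `VarDecAdd` / `VarDecAdd_x64` names and the idea of a platform-guarded `#define`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for DECIMAL; the real struct comes from the platform headers.
struct DecimalStub {
    uint32_t lo32, mid32, hi32;  // 96-bit mantissa
    uint8_t scale;
    bool negative;
};

// Tag telling the caller which implementation ran (illustration only).
enum class Impl { Generic, X64 };

// Stub standing in for the existing 32-bit oriented implementation.
inline Impl VarDecAdd(const DecimalStub*, const DecimalStub*, DecimalStub*) {
    return Impl::Generic;
}

#if defined(_M_X64) || defined(__x86_64__)
constexpr bool kIsX64Build = true;

// Stub standing in for the 64-bit implementation, compiled only on x64.
inline Impl VarDecAdd_x64(const DecimalStub*, const DecimalStub*, DecimalStub*) {
    return Impl::X64;
}

// From here on, every use of the name VarDecAdd resolves to the x64 version.
#define VarDecAdd VarDecAdd_x64
#else
constexpr bool kIsX64Build = false;
#endif

// A call site that only knows the public name and needs no changes.
inline Impl AddThroughPublicName(const DecimalStub& a, const DecimalStub& b,
                                 DecimalStub& out) {
    return VarDecAdd(&a, &b, &out);
}
```

The point of the sketch is that existing call sites stay untouched; the trade-off is that the macro has to be defined after the original declaration and before any caller, so header ordering matters.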

# Proof of Concept

I have created x64-aware methods for the arithmetic operations Add, Sub, Mul and Div
based on the current code in coreclr. There are no real changes to the algorithms or other logic
apart from changing 32-bit arithmetic to 64-bit, along with some consequences of that,
and using some intrinsics for bit search and carry propagation.
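As a rough illustration of what moving from 32-bit to 64-bit arithmetic means for a 96-bit mantissa, here is a small sketch. It is not taken from the proof of concept: `Mantissa96`, `Add96` and `HighestSetBit` are hypothetical names, and the code is written in portable C++ so it runs anywhere. On x64 the manual carry check could presumably be replaced by an intrinsic such as `_addcarry_u64`, and the loop in `HighestSetBit` by a bit scan intrinsic such as `_BitScanReverse64` on MSVC.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 96-bit mantissa split as one 64-bit limb plus one 32-bit limb,
// instead of three 32-bit limbs.
struct Mantissa96 {
    uint64_t low64;
    uint32_t high32;
};

// Adds two 96-bit mantissas. Returns true if the sum does not fit in 96 bits.
inline bool Add96(const Mantissa96& a, const Mantissa96& b, Mantissa96& out) {
    uint64_t low = a.low64 + b.low64;           // wraps modulo 2^64
    uint64_t carry = (low < a.low64) ? 1 : 0;   // wrapped => carry into the high limb
    // The widened sum is at most 2 * (2^32 - 1) + 1, so it cannot overflow 64 bits.
    uint64_t high = static_cast<uint64_t>(a.high32) + b.high32 + carry;
    out.low64 = low;
    out.high32 = static_cast<uint32_t>(high);
    return (high >> 32) != 0;
}

// Portable bit search: index of the highest set bit, or -1 for zero.
inline int HighestSetBit(uint64_t value) {
    int index = -1;
    while (value != 0) {
        value >>= 1;
        ++index;
    }
    return index;
}
```

With three 32-bit limbs the same addition needs two carry steps; with this layout it needs one, which is the kind of saving the 64-bit versions rely on.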

This is a summary of the measurements from the example project, which can be found at:
https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting

## Measurement

https://github.com/Daniel-Svensson/ClrExperiments/blob/master/ClrDecimal/coreclrtesting/main.cpp

In short, the program generates a number of "semi-random" inputs where different numbers of bits are
set. It then calls the method under test for all combinations of the inputs.
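The shape of such a measurement loop can be sketched as follows. This is not the code from main.cpp: `Input96`, `MakeLowBitPatterns` and `TimeAllPairsNs` are hypothetical names, and for simplicity the sketch only generates deterministic 00...0111 patterns rather than semi-random values.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <vector>

// Hypothetical 96-bit input value (low 64 bits plus high 32 bits).
struct Input96 {
    uint64_t low64;
    uint32_t high32;
};

// Builds patterns with the lowest `bits` bits set, for bits = 0, step, 2*step, ... up to 96.
inline std::vector<Input96> MakeLowBitPatterns(int step) {
    std::vector<Input96> inputs;
    for (int bits = 0; bits <= 96; bits += step) {
        Input96 v{0, 0};
        if (bits >= 64) {
            v.low64 = ~uint64_t{0};
            int highBits = bits - 64;  // 0..32
            v.high32 = (highBits == 32) ? ~uint32_t{0}
                                        : (uint32_t{1} << highBits) - 1;
        } else {
            v.low64 = (uint64_t{1} << bits) - 1;  // bits < 64, so the shift is defined
        }
        inputs.push_back(v);
    }
    return inputs;
}

// Calls `op` for every ordered pair of inputs and returns the elapsed time in
// nanoseconds. Results are accumulated into `checksum` so the calls are not
// optimised away.
template <typename Op>
inline double TimeAllPairsNs(const std::vector<Input96>& inputs, Op op,
                             uint64_t& checksum) {
    auto start = std::chrono::steady_clock::now();
    for (const Input96& a : inputs) {
        for (const Input96& b : inputs) {
            checksum += op(a, b);
        }
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(end - start).count();
}
```

Running the same input set through both the existing and the x64 implementation and comparing the two timings gives a relative number per test case.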

#### Results: 

See the results folder (https://github.com/Daniel-Svensson/ClrExperiments/tree/master/ClrDecimal/coreclrtesting/results)
for the complete output.
I have tried to summarize the results for both the Core i5 2500K and the i7 6700K below.
The i5 results are from code with a few minor changes but otherwise the same;
the biggest performance spread is in some of the division tests.

The results below are against the oleaut32 implementation. Compared against the
implementations in palrt the results are very similar, since the palrt timings are within a few
% of the timings for oleaut32.

### Multiply

*Measurements removed, see post below for updated values*

### Add / Sub

*Measurements removed, see post below for updated values*

### Div 

Speedup range: 10-270%
For mixed input (all 00...111 bitpatterns, with all scales and signs): ~100%


| Measurement | Speedup |
|---|---|
| 32 x 32 bit | >50% |
| 32 x 32 bit with scale | ~37% |
| 64 x 64 bit no scale | 109-118% |
| 64 x 64 bit varying scale | 94- |
| 96 x 96 bit | >102% |




