Optimize conversions between `Half` and `Single`

### Description

Currently, conversions between `Half` and `float` are implemented only in software, which makes them slow.  
It would be ideal if issue #62416 were resolved, but a better software fallback is still needed for CPUs such as Sandy Bridge, which do not support hardware conversion.  

### Configuration

``` ini
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
```

### Regression?

No

### Data

I benchmarked the code below.  
EDIT: Removed data biases.  
EDIT2: Added random permutation.

<details>
<summary>Benchmark code for Half to Single conversion</summary>

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

namespace HalfConversionBenchmarks
{
    public enum InputValueType
    {
        Sequential,
        Permuted,
        RandomUniform,
        RandomSubnormal,
        RandomNormal,
        RandomInfNaN
    }

    [CategoriesColumn]
    [GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByParams, BenchmarkLogicalGroupRule.ByCategory)]
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    [AnyCategoriesFilter(CategoryStandard)]
    public class HalfToSingleConversionBenchmarks
    {
        private const string CategorySimple = "Simple";
        private const string CategoryStandard = "Standard";
        private const string CategoryUnrolled = "Unrolled";

        private Half[] bufferA;
        private float[] bufferDst;

        [Params(65536)]
        public int Frames { get; set; }
        [Params(InputValueType.Sequential, InputValueType.Permuted)]
        public InputValueType InputValue { get; set; }
        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            bufferDst = new float[samples];
            var bA = bufferA = new Half[samples];
            var spanA = bA.AsSpan();
            switch (InputValue)
            {
                case InputValueType.Permuted:
                    FillSequential(spanA);
                    ref var x9 = ref MemoryMarshal.GetReference(spanA);
                    var length = spanA.Length;
                    var olen = length - 1;  //Fisher-Yates: i must run up to length - 2 inclusive
                    for (var i = 0; i < olen; i++)
                    {
                        //Using RandomNumberGenerator in order to prevent predictability
                        var x = RandomNumberGenerator.GetInt32(i, length);
                        (Unsafe.Add(ref x9, x), Unsafe.Add(ref x9, i)) = (Unsafe.Add(ref x9, i), Unsafe.Add(ref x9, x));
                    }
                    break;
                case InputValueType.RandomUniform:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(spanA));
                    break;
                case InputValueType.RandomSubnormal:
                    for (var i = 0; i < spanA.Length; i++)
                    {
                        var r = (ushort)RandomNumberGenerator.GetInt32(0x7fe);
                        spanA[i] = BitConverter.UInt16BitsToHalf(ushort.RotateRight(r, 1));
                    }
                    break;
                case InputValueType.RandomNormal:
                    for (var i = 0; i < spanA.Length; i++)
                    {
                        var r = (ushort)RandomNumberGenerator.GetInt32(0xF000);
                        spanA[i] = BitConverter.UInt16BitsToHalf((ushort)(ushort.RotateRight(r, 1) + 0x0400u));
                    }
                    break;
                case InputValueType.RandomInfNaN:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(spanA));
                    for (var i = 0; i < spanA.Length; i++)
                    {
                        var r = BitConverter.HalfToUInt16Bits(spanA[i]);
                        spanA[i] = BitConverter.UInt16BitsToHalf((ushort)(r | 0x7c00u));
                    }
                    break;
                default:
                    FillSequential(spanA);
                    break;
            }
            static void FillSequential(Span<Half> spanA)
            {
                for (var i = 0; i < spanA.Length; i++)
                {
                    spanA[i] = BitConverter.UInt16BitsToHalf((ushort)i);
                }
            }
        }

        [BenchmarkCategory(CategorySimple, CategoryStandard)]
        [Benchmark(Baseline = true)]
        public void SimpleLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }

        #region Unrolled

        [BenchmarkCategory(CategoryUnrolled, CategoryStandard)]
        [Benchmark(Baseline = true)]
        public void UnrolledLoopStandard()
        {
            var bA = bufferA.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
            }
        }
        #endregion
    }
}
```

|               Method |        Categories | Frames | InputValue |     Mean |   Error |  StdDev | Ratio | Code Size |
|--------------------- |------------------ |------- |----------- |---------:|--------:|--------:|------:|----------:|
|   **SimpleLoopStandard** |   **Simple,Standard** |  **65536** | **Sequential** | **180.1 μs** | **1.51 μs** | **1.41 μs** |  **1.00** |     **298 B** |
|                      |                   |        |            |          |         |         |       |           |
| UnrolledLoopStandard | Unrolled,Standard |  65536 | Sequential | 196.4 μs | 1.40 μs | 1.24 μs |  1.00 |     397 B |
|                      |                   |        |            |          |         |         |       |           |
|   **SimpleLoopStandard** |   **Simple,Standard** |  **65536** |   **Permuted** | **372.2 μs** | **2.63 μs** | **2.33 μs** |  **1.00** |     **298 B** |
|                      |                   |        |            |          |         |         |       |           |
| UnrolledLoopStandard | Unrolled,Standard |  65536 |   Permuted | 385.0 μs | 1.05 μs | 0.87 μs |  1.00 |     397 B |

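Sequential input converts roughly twice as fast as permuted input (180.1 μs vs. 372.2 μs for the simple loop), which suggests that the current implementation benefits from something like branch prediction when neighbouring values take the same code path.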
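
As a rough illustration of why sequential input could be friendlier to a branch predictor, the sketch below counts how often the value class (zero, subnormal, normal, infinity, NaN) changes between neighbouring elements. `HalfClassRuns`, `Classify`, and `CountClassChanges` are hypothetical names introduced here, and the assumption that the current implementation branches along these classes is mine, not something measured.

```csharp
using System;

public static class HalfClassRuns
{
    public enum HalfClass { Zero, Subnormal, Normal, Infinity, NaN }

    // Classifies a raw Half bit pattern by its 5-bit exponent and 10-bit fraction fields.
    public static HalfClass Classify(ushort bits)
    {
        int exponent = (bits >> 10) & 0x1F;
        int fraction = bits & 0x03FF;
        if (exponent == 0) return fraction == 0 ? HalfClass.Zero : HalfClass.Subnormal;
        if (exponent == 0x1F) return fraction == 0 ? HalfClass.Infinity : HalfClass.NaN;
        return HalfClass.Normal;
    }

    // Counts how often the class differs between an element and its predecessor.
    public static int CountClassChanges(ReadOnlySpan<ushort> bits)
    {
        int changes = 0;
        for (int i = 1; i < bits.Length; i++)
        {
            if (Classify(bits[i]) != Classify(bits[i - 1])) changes++;
        }
        return changes;
    }
}
```

For the bit patterns 0 through 65535 in order, the class changes only 9 times, so a class-dependent branch would be almost perfectly predictable. After a random permutation the classes are interleaved, and the same branch becomes much harder to predict.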

</details>

<details>
<summary>Benchmark code for Single to Half conversion</summary>

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;

namespace HalfConversionBenchmarks
{
    [CategoriesColumn]
    [SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
    [GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByParams, BenchmarkLogicalGroupRule.ByCategory)]
    [DisassemblyDiagnoser(maxDepth: int.MaxValue)]
    [AnyCategoriesFilter(CategoryStandard)]
    public class SingleToHalfConversionBenchmarks
    {
        [Params(65536)]
        public int Frames { get; set; }

        [ParamsAllValues]
        public InputValueType InputValue { get; set; }

        private const string CategorySimple = "Simple";
        private const string CategoryUnrolled = "Unrolled";
        private const string CategoryVectorized = "Vectorized";
        private const string CategoryStandard = "Standard";

        private float[] bufferSrc;
        private Half[] bufferDst;

        [GlobalSetup]
        public void Setup()
        {
            var samples = Frames;
            var vS = bufferSrc = new float[samples];
            bufferDst = new Half[samples];
            var vspan = vS.AsSpan();
            switch (InputValue)
            {
                case InputValueType.Permuted:
                    FillSequential(vspan);
                    //Random Permutation
                    ref var x9 = ref MemoryMarshal.GetReference(vspan);
                    var length = vspan.Length;
                    var olen = length - 1;  //Fisher-Yates: i must run up to length - 2 inclusive
                    for (var i = 0; i < olen; i++)
                    {
                        //Using RandomNumberGenerator in order to prevent predictability
                        var x = RandomNumberGenerator.GetInt32(i, length);
                        (Unsafe.Add(ref x9, x), Unsafe.Add(ref x9, i)) = (Unsafe.Add(ref x9, i), Unsafe.Add(ref x9, x));
                    }
                    break;
                case InputValueType.RandomUniform:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(vspan));
                    break;
                case InputValueType.RandomSubnormal:
                    for (var i = 0; i < vspan.Length; i++)
                    {
                        var r = (uint)RandomNumberGenerator.GetInt32(0x70FF_BFFE);
                        vspan[i] = BitConverter.UInt32BitsToSingle(uint.RotateRight(r, 1));
                    }
                    break;
                case InputValueType.RandomNormal:
                    for (var i = 0; i < vspan.Length; i++)
                    {
                        var r = (uint)RandomNumberGenerator.GetInt32(0x1E00_1FFE);
                        vspan[i] = BitConverter.UInt32BitsToSingle(uint.RotateRight(r, 1) + 947904512u);
                    }
                    break;
                case InputValueType.RandomInfNaN:
                    RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(vspan));
                    for (var i = 0; i < vspan.Length; i++)
                    {
                        var r = BitConverter.SingleToUInt32Bits(vspan[i]);
                        vspan[i] = BitConverter.UInt32BitsToSingle(r | 0x7f80_0000u);
                    }
                    break;
                default:
                    FillSequential(vspan);
                    break;
            }

            static void FillSequential(Span<float> vspan)
            {
                for (var i = 0; i < vspan.Length; i++)
                {
                    vspan[i] = (float)BitConverter.UInt16BitsToHalf((ushort)i);
                }
            }
        }
        [BenchmarkCategory(CategorySimple, CategoryStandard)]
        [Benchmark(Baseline = true)]
        public void SimpleLoopStandard()
        {
            var bA = bufferSrc.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (Half)Unsafe.Add(ref rsi, i);
            }
        }
        #region Unrolled
        [BenchmarkCategory(CategoryUnrolled, CategoryStandard)]
        [Benchmark]
        public void UnrolledLoopStandard()
        {
            var bA = bufferSrc.AsSpan();
            var bD = bufferDst.AsSpan();
            ref var rsi = ref MemoryMarshal.GetReference(bA);
            ref var rdi = ref MemoryMarshal.GetReference(bD);
            nint i = 0, length = Math.Min(bA.Length, bD.Length);
            var olen = length - 3;
            for (; i < olen; i += 4)
            {
                Unsafe.Add(ref rdi, i + 0) = (Half)Unsafe.Add(ref rsi, i + 0);
                Unsafe.Add(ref rdi, i + 1) = (Half)Unsafe.Add(ref rsi, i + 1);
                Unsafe.Add(ref rdi, i + 2) = (Half)Unsafe.Add(ref rsi, i + 2);
                Unsafe.Add(ref rdi, i + 3) = (Half)Unsafe.Add(ref rsi, i + 3);
            }
            for (; i < length; i++)
            {
                Unsafe.Add(ref rdi, i) = (Half)Unsafe.Add(ref rsi, i);
            }
        }
        #endregion
    }
}
```

|               Method |        Categories | Frames | InputValue |     Mean |   Error |  StdDev | Ratio | RatioSD | Code Size |
|--------------------- |------------------ |------- |----------- |---------:|--------:|--------:|------:|--------:|----------:|
|   **SimpleLoopStandard** |   **Simple,Standard** |  **65536** | **Sequential** | **634.7 μs** | **4.77 μs** | **4.23 μs** |  **1.00** |    **0.00** |     **592 B** |
|                      |                   |        |            |          |         |         |       |         |           |
| UnrolledLoopStandard | Unrolled,Standard |  65536 | Sequential | 619.9 μs | 2.95 μs | 2.62 μs |     ? |       ? |     699 B |
|                      |                   |        |            |          |         |         |       |         |           |
|   **SimpleLoopStandard** |   **Simple,Standard** |  **65536** |   **Permuted** | **674.4 μs** | **1.47 μs** | **1.22 μs** |  **1.00** |    **0.00** |     **592 B** |
|                      |                   |        |            |          |         |         |       |         |           |
| UnrolledLoopStandard | Unrolled,Standard |  65536 |   Permuted | 675.3 μs | 6.84 μs | 6.06 μs |     ? |       ? |     699 B |

</details>

### Analysis

#### Converting `Half` to `float`

The [current code](https://github.com/dotnet/runtime/blob/621cd59436cb29cab4b1162409ae0947c4bd780d/src/libraries/System.Private.CoreLib/src/System/Half.cs#L599) branches heavily, which is a likely source of inefficiency.  
Replacing the branches with bit manipulation, and handling subnormal values with floating-point arithmetic instead, improves performance on CPUs with fast FPUs.  
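
To make the floating-point trick for subnormals concrete, here is a minimal sketch that handles only the subnormal case in isolation. The class and method names are hypothetical. The idea is to place the 10 fraction bits under the exponent of 2^-14 (the smallest normal `Half`, `0x3880_0000` as a `float`) and then subtract 2^-14 again, so the FPU performs the normalization that would otherwise need a loop or a leading-zero count.

```csharp
using System;

public static class SubnormalTrickDemo
{
    // Hypothetical helper: converts a non-negative zero/subnormal Half bit pattern
    // (0x0000..0x03FF) to float without branching on the fraction bits.
    public static float SubnormalHalfBitsToSingle(ushort bits)
    {
        const uint MinNormalHalfAsSingle = 0x3880_0000u;    // 2^-14 as a float bit pattern
        uint fraction = (uint)(bits & 0x03FF) << 13;        // Align Half fraction with float fraction
        // 2^-14 * (1 + m / 1024) = 2^-14 + m * 2^-24
        float biased = BitConverter.UInt32BitsToSingle(MinNormalHalfAsSingle | fraction);
        // Subtracting 2^-14 leaves exactly m * 2^-24, the value of the subnormal Half.
        return biased - BitConverter.UInt32BitsToSingle(MinNormalHalfAsSingle);
    }
}
```

The subtraction is exact because `m * 2^-24` is always representable as a normal `float`, so no rounding error is introduced.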

<details>
<summary>My proposal for a new software fallback converting Half to float</summary>

EDIT: The previously proposed algorithm turned out to be slower with the new input data.  
The code below converts `Half` to `float` about twice as fast as the current implementation.  
I've tested this code in a test project against all 65536 possible `Half` values.  

```csharp
using System;
using System.Runtime.CompilerServices;

namespace BetterHalfToSingleConversion
{
    public static class HalfUtils
    {
        [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
        public static float ConvertHalfToSingle2(Half value)
        {
            const uint ExponentLowerBound = 0x3880_0000u;   //The smallest positive normal number in Half, converted to Single
            const uint ExponentOffset = 0x3800_0000u;       //BitConverter.SingleToUInt32Bits(1.0f) - ((uint)BitConverter.HalfToUInt16Bits((Half)1.0f) << 13)
            const uint FloatSignMask = 0x8000_0000u;        //Mask for sign bit in Single
            var h = BitConverter.HalfToInt16Bits(value);    //Extract the internal representation of value
            var v = (uint)(int)h;   //Copy sign bit to upper bits
            var e = v & 0x7c00u;    //Extract exponent bits of value
            var c = e == 0u;        //true when value is zero or subnormal
            var hc = (uint)-Unsafe.As<bool, byte>(ref c);   //~0u when c is true, 0 otherwise
            var b = e == 0x7c00u;   //true when value is either Infinity or NaN
            var hb = (uint)-Unsafe.As<bool, byte>(ref b);   //~0u when b is true, 0 otherwise
            var n = hc & ExponentLowerBound;    //n is 0x3880_0000u if c is true, 0 otherwise
            var j = ExponentOffset | n;         //j is now 0x3880_0000u if value is subnormal, 0x3800_0000u otherwise
            v <<= 13;                           //Match the position of the boundary of exponent bits and fraction bits with IEEE 754 Binary32(Single)
            j += j & hb;                        //Double the j if value is either Infinity or NaN
            var s = v & FloatSignMask;          //Extract sign bit of value
            v &= 0x0FFF_E000;                   //Extract exponent bits and fraction bits of value
            v += j;                             //Adjust exponent to match the range of exponent
            var k = BitConverter.SingleToUInt32Bits(BitConverter.UInt32BitsToSingle(v) - BitConverter.UInt32BitsToSingle(n));   //If value is subnormal, remove unnecessary 1 on top of fraction bits.
            return BitConverter.UInt32BitsToSingle(k | s);  //Merge sign bit with rest
        }
    }
}
```

Test and benchmark code is available in [this repository](https://github.com/MineCake147E/BetterHalfConversion), along with several alternative approaches.  
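
For readers who want to reproduce the exhaustive check without cloning the repository, a minimal sketch of such a harness follows. It is not the repository's actual test code. `HalfToSingleExhaustiveCheck` and `CountMismatches` are hypothetical names, and treating every NaN result as equivalent is an assumption made here for simplicity.

```csharp
using System;

public static class HalfToSingleExhaustiveCheck
{
    // Hypothetical harness: counts the Half bit patterns for which the candidate
    // disagrees with the built-in explicit conversion.
    public static int CountMismatches(Func<Half, float> candidate)
    {
        int mismatches = 0;
        for (int i = 0; i <= ushort.MaxValue; i++)
        {
            Half input = BitConverter.UInt16BitsToHalf((ushort)i);
            float expected = (float)input;
            float actual = candidate(input);
            // Assumption: any NaN is accepted for a NaN input; payloads are not compared.
            if (float.IsNaN(expected) && float.IsNaN(actual)) continue;
            // Compare bit patterns so that +0.0 and -0.0 are told apart.
            if (BitConverter.SingleToUInt32Bits(expected) != BitConverter.SingleToUInt32Bits(actual))
            {
                mismatches++;
            }
        }
        return mismatches;
    }
}
```

A call such as `HalfToSingleExhaustiveCheck.CountMismatches(HalfUtils.ConvertHalfToSingle2)` would then report how many inputs differ from the built-in conversion.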

The benchmark result is as follows:

``` ini

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100
  [Host]     : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
  DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT


```
|               Method |        Categories | Frames | InputValue |     Mean |   Error |  StdDev | Ratio | RatioSD | Code Size |
|--------------------- |------------------ |------- |----------- |---------:|--------:|--------:|------:|--------:|----------:|
|   SimpleLoopStandard |   Simple,Standard |  65536 |   Permuted | 389.1 μs | 3.36 μs | 2.98 μs |  1.00 |    0.00 |     298 B |
|                      |                   |        |            |          |         |         |       |         |           |
|       SimpleLoopNew2 |       Simple,New2 |  65536 |   Permuted | 169.8 μs | 1.10 μs | 1.03 μs |     ? |       ? |     223 B |
|                      |                   |        |            |          |         |         |       |         |           |
| UnrolledLoopStandard | Unrolled,Standard |  65536 |   Permuted | 388.2 μs | 2.54 μs | 2.37 μs |  1.00 |    0.00 |     397 B |
|                      |                   |        |            |          |         |         |       |         |           |
|     UnrolledLoopNew2 |     Unrolled,New2 |  65536 |   Permuted | 154.5 μs | 3.05 μs | 2.85 μs |     ? |       ? |     745 B |


</details>

#### Converting `float` to `Half`

The [current code](https://github.com/dotnet/runtime/blob/00e6482544b435c66279ffd7abf43e9a7ead0236/src/libraries/System.Private.CoreLib/src/System/Half.cs#L608) also branches heavily, which is a likely source of inefficiency.  
Again, replacing the branches with bit manipulation, and handling rounding and subnormal values with floating-point arithmetic, improves performance on CPUs with fast FPUs.  
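
The core of the floating-point trick in this direction is to let the FPU do the rounding. The sketch below isolates that step under stated assumptions: the input is finite with a magnitude below 65520, and `RoundingTrickDemo` and `RoundToHalfPrecision` are hypothetical names. Adding a power of two that is 2^13 times larger than the value's own exponent pushes the 13 surplus fraction bits off the end of the `float`, so the addition itself rounds to nearest, ties to even, at `Half` precision.

```csharp
using System;

public static class RoundingTrickDemo
{
    // Hypothetical helper: rounds a finite float with |value| < 65520 to the nearest
    // value representable in Half, and returns that value as a float.
    public static float RoundToHalfPrecision(float value)
    {
        const uint MinNormalHalfAsSingle = 0x3880_0000u;    // 2^-14, smallest normal Half
        const uint ExponentMask = 0x7F80_0000u;
        const uint ThirteenExponentSteps = 0x0680_0000u;    // 13 << 23

        float magnitude = MathF.Abs(value);
        uint bits = BitConverter.SingleToUInt32Bits(magnitude);
        // Clamp the exponent from below so Half subnormals round at a fixed step of 2^-24.
        uint exponent = Math.Max(bits, MinNormalHalfAsSingle) & ExponentMask;
        float magic = BitConverter.UInt32BitsToSingle(exponent + ThirteenExponentSteps);
        float sum = (float)(magnitude + magic);     // The rounding happens in this addition
        float rounded = (float)(sum - magic);       // Exact: removes the magic offset again
        return MathF.CopySign(rounded, value);
    }
}
```

Overflow to infinity and NaN handling are deliberately left out here. A complete conversion still needs them, along with the final repacking into 16 bits.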

<details>
<summary>My proposal for a new software fallback converting float to Half</summary>

The code below converts `float` to `Half` about twice as fast as the current implementation.  
I've tested this code in a test project against all 4,294,967,296 possible `float` values.

```csharp
using System;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

namespace BetterHalfToSingleConversion
{
    public static class HalfUtils
    {
        //Among several approaches, I selected the fastest one (excluding vectorized ones).
        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public static Half ConvertSingleToHalf4(float value)
        {
            var v0 = Vector128.CreateScalarUnsafe(0x3880_0000u); //Minimum exponent for rounding
            var v1 = Vector128.CreateScalarUnsafe(0x3800_0000u); //Exponent displacement #1
            var v2 = Vector128.CreateScalarUnsafe(0x8000_0000u); //Sign bit
            var v3 = Vector128.CreateScalarUnsafe(0x7f80_0000u); //Exponent mask
            var v4 = Vector128.CreateScalarUnsafe(0x0680_0000u); //Exponent displacement #2
            var v5 = Vector128.CreateScalarUnsafe(65520.0f);     //Smallest magnitude that rounds to Infinity in Half (Half.MaxValue is 65504)
            var v = BitConverter.SingleToUInt32Bits(value);
            var vval = Vector128.CreateScalarUnsafe(value);
            vval = (vval.AsUInt32() & ~v2).AsSingle();  //Clear sign bit
            var s = v & 0x8000_0000u;       //Extract sign bit
            vval = Vector128.Min(v5, vval); //Clamp magnitudes that overflow in Half to 65520, which later rounds to Infinity
            var w = Vector128.Equals(vval, vval).AsUInt32();   //Detecting NaN(a != a if a is NaN)
            var y = Vector128.Max(v0, vval.AsUInt32()); //Rectify lower exponent
            y &= v3;        //Extract exponent
            y += v4;        //Add exponent by 13
            var z = y - v1; //Subtract exponent from y by 112
            z &= w;         //Zero whole z if value is NaN
            vval += y.AsSingle();                       //Round Single into Half's precision(NaN also gets modified here, just setting the MSB of fraction)
            vval = (vval.AsUInt32() - v1).AsSingle();   //Subtract exponent by 112
            vval -= z.AsSingle();                       //Clear Extra leading 1 set in rounding
            v = vval.AsUInt32().GetElement(0) >> 13;    //Now internal representation is the absolute value represented in Half, shifted 13 bits left, with some exceptions like NaN having strange exponents
            s >>>= 16;                              //Match the position of sign bit
            var hc = ~w.GetElement(0) & 0x7C00u;    //Only exponent bits will be modified if NaN
            v &= 0x7fffu;       //Clear the upper unnecessary bits
            var gc = hc | s;    //Merge sign bit with possible NaN exponent
            v &= ~hc;           //Clear exponents if value is NaN
            v |= gc;            //Merge sign bit and possible NaN exponent
            return BitConverter.UInt16BitsToHalf((ushort)v);    //The final result
        }
    }
}
```

Test and benchmark code is available in [this repository](https://github.com/MineCake147E/BetterHalfConversion), along with several alternative approaches.  
The benchmark result is as follows:

``` ini

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.200-preview.22628.1
  [Host]     : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT
  DefaultJob : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT


```
|               Method |                        Categories | Frames | InputValue |     Mean |   Error |  StdDev | Ratio | RatioSD | Code Size |
|--------------------- |---------------------------------- |------- |----------- |---------:|--------:|--------:|------:|--------:|----------:|
|   SimpleLoopStandard |                   Simple,Standard |  65536 |   Permuted | 686.6 μs | 5.04 μs | 4.47 μs |  1.00 |    0.00 |     592 B |
|                      |                                   |        |            |          |         |         |       |         |           |
|      SimpleLoopNew4A |    Simple,New4,AggressiveInlining |  65536 |   Permuted | 327.3 μs | 1.78 μs | 1.58 μs |     ? |       ? |     370 B |
|                      |                                   |        |            |          |         |         |       |         |           |
|      SimpleLoopNew4U |   Simple,New4,InliningUnspecified |  65536 |   Permuted | 357.4 μs | 2.60 μs | 2.43 μs |     ? |       ? |     275 B |
|                      |                                   |        |            |          |         |         |       |         |           |
|      SimpleLoopNew4N |            Simple,New4,NoInlining |  65536 |   Permuted | 359.1 μs | 2.61 μs | 2.45 μs |     ? |       ? |     275 B |
|                      |                                   |        |            |          |         |         |       |         |           |
| UnrolledLoopStandard |                 Unrolled,Standard |  65536 |   Permuted | 676.7 μs | 5.00 μs | 4.68 μs |     ? |       ? |     699 B |
|                      |                                   |        |            |          |         |         |       |         |           |
|    UnrolledLoopNew4A |  Unrolled,New4,AggressiveInlining |  65536 |   Permuted | 301.1 μs | 2.68 μs | 2.38 μs |     ? |       ? |   1,088 B |
|                      |                                   |        |            |          |         |         |       |         |           |
|    UnrolledLoopNew4U | Unrolled,New4,InliningUnspecified |  65536 |   Permuted | 354.0 μs | 2.93 μs | 2.60 μs |     ? |       ? |     382 B |
|                      |                                   |        |            |          |         |         |       |         |           |
|    UnrolledLoopNew4N |          Unrolled,New4,NoInlining |  65536 |   Permuted | 355.0 μs | 3.19 μs | 2.83 μs |     ? |       ? |     382 B |


</details>

