Description
Currently the conversion between Half and float is only implemented in software, leading to performance issues.
It would be ideal if Issue #62416 could be resolved, but a better software fallback is still needed for environments like Sandy Bridge, which does not support hardware conversion.
Configuration
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
Regression?
No
Data
I benchmarked the code below.
EDIT: Removed data biases.
EDIT2: Added random permutation.
Benchmark code for Half to Single conversion
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Channels;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
namespace HalfConversionBenchmarks
{
public enum InputValueType
{
Sequential,
Permuted,
RandomUniform,
RandomSubnormal,
RandomNormal,
RandomInfNaN
}
[CategoriesColumn]
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByParams, BenchmarkLogicalGroupRule.ByCategory)]
[SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
[DisassemblyDiagnoser(maxDepth: int.MaxValue)]
[AnyCategoriesFilter(CategoryStandard)]
public class HalfToSingleConversionBenchmarks
{
private const string CategorySimple = "Simple";
private const string CategoryStandard = "Standard";
private const string CategoryUnrolled = "Unrolled";
private Half[] bufferA;
private float[] bufferDst;
[Params(65536)]
public int Frames { get; set; }
[Params(InputValueType.Sequential, InputValueType.Permuted)]
public InputValueType InputValue { get; set; }
[GlobalSetup]
public void Setup()
{
var samples = Frames;
bufferDst = new float[samples];
var bA = bufferA = new Half[samples];
var spanA = bA.AsSpan();
switch (InputValue)
{
case InputValueType.Permuted:
FillSequential(spanA);
ref var x9 = ref MemoryMarshal.GetReference(spanA);
var length = spanA.Length;
var olen = length - 2;
for (var i = 0; i < olen; i++)
{
//Using RandomNumberGenerator in order to prevent predictability
var x = RandomNumberGenerator.GetInt32(i, length);
(Unsafe.Add(ref x9, x), Unsafe.Add(ref x9, i)) = (Unsafe.Add(ref x9, i), Unsafe.Add(ref x9, x));
}
break;
case InputValueType.RandomUniform:
RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(spanA));
break;
case InputValueType.RandomSubnormal:
for (var i = 0; i < spanA.Length; i++)
{
var r = (ushort)RandomNumberGenerator.GetInt32(0x7fe);
spanA[i] = BitConverter.UInt16BitsToHalf(ushort.RotateRight(r, 1));
}
break;
case InputValueType.RandomNormal:
for (var i = 0; i < spanA.Length; i++)
{
var r = (ushort)RandomNumberGenerator.GetInt32(0xF000);
spanA[i] = BitConverter.UInt16BitsToHalf((ushort)(ushort.RotateRight(r, 1) + 0x0400u));
}
break;
case InputValueType.RandomInfNaN:
RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(spanA));
for (var i = 0; i < spanA.Length; i++)
{
var r = BitConverter.HalfToUInt16Bits(spanA[i]);
spanA[i] = BitConverter.UInt16BitsToHalf((ushort)(r | 0x7c00u));
}
break;
default:
FillSequential(spanA);
break;
}
static void FillSequential(Span<Half> spanA)
{
for (var i = 0; i < spanA.Length; i++)
{
spanA[i] = BitConverter.UInt16BitsToHalf((ushort)i);
}
}
}
[BenchmarkCategory(CategorySimple, CategoryStandard)]
[Benchmark(Baseline = true)]
public void SimpleLoopStandard()
{
var bA = bufferA.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
}
}
#region Unrolled
[BenchmarkCategory(CategoryUnrolled, CategoryStandard)]
[Benchmark(Baseline = true)]
public void UnrolledLoopStandard()
{
var bA = bufferA.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
var olen = length - 3;
for (; i < olen; i += 4)
{
Unsafe.Add(ref rdi, i + 0) = (float)Unsafe.Add(ref rsi, i + 0);
Unsafe.Add(ref rdi, i + 1) = (float)Unsafe.Add(ref rsi, i + 1);
Unsafe.Add(ref rdi, i + 2) = (float)Unsafe.Add(ref rsi, i + 2);
Unsafe.Add(ref rdi, i + 3) = (float)Unsafe.Add(ref rsi, i + 3);
}
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = (float)Unsafe.Add(ref rsi, i);
}
}
#endregion
}
}
Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | Code Size |
---|---|---|---|---|---|---|---|---|
SimpleLoopStandard | Simple,Standard | 65536 | Sequential | 180.1 μs | 1.51 μs | 1.41 μs | 1.00 | 298 B |
UnrolledLoopStandard | Unrolled,Standard | 65536 | Sequential | 196.4 μs | 1.40 μs | 1.24 μs | 1.00 | 397 B |
SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 372.2 μs | 2.63 μs | 2.33 μs | 1.00 | 298 B |
UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 385.0 μs | 1.05 μs | 0.87 μs | 1.00 | 397 B |
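Sequential input converts about twice as fast as permuted input (180.1 μs versus 372.2 μs for the simple loop), so the sequential case seems to be accelerated in some way, most likely by branch prediction.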
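One way to see why sequential input could be friendly to a branch predictor is to count how often consecutive inputs change "class" (zero/subnormal, normal, Infinity/NaN). The sketch below assumes that a branchy converter branches on exactly these classes; `HalfClassTransitions` and its members are hypothetical names used only for illustration, and the code does not measure prediction itself.

```csharp
using System;

public static class HalfClassTransitions
{
    // Classifies a Half bit pattern by its 5 exponent bits:
    // 0 = zero or subnormal, 1 = normal, 2 = Infinity or NaN.
    public static int Classify(ushort bits)
    {
        int exponent = bits & 0x7C00;
        if (exponent == 0) return 0;
        if (exponent == 0x7C00) return 2;
        return 1;
    }

    // Counts how often two consecutive elements fall into different classes.
    public static int CountTransitions(ReadOnlySpan<ushort> input)
    {
        int transitions = 0;
        for (int i = 1; i < input.Length; i++)
        {
            if (Classify(input[i]) != Classify(input[i - 1])) transitions++;
        }
        return transitions;
    }

    // Same bit patterns as the Sequential benchmark input: 0, 1, 2, ..., 65535.
    public static ushort[] Sequential()
    {
        var data = new ushort[65536];
        for (int i = 0; i < data.Length; i++) data[i] = (ushort)i;
        return data;
    }
}
```

For the sequential pattern the class changes only 5 times across 65536 elements, so a class-dependent branch is almost always taken the same way as on the previous element. A random permutation of the same values changes class far more often.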
Benchmark code for Single to Half conversion
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics.X86;
using System.Security.Cryptography;
using System.Text;
using System.Threading.Tasks;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
namespace HalfConversionBenchmarks
{
[CategoriesColumn]
[SimpleJob(runtimeMoniker: RuntimeMoniker.HostProcess)]
[GroupBenchmarksBy(BenchmarkLogicalGroupRule.ByParams, BenchmarkLogicalGroupRule.ByCategory)]
[DisassemblyDiagnoser(maxDepth: int.MaxValue)]
[AnyCategoriesFilter(CategoryStandard)]
public class SingleToHalfConversionBenchmarks
{
[Params(65536)]
public int Frames { get; set; }
[ParamsAllValues]
public InputValueType InputValue { get; set; }
private const string CategorySimple = "Simple";
private const string CategoryUnrolled = "Unrolled";
private const string CategoryVectorized = "Vectorized";
private const string CategoryStandard = "Standard";
private float[] bufferSrc;
private Half[] bufferDst;
[GlobalSetup]
public void Setup()
{
var samples = Frames;
var vS = bufferSrc = new float[samples];
bufferDst = new Half[samples];
var vspan = vS.AsSpan();
switch (InputValue)
{
case InputValueType.Permuted:
FillSequential(vspan);
//Random Permutation
ref var x9 = ref MemoryMarshal.GetReference(vspan);
var length = vspan.Length;
var olen = length - 2;
for (var i = 0; i < olen; i++)
{
//Using RandomNumberGenerator in order to prevent predictability
var x = RandomNumberGenerator.GetInt32(i, length);
(Unsafe.Add(ref x9, x), Unsafe.Add(ref x9, i)) = (Unsafe.Add(ref x9, i), Unsafe.Add(ref x9, x));
}
break;
case InputValueType.RandomUniform:
RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(vspan));
break;
case InputValueType.RandomSubnormal:
for (var i = 0; i < vspan.Length; i++)
{
var r = (uint)RandomNumberGenerator.GetInt32(0x70FF_BFFE);
vspan[i] = BitConverter.UInt32BitsToSingle(uint.RotateRight(r, 1));
}
break;
case InputValueType.RandomNormal:
for (var i = 0; i < vspan.Length; i++)
{
var r = (uint)RandomNumberGenerator.GetInt32(0x1E00_1FFE);
vspan[i] = BitConverter.UInt32BitsToSingle(uint.RotateRight(r, 1) + 947904512u);
}
break;
case InputValueType.RandomInfNaN:
RandomNumberGenerator.Fill(MemoryMarshal.AsBytes(vspan));
for (var i = 0; i < vspan.Length; i++)
{
var r = BitConverter.SingleToUInt32Bits(vspan[i]);
vspan[i] = BitConverter.UInt32BitsToSingle(r | 0x7f80_0000u);
}
break;
default:
FillSequential(vspan);
break;
}
static void FillSequential(Span<float> vspan)
{
for (var i = 0; i < vspan.Length; i++)
{
vspan[i] = (float)BitConverter.UInt16BitsToHalf((ushort)i);
}
}
}
[BenchmarkCategory(CategorySimple, CategoryStandard)]
[Benchmark(Baseline = true)]
public void SimpleLoopStandard()
{
var bA = bufferSrc.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = (Half)Unsafe.Add(ref rsi, i);
}
}
#region Unrolled
[BenchmarkCategory(CategoryUnrolled, CategoryStandard)]
[Benchmark]
public void UnrolledLoopStandard()
{
var bA = bufferSrc.AsSpan();
var bD = bufferDst.AsSpan();
ref var rsi = ref MemoryMarshal.GetReference(bA);
ref var rdi = ref MemoryMarshal.GetReference(bD);
nint i = 0, length = Math.Min(bA.Length, bD.Length);
var olen = length - 3;
for (; i < olen; i += 4)
{
Unsafe.Add(ref rdi, i + 0) = (Half)Unsafe.Add(ref rsi, i + 0);
Unsafe.Add(ref rdi, i + 1) = (Half)Unsafe.Add(ref rsi, i + 1);
Unsafe.Add(ref rdi, i + 2) = (Half)Unsafe.Add(ref rsi, i + 2);
Unsafe.Add(ref rdi, i + 3) = (Half)Unsafe.Add(ref rsi, i + 3);
}
for (; i < length; i++)
{
Unsafe.Add(ref rdi, i) = (Half)Unsafe.Add(ref rsi, i);
}
}
#endregion
}
}
Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|---|---|
SimpleLoopStandard | Simple,Standard | 65536 | Sequential | 634.7 μs | 4.77 μs | 4.23 μs | 1.00 | 0.00 | 592 B |
UnrolledLoopStandard | Unrolled,Standard | 65536 | Sequential | 619.9 μs | 2.95 μs | 2.62 μs | ? | ? | 699 B |
SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 674.4 μs | 1.47 μs | 1.22 μs | 1.00 | 0.00 | 592 B |
UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 675.3 μs | 6.84 μs | 6.06 μs | ? | ? | 699 B |
Analysis
Converting Half to float
The current implementation relies on many branches, which is a likely source of inefficiency.
Removing the branches and handling subnormals with floating-point arithmetic improves performance on CPUs with fast FPUs.
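The floating-point trick for subnormals can be looked at in isolation. The sketch below is a minimal illustration, not the proposed implementation: it places the 10 fraction bits under the exponent of 2^-14 (which adds an implicit leading 1), then subtracts 2^-14 as a float to remove that extra 1. `SubnormalTrick` and `SubnormalHalfBitsToSingle` are hypothetical names, and the helper only handles positive zero and positive subnormal inputs.

```csharp
using System;

public static class SubnormalTrick
{
    // Converts the bits of a positive zero or positive subnormal Half to float
    // without inspecting the fraction. Illustration only.
    public static float SubnormalHalfBitsToSingle(ushort bits)
    {
        const uint SmallestNormal = 0x3880_0000u;      // 2^-14 as float bits
        uint fraction = (uint)(bits & 0x03FF) << 13;   // align Half fraction with float fraction
        // Interpreted as float, this is 2^-14 * (1 + f / 1024).
        float withImplicitOne = BitConverter.UInt32BitsToSingle(SmallestNormal + fraction);
        // Subtracting 2^-14 leaves f * 2^-24, the exact value of the subnormal.
        return withImplicitOne - BitConverter.UInt32BitsToSingle(SmallestNormal);
    }
}
```

The subtraction is exact because the result always fits in a float, so the FPU does the normalization that would otherwise need a leading-zero count or a loop.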
My proposal for new software fallback converting Half to float
EDIT: The previously proposed algorithm turned out to be slower with new input data.
The code below converts Half to float about twice as fast as the current implementation.
I've tested this code in a test project against all 65,536 possible Half values.
using System.Runtime.CompilerServices;
namespace BetterHalfToSingleConversion
{
public static class HalfUtils
{
[MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
public static float ConvertHalfToSingle2(Half value)
{
const uint ExponentLowerBound = 0x3880_0000u; //The smallest positive normal number in Half, converted to Single
const uint ExponentOffset = 0x3800_0000u; //BitConverter.SingleToUInt32Bits(1.0f) - ((uint)BitConverter.HalfToUInt16Bits((Half)1.0f) << 13)
const uint FloatSignMask = 0x8000_0000u; //Mask for sign bit in Single
var h = BitConverter.HalfToInt16Bits(value); //Extract the internal representation of value
var v = (uint)(int)h; //Copy sign bit to upper bits
var e = v & 0x7c00u; //Extract exponent bits of value
var c = e == 0u; //true when value is zero or subnormal (all exponent bits are 0)
var hc = (uint)-Unsafe.As<bool, byte>(ref c); //~0u when c is true, 0 otherwise
var b = e == 0x7c00u; //true when value is either Infinity or NaN
var hb = (uint)-Unsafe.As<bool, byte>(ref b); //~0u when b is true, 0 otherwise
var n = hc & ExponentLowerBound; //n is 0x3880_0000u if c is true, 0 otherwise
var j = ExponentOffset | n; //j is now 0x3880_0000u if value is subnormal, 0x3800_0000u otherwise
v <<= 13; //Match the position of the boundary of exponent bits and fraction bits with IEEE 754 Binary32(Single)
j += j & hb; //Double the j if value is either Infinity or NaN
var s = v & FloatSignMask; //Extract sign bit of value
v &= 0x0FFF_E000; //Extract exponent bits and fraction bits of value
v += j; //Adjust exponent to match the range of exponent
var k = BitConverter.SingleToUInt32Bits(BitConverter.UInt32BitsToSingle(v) - BitConverter.UInt32BitsToSingle(n)); //If value is subnormal, remove unnecessary 1 on top of fraction bits.
return BitConverter.UInt32BitsToSingle(k | s); //Merge sign bit with rest
}
}
}
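An exhaustive check over all 65536 bit patterns is cheap enough to run as a unit test. The following is a minimal sketch of such a check, not the test code from the repository; `HalfToSingleVerifier` and `CountMismatches` are hypothetical names, and it assumes that any NaN result is acceptable for a NaN input (payload bits are not compared).

```csharp
using System;

public static class HalfToSingleVerifier
{
    // Returns the number of Half bit patterns for which the candidate
    // disagrees with the built-in explicit conversion to float.
    public static int CountMismatches(Func<Half, float> candidate)
    {
        int mismatches = 0;
        for (int i = 0; i <= ushort.MaxValue; i++)
        {
            Half input = BitConverter.UInt16BitsToHalf((ushort)i);
            float expected = (float)input;
            float actual = candidate(input);
            if (float.IsNaN(expected))
            {
                // Assumption: NaN payloads are not compared, only NaN-ness.
                if (!float.IsNaN(actual)) mismatches++;
            }
            else if (BitConverter.SingleToUInt32Bits(expected) != BitConverter.SingleToUInt32Bits(actual))
            {
                // Bitwise comparison so that +0 and -0 are told apart.
                mismatches++;
            }
        }
        return mismatches;
    }
}
```

With the class above in scope, `HalfToSingleVerifier.CountMismatches(HalfUtils.ConvertHalfToSingle2)` would be expected to return 0 if the conversion agrees with the built-in one on every input.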
Test and benchmark code is available in this repository, along with several alternative approaches.
The result is:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.100
[Host] : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
DefaultJob : .NET 7.0.0 (7.0.22.51805), X64 RyuJIT
Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|---|---|
SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 389.1 μs | 3.36 μs | 2.98 μs | 1.00 | 0.00 | 298 B |
SimpleLoopNew2 | Simple,New2 | 65536 | Permuted | 169.8 μs | 1.10 μs | 1.03 μs | ? | ? | 223 B |
UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 388.2 μs | 2.54 μs | 2.37 μs | 1.00 | 0.00 | 397 B |
UnrolledLoopNew2 | Unrolled,New2 | 65536 | Permuted | 154.5 μs | 3.05 μs | 2.85 μs | ? | ? | 745 B |
Converting float to Half
The current implementation also has many branches, which likely makes it inefficient.
Again, removing the branches and handling subnormals with floating-point arithmetic improves performance on CPUs with fast FPUs.
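The core floating-point trick in this direction is to let the FPU do the rounding. Adding a power of two that is 2^13 times larger than the value's own power of two pushes the 13 excess fraction bits out of the float, so the hardware rounds them away (round to nearest, ties to even, under the default rounding mode). The sketch below is a simplified illustration under stated assumptions: `HalfRoundingTrick` and `RoundToHalfPrecision` are hypothetical names, and the helper only handles positive inputs within Half's normal range.

```csharp
using System;

public static class HalfRoundingTrick
{
    // Rounds a positive float in Half's normal range to Half's precision
    // (10 fraction bits) and returns the result as a float. Illustration only.
    public static float RoundToHalfPrecision(float value)
    {
        // Keep only the exponent bits: this is 2^e, the power of two at or below value.
        uint exponentBits = BitConverter.SingleToUInt32Bits(value) & 0x7F80_0000u;
        // Raising the exponent by 13 gives 2^(e+13); 0x0680_0000 is 13 << 23.
        float magic = BitConverter.UInt32BitsToSingle(exponentBits + 0x0680_0000u);
        // In the sum, one unit in the last place is 2^(e-10), so the addition
        // rounds value to 10 fraction bits.
        float sum = value + magic;
        // Removing the magic constant again is exact.
        return sum - magic;
    }
}
```

For example, 1 + 2^-11 lies exactly halfway between two neighbouring Half values and rounds down to 1.0, while 1 + 3 * 2^-11 rounds up to 1 + 2^-9. Within the stated range the result can be compared against `(float)(Half)value`.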
My proposal for new software fallback converting float to Half
The code below converts float to Half about twice as fast as the current implementation.
I've tested this code in a test project against all 4,294,967,296 possible float values.
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;
namespace BetterHalfToSingleConversion
{
public static class HalfUtils
{
//Among several approaches, I selected the fastest one (excluding vectorized ones).
[MethodImpl(MethodImplOptions.AggressiveInlining)]
public static Half ConvertSingleToHalf4(float value)
{
var v0 = Vector128.CreateScalarUnsafe(0x3880_0000u); //Minimum exponent for rounding
var v1 = Vector128.CreateScalarUnsafe(0x3800_0000u); //Exponent displacement #1
var v2 = Vector128.CreateScalarUnsafe(0x8000_0000u); //Sign bit
var v3 = Vector128.CreateScalarUnsafe(0x7f80_0000u); //Exponent mask
var v4 = Vector128.CreateScalarUnsafe(0x0680_0000u); //Exponent displacement #2
var v5 = Vector128.CreateScalarUnsafe(65520.0f); //Maximum value that is not Infinity in Half
var v = BitConverter.SingleToUInt32Bits(value);
var vval = Vector128.CreateScalarUnsafe(value);
vval = (vval.AsUInt32() & ~v2).AsSingle(); //Clear sign bit
var s = v & 0x8000_0000u; //Extract sign bit
vval = Vector128.Min(v5, vval); //Rectify values that are Infinity in Half
var w = Vector128.Equals(vval, vval).AsUInt32(); //Detecting NaN(a != a if a is NaN)
var y = Vector128.Max(v0, vval.AsUInt32()); //Rectify lower exponent
y &= v3; //Extract exponent
y += v4; //Add exponent by 13
var z = y - v1; //Subtract exponent from y by 112
z &= w; //Zero whole z if value is NaN
vval += y.AsSingle(); //Round Single into Half's precision(NaN also gets modified here, just setting the MSB of fraction)
vval = (vval.AsUInt32() - v1).AsSingle(); //Subtract exponent by 112
vval -= z.AsSingle(); //Clear Extra leading 1 set in rounding
v = vval.AsUInt32().GetElement(0) >> 13; //Now internal representation is the absolute value represented in Half, shifted 13 bits left, with some exceptions like NaN having strange exponents
s >>>= 16; //Match the position of sign bit
var hc = ~w.GetElement(0) & 0x7C00u; //Only exponent bits will be modified if NaN
v &= 0x7fffu; //Clear the upper unnecessary bits
var gc = hc | s; //Merge sign bit with possible NaN exponent
v &= ~hc; //Clear exponents if value is NaN
v |= gc; //Merge sign bit and possible NaN exponent
return BitConverter.UInt16BitsToHalf((ushort)v); //The final result
}
}
}
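The NaN handling in the code above relies on the fact that NaN is the only value that does not compare equal to itself, which turns a comparison into a bit mask that can replace a branch. The following sketch isolates that idea; `NaNMask`, `NotNaNMask` and `Select` are hypothetical names, and `Vector128.CreateScalar` is used instead of `CreateScalarUnsafe` so that the unused lanes are zeroed.

```csharp
using System;
using System.Runtime.Intrinsics;

public static class NaNMask
{
    // Returns 0xFFFF_FFFF for any non-NaN input (including Infinity) and 0 for NaN.
    public static uint NotNaNMask(float value)
    {
        Vector128<float> v = Vector128.CreateScalar(value);
        // Lane-wise v == v: all bits set where the lane is not NaN.
        return Vector128.Equals(v, v).AsUInt32().GetElement(0);
    }

    // Branch-free select: yields ifNumber for non-NaN input and ifNaN otherwise.
    public static uint Select(float value, uint ifNumber, uint ifNaN)
    {
        uint mask = NotNaNMask(value);
        return (ifNumber & mask) | (ifNaN & ~mask);
    }
}
```

A mask like this can be combined with AND and OR to choose between the regular result and the NaN exponent bits (0x7C00) without a conditional jump.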
Test and benchmark code is available in this repository, along with several alternative approaches.
The benchmark result is as follows:
BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19045
Intel Core i7-4790 CPU 3.60GHz (Haswell), 1 CPU, 8 logical and 4 physical cores
.NET SDK=7.0.200-preview.22628.1
[Host] : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT
DefaultJob : .NET 7.0.2 (7.0.222.60605), X64 RyuJIT
Method | Categories | Frames | InputValue | Mean | Error | StdDev | Ratio | RatioSD | Code Size |
---|---|---|---|---|---|---|---|---|---|
SimpleLoopStandard | Simple,Standard | 65536 | Permuted | 686.6 μs | 5.04 μs | 4.47 μs | 1.00 | 0.00 | 592 B |
SimpleLoopNew4A | Simple,New4,AggressiveInlining | 65536 | Permuted | 327.3 μs | 1.78 μs | 1.58 μs | ? | ? | 370 B |
SimpleLoopNew4U | Simple,New4,InliningUnspecified | 65536 | Permuted | 357.4 μs | 2.60 μs | 2.43 μs | ? | ? | 275 B |
SimpleLoopNew4N | Simple,New4,NoInlining | 65536 | Permuted | 359.1 μs | 2.61 μs | 2.45 μs | ? | ? | 275 B |
UnrolledLoopStandard | Unrolled,Standard | 65536 | Permuted | 676.7 μs | 5.00 μs | 4.68 μs | ? | ? | 699 B |
UnrolledLoopNew4A | Unrolled,New4,AggressiveInlining | 65536 | Permuted | 301.1 μs | 2.68 μs | 2.38 μs | ? | ? | 1,088 B |
UnrolledLoopNew4U | Unrolled,New4,InliningUnspecified | 65536 | Permuted | 354.0 μs | 2.93 μs | 2.60 μs | ? | ? | 382 B |
UnrolledLoopNew4N | Unrolled,New4,NoInlining | 65536 | Permuted | 355.0 μs | 3.19 μs | 2.83 μs | ? | ? | 382 B |