Consider adding a LOCK CMPXCHG16B intrinsic method #28711
Description
The CMPXCHG16B instruction is required to perform a CAS or an atomic read of a 128-bit value in memory. Currently, .NET supports atomic 64-bit reads and CAS through Interlocked.CompareExchange and Interlocked.Read; however, the same operations are not supported for 128-bit values.
I believe the main problem is that the required instruction (CMPXCHG16B) is not supported on all CPUs; for example, some very old AMD CPUs lack it. However, it is a requirement for running Windows 8.1 and 10, so I believe the number of CPUs where this instruction is not supported is very small.
Because of this limitation, I believe the best way to support it is through an intrinsic, so that the user can check whether the instruction is supported on the current CPU, much like the IsSupported properties exposed on the other ISA classes. The API would be something like this:
namespace System.Runtime.Intrinsics.X86
{
    public static class Cx16
    {
        // Cx16 flag check using the CPUID instruction (cached).
        public static bool IsSupported { get; }

        // Returns the old value at destination.
        public static Int128 InterlockedCompareExchange16Bytes(Int128* destination, Int128 value, Int128 comparand) { throw new PlatformNotSupportedException(); }

        // Returns true if the store was successful (*destination == comparand), and false otherwise.
        public static bool InterlockedCompareExchange16BytesEqual(Int128* destination, Int128 value, Int128 comparand) { throw new PlatformNotSupportedException(); }
    }
}
It uses an Int128 type that is not yet available, but AFAIK work is being done to add it (dotnet/corefxlab#2635).
Another alternative is passing the value as two 64-bit values (the low and high parts of the 128-bit value). After all, the instruction itself works with pairs of 64-bit registers. I believe the main problem with this solution is returning the 128-bit value.
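To make the two-halves alternative more concrete, here is a rough sketch of what such a signature could look like. Everything in it is hypothetical: the class name Cx16HalvesSketch, the method name and the parameter names are not part of the proposal. It sidesteps the return-value problem by writing the old value back through the comparand parameters, which is similar to how the MSVC _InterlockedCompareExchange128 intrinsic uses its ComparandResult argument. The body is emulated with a lock purely so the sketch runs; a real intrinsic would emit LOCK CMPXCHG16B instead. It needs to be compiled with unsafe code enabled.

```csharp
using System;

public static class Cx16HalvesSketch
{
    // Stand-in for the hardware lock; only makes this emulation atomic
    // with respect to other callers of this same method.
    private static readonly object s_lock = new object();

    // Hypothetical signature: destination points at two consecutive
    // 64-bit words (low word first). On return, comparandLow/comparandHigh
    // hold the value that was at destination before the call.
    public static unsafe bool CompareExchange(
        ulong* destination,
        ulong valueLow, ulong valueHigh,
        ref ulong comparandLow, ref ulong comparandHigh)
    {
        lock (s_lock)
        {
            ulong oldLow = destination[0];
            ulong oldHigh = destination[1];
            bool equal = oldLow == comparandLow && oldHigh == comparandHigh;
            if (equal)
            {
                destination[0] = valueLow;
                destination[1] = valueHigh;
            }
            // Report the previous contents through the comparand parameters.
            comparandLow = oldLow;
            comparandHigh = oldHigh;
            return equal;
        }
    }
}
```

The downside of this shape is that the caller has to keep two locals per 128-bit value and cannot use the result directly in an expression, which is why the Int128 version reads better.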
The CMPXCHG16B instruction sets the zero flag if the value at destination and the comparand are equal, and clears it otherwise. So I included a method that returns bool (basically, it would just return the ZF value), since it should have better codegen for the case where the user only wants to know whether the two values were equal and the store succeeded. In some cases, getting the value that is currently at destination is necessary (for example, when the user just wants to do an atomic 128-bit read), so in those cases the method returning an Int128 can be used (an example is provided below, with the AtomicRead128 method). The method returning a bool can be replaced with the one returning an Int128 by comparing the returned value with the comparand value; this has slightly worse codegen, but the same end result.
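The relationship between the two forms can be shown with the existing 64-bit API, used here only as a stand-in for the proposed 128-bit methods. The class name Cas64Sketch and the helper CompareExchangeEqual are hypothetical; Interlocked.CompareExchange is the real API.

```csharp
using System;
using System.Threading;

public static class Cas64Sketch
{
    // Bool-returning form built on top of the value-returning form:
    // the store happened exactly when the old value equals the comparand.
    public static bool CompareExchangeEqual(ref long destination, long value, long comparand)
    {
        long old = Interlocked.CompareExchange(ref destination, value, comparand);
        return old == comparand;
    }
}
```

With a dedicated bool-returning intrinsic, the JIT could presumably branch on ZF directly instead of doing the extra comparison after the instruction.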
It's also worth noting that this instruction has an alignment requirement: the address must be 16-byte aligned. I believe the LoadAligned SSE intrinsic method had a similar problem, so perhaps this can be handled in a similar way?
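For reference, this is roughly how a caller could check or obtain a suitably aligned address today. It is only a sketch: the class AlignmentSketch and its method names are made up for illustration, NativeMemory.AlignedAlloc and NativeMemory.AlignedFree are the existing .NET 6+ APIs, and the code must be compiled with unsafe code enabled.

```csharp
using System;
using System.Runtime.InteropServices;

public static unsafe class AlignmentSketch
{
    // True when the address is a multiple of 16, as CMPXCHG16B requires.
    public static bool IsAligned16(void* address) => ((nuint)address & 15) == 0;

    // Allocates one zeroed 16-byte slot on a 16-byte boundary.
    // The caller must release it with NativeMemory.AlignedFree.
    public static void* AllocateAligned16()
    {
        void* p = NativeMemory.AlignedAlloc(16, 16);
        new Span<byte>(p, 16).Clear();
        return p;
    }
}
```

A debug-only check like IsAligned16 at the call site is one option; another is to leave validation to the hardware, which raises a fault on a misaligned operand.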
Example usage, an atomic 128-bit increment, just for illustration purposes:
public static Int128 AtomicIncrement128(Int128* destination)
{
    Int128 oldValue, newValue;
    do
    {
        oldValue = AtomicRead128(destination);
        newValue = oldValue + 1;
    }
    // Retry if another thread changed the value between the read and the CAS.
    while (!InterlockedCompareExchange16BytesEqual(destination, newValue, oldValue));
    return oldValue;
}

private static Int128 AtomicRead128(Int128* source)
{
    // Note: Will cause an access violation for read-only mapped regions,
    // because CMPXCHG16B always performs a write, even if the store "fails".
    return InterlockedCompareExchange16Bytes(source, Int128.Zero, Int128.Zero);
}
It may be worth noting (in case an implementation on Interlocked is desired) that it's also possible to implement this on ARM64, by using LDAXP/CMP/STLXP instruction sequences with two 64-bit registers.