Description
I've recently been developing a chess engine in C# (.NET 6), StockNemo. While analyzing its code, I found that RyuJIT was generating assembly far more complex than one would expect. So I decided to compare it with equivalent C++ compiled by GCC (with the -O3 flag to ensure proper optimization, which I imagine is the equivalent of dotnet's Release configuration), and it turns out I was right.
Consider the following code:
readonly ulong Internal = 0x003;
bool GetSetBit(int i) => (Internal >> i & 1UL) == 1UL;
RyuJIT generates the following assembly for the method GetSetBit in Release configuration:
mov rax, qword ptr [rdi+8]
mov ecx, esi
shr rax, cl
test al, 1
setne al
movzx rax, al
ret
The equivalent code in C++ looks like this:
unsigned long long internal = 0x003;
bool get_set_bit(int i)
{
    return (internal >> i & 1ULL) == 1ULL;
}
GCC 12.1 x86-64 generates the following assembly for the function get_set_bit with the -O3 flag:
mov rax, QWORD PTR internal[rip]
bt rax, rdi
setc al
ret
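As one can see, the GCC-generated assembly is shorter: four instructions instead of seven, using bt and setc in place of the shift, test, setne, and zero-extension sequence. There is a way to get RyuJIT to generate assembly that is the same as, or nearly as simple and fast as, what GCC produces for C++, and that's by arranging the method like so, with its C++ counterpart below: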
bool GetSetBit(int i)
{
    byte value = (byte)(Internal >> i & 1UL);
    return Unsafe.As<byte, bool>(ref value);
}
typedef int boolean;
#define true 1
#define false 0
boolean get_set_bit(int i)
{
    return internal >> i & 1ULL;
}
The generated assembly for this by RyuJIT is:
mov rax, qword ptr [rdi+8]
mov ecx, esi
shr rax, cl
and eax, 1
ret
...and by GCC:
mov rax, QWORD PTR internal[rip]
mov ecx, edi
shr rax, cl
and eax, 1
ret
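As a sanity check that the two formulations compute the same thing, here is a minimal C++ sketch that compares them for every valid shift count. The names internal_bits, get_set_bit_compare, get_set_bit_mask, and forms_agree are hypothetical and chosen only for this illustration; they are not from StockNemo.

```cpp
#include <cassert>   // only needed for the optional assert-based usage below
#include <cstdint>

// Hypothetical stand-in for the `internal` global used earlier (bits 0 and 1 set).
static const std::uint64_t internal_bits = 0x003;

// First form: shift, mask, then compare against 1.
bool get_set_bit_compare(int i)
{
    return ((internal_bits >> i) & 1ULL) == 1ULL;
}

// Second form: shift and mask only; the result is already 0 or 1.
int get_set_bit_mask(int i)
{
    return static_cast<int>((internal_bits >> i) & 1ULL);
}

// Returns true if both forms agree for every valid shift count (0..63).
bool forms_agree()
{
    for (int i = 0; i < 64; ++i) {
        if (get_set_bit_compare(i) != (get_set_bit_mask(i) != 0)) {
            return false;
        }
    }
    return true;
}
```

Calling assert(forms_agree()); from a test harness confirms that the rewrite changes only the shape of the code, not its result, which is why one would hope a compiler could pick the cheaper instruction sequence either way.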
This is just one of many functions for which RyuJIT generates much more complicated assembly than GCC does. When micro-optimization is necessary (in chess engines, it is), the generated assembly needs to be just as performant. That is not the case by default here; I had to restructure the code to get the same result. Many times, due to missing language features, this just isn't possible.
I'm not trying to shame or undermine the work done on RyuJIT; I'm requesting better code understanding and generation. I love the C# language (which is why I chose to do the project in C# despite knowing C++), and I wish for the code to be as fast as C++ (or, if possible, faster).