
JIT: Recognize 'bt' bit test idiom #72986

Closed

Description

I've recently been developing a chess engine in C# (.NET 6), StockNemo. While analyzing its code, I found that RyuJIT was generating assembly far more complex than I expected. So I compared it with GCC's output for equivalent C++ (using the -O3 flag to ensure full optimization, which I take to be the rough equivalent of dotnet's Release configuration), and it turns out I was right.

Consider the following code:

readonly ulong Internal = 0x003;

bool GetSetBit(int i) => (Internal >> i & 1UL) == 1UL;

RyuJIT generates the following assembly for the method GetSetBit in release configuration:

       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       test     al, 1
       setne    al
       movzx    rax, al
       ret      

The equivalent code in C++ looks like this:

unsigned long long internal = 0x003;

bool get_set_bit(int i)
{
    return (internal >> i & 1ULL) == 1ULL;
}

GCC 12.1 (x86-64) generates the following assembly for the function get_set_bit with the -O3 flag:

        mov     rax, QWORD PTR internal[rip]
        bt      rax, rdi
        setc    al
        ret
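
For readers unfamiliar with bt, the following C++ sketch models what the register form of the instruction computes: it selects bit `index` of `value`, taking the index modulo 64, and places it in the carry flag, which setc then copies into al. The names `bit_test` and `internal_bits` are hypothetical and exist only for illustration; this models the instruction's result, not how any compiler recognizes the idiom.

```cpp
#include <cstdint>

// Hypothetical model of `bt r64, r64`: the register form takes the bit
// offset modulo 64 and copies the selected bit into the carry flag.
// The returned bool stands in for that carry flag.
constexpr bool bit_test(std::uint64_t value, unsigned index)
{
    return ((value >> (index & 63u)) & 1u) != 0;
}

// Same value as the `Internal` field in the issue's example.
constexpr std::uint64_t internal_bits = 0x003;

static_assert(bit_test(internal_bits, 0), "bit 0 of 0x003 is set");
static_assert(bit_test(internal_bits, 1), "bit 1 of 0x003 is set");
static_assert(!bit_test(internal_bits, 2), "bit 2 of 0x003 is clear");
```

Because the index is masked to six bits, `bit_test(value, 64)` reads bit 0. A shr with its count in cl masks the count the same way for 64-bit operands, which is why the shift-and-test sequence and the single bt are interchangeable here.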

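As one can see, the GCC-generated assembly is shorter: a single bt replaces the shift and test, and setc produces the result without a trailing movzx. There is a way to get RyuJIT to emit assembly that is the same as, or nearly as simple and fast as, what GCC emits for equivalent C++: arrange the method like so (its C++ counterpart follows it):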

bool GetSetBit(int i) 
{
    byte value = (byte)(Internal >> i & 1UL);
    return Unsafe.As<byte, bool>(ref value);
}

typedef int boolean;
#define true 1
#define false 0

boolean get_set_bit(int i)
{
    return internal >> i & 1ULL;
}

The generated assembly for this by RyuJIT is:

       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       and      eax, 1
       ret      

...and by GCC:

        mov     rax, QWORD PTR internal[rip]
        mov     ecx, edi
        shr     rax, cl
        and     eax, 1
        ret
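
As a sanity check that the rewrite changes only the generated code and not the result, here is a minimal C++ sketch comparing the two formulations for every bit index from 0 to 63. The function names are hypothetical, and the direct variant returns `bool` rather than the `boolean` typedef used above; the bit pattern is passed as a parameter instead of being read from a global.

```cpp
#include <cstdint>

// Variant from the first example: compare the masked bit against 1.
constexpr bool get_set_bit_compare(std::uint64_t bits, int i)
{
    return ((bits >> i) & 1ULL) == 1ULL;
}

// Variant from the rewritten example: return the masked bit directly.
constexpr bool get_set_bit_direct(std::uint64_t bits, int i)
{
    return (bits >> i) & 1ULL;
}

// Hypothetical helper: true when both variants agree for indices 0..63.
// Indices outside that range are skipped because shifting a 64-bit value
// by 64 or more is undefined behavior in C++.
inline bool variants_agree(std::uint64_t bits)
{
    for (int i = 0; i < 64; ++i) {
        if (get_set_bit_compare(bits, i) != get_set_bit_direct(bits, i)) {
            return false;
        }
    }
    return true;
}
```

Since the masked value can only be 0 or 1, the comparison against 1 is redundant, which is what lets a compiler drop it or fold the whole expression into a bit test.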

This is just one of many functions for which RyuJIT generates much more complicated assembly than GCC. When micro-optimization is necessary (and in chess engines it is), the generated assembly needs to be just as performant. That is not the case by default here; I had to restructure the code to get the same result. Often, due to missing language features, even that isn't possible.

I'm not trying to shame or undermine the work done on RyuJIT; I'm requesting better code understanding and generation. I love the C# language (which is why I chose to do the project in C# despite knowing C++), and I would like the generated code to be as fast as C++ (or, if possible, faster).


Metadata

Labels

- area-CodeGen-coreclr: CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
- good first issue: Issue should be easy to implement, good for first-time contributors
- help wanted: [up-for-grabs] Good issue for external contributors
- tenet-performance: Performance related issue
