JIT: Recognize 'bt' bit test idiom

### Description

I've recently been developing a chess engine in C# (.NET 6), [StockNemo](https://github.com/TheBlackPlague/StockNemo). When I analyzed the generated code, I found that RyuJIT was emitting assembly far more complex than I expected. So I compared it with GCC's output for equivalent C++ (using the `-O3` flag to ensure proper optimization, which I imagine is the equivalent of dotnet's `Release` configuration), and it turns out I was right.

Consider the following code: 
```cs
readonly ulong Internal = 0x003;

bool GetSetBit(int i) => (Internal >> i & 1UL) == 1UL;
```

**RyuJIT** generates the following assembly for the method `GetSetBit` in release configuration:
```asm
       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       test     al, 1
       setne    al
       movzx    rax, al
       ret      
```

The equivalent code in C++ looks like this:
```cpp
unsigned long long internal = 0x003;

bool get_set_bit(int i)
{
    return (internal >> i & 1ULL) == 1ULL;
}
```

**GCC 12.1 x86-64** generates the following assembly for the function `get_set_bit` with the `-O3` flag:
```asm
        mov     rax, QWORD PTR internal[rip]
        bt      rax, rdi
        setc    al
        ret
```
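For reference, here is a minimal sketch of the bit-test idiom in two common spellings, with a check that they agree for every bit index. The function names (`test_bit_shift`, `test_bit_mask`, `forms_agree`) are hypothetical and only serve this illustration. The `& 63u` is an assumption added to mirror how C# masks the shift count for 64-bit operands and how the register form of `bt` takes the bit index modulo the operand size; in C++ an unmasked shift by 64 or more is undefined behavior. Whether a given compiler emits `bt` for either spelling is not guaranteed.

```cpp
#include <cstdint>

// Spelling used in this issue: shift the wanted bit down to position 0, then mask it.
inline bool test_bit_shift(std::uint64_t value, unsigned index)
{
    // Masking the index keeps the shift well-defined for any input (assumption, see above).
    return ((value >> (index & 63u)) & 1ULL) == 1ULL;
}

// Alternative spelling: build a one-bit mask and test the value against it.
inline bool test_bit_mask(std::uint64_t value, unsigned index)
{
    return (value & (1ULL << (index & 63u))) != 0;
}

// Hypothetical helper: true when both spellings agree for all 64 bit indices of `value`.
inline bool forms_agree(std::uint64_t value)
{
    for (unsigned i = 0; i < 64; ++i)
    {
        if (test_bit_shift(value, i) != test_bit_mask(value, i))
            return false;
    }
    return true;
}
```

With the value `0x003` from the example above, both functions report bits 0 and 1 as set and every other bit as clear. Since both spellings express the same test, either one is a reasonable candidate for an optimizer to lower to a single bit-test instruction.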

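As one can see, the GCC-generated assembly is simpler: a single `bt` replaces the shift-and-test sequence, and there is no trailing `movzx`. There is a way to get RyuJIT to generate assembly that matches what GCC produces for the same logic,
and that's by rearranging the method like so, with its C++ counterpart below: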
```cs
bool GetSetBit(int i) 
{
    byte value = (byte)(Internal >> i & 1UL);
    return Unsafe.As<byte, bool>(ref value);
}
```
```cpp
typedef int boolean;
#define true 1
#define false 0

boolean get_set_bit(int i)
{
    return internal >> i & 1ULL;
}
```

The generated assembly for this by RyuJIT is:
```asm
       mov      rax, qword ptr [rdi+8]
       mov      ecx, esi
       shr      rax, cl
       and      eax, 1
       ret      
```
...and by GCC:
```asm
        mov     rax, QWORD PTR internal[rip]
        mov     ecx, edi
        shr     rax, cl
        and     eax, 1
        ret
```

This is just one of many functions for which RyuJIT generates considerably more complicated assembly than GCC does. When micro-optimization is necessary (in chess engines, it is), the generated assembly needs to be just as performant. That is not the case by default here; I had to restructure the code to get the same result. Many times, due to missing language features, that just isn't possible.

I'm not trying to shame or undermine the work done on RyuJIT; I'm requesting better code understanding and generation. I love the C# language (which is why I chose to do the project in C# despite knowing C++), and I wish for the code to be as fast as C++ (or, if possible, faster).

JIT: Recognize 'bt' bit test idiom #72986

TheBlackPlague opened on Jul 28, 2022
