Hi, and first of all thanks for a great library!
While benchmarking the dot product kernels included in Highway (contrib/dot/dot-inl.h), compiled under GCC 14.2 for x64 targets, I noticed some rather interesting performance characteristics of the BFloat16 overload.
Here's a graph showing input length (in bf16 elements) vs latency in nanoseconds:
The latency increases linearly (and sharply) until the length reaches the lane count of a single AVX2/AVX3 vector register.
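For reference, the shape of that curve can be reproduced with a stripped-down harness along these lines (my own scaffolding, not the actual benchmark; it times a plain scalar bf16 dot product rather than Highway's full Dot kernel):

```c++
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

#include "hwy/base.h"  // hwy::bfloat16_t, F32FromBF16, BF16FromF32

int main() {
  for (size_t n = 1; n <= 64; ++n) {
    std::vector<hwy::bfloat16_t> a(n, hwy::BF16FromF32(1.0f));
    std::vector<hwy::bfloat16_t> b(n, hwy::BF16FromF32(2.0f));
    constexpr int kReps = 1'000'000;
    float sink = 0.0f;  // Accumulated so the loop is not optimized away.
    const auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < kReps; ++rep) {
      float sum = 0.0f;
      for (size_t i = 0; i < n; ++i) {
        sum += hwy::F32FromBF16(a[i]) * hwy::F32FromBF16(b[i]);
      }
      sink += sum;
    }
    const auto t1 = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("n=%2zu: %6.1f ns/dot (sink=%g)\n", n, ns / kReps, sink);
  }
  return 0;
}
```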
Inputs with fewer elements than a full vector's lane count are deferred to a scalar fallback loop:
```c++
  // Won't be able to do a full vector load without padding => scalar loop.
  if (!kIsAtLeastOneVector && !kIsMultipleOfVector && !kIsPaddedToVector &&
      HWY_UNLIKELY(num_elements < N)) {
    float sum0 = 0.0f;  // Only 2x unroll to avoid excessive code size for..
    float sum1 = 0.0f;  // this unlikely(?) case.
    for (; i + 2 <= num_elements; i += 2) {
      sum0 += F32FromBF16(pa[i + 0]) * F32FromBF16(pb[i + 0]);
      sum1 += F32FromBF16(pa[i + 1]) * F32FromBF16(pb[i + 1]);
    }
    if (i < num_elements) {
      sum1 += F32FromBF16(pa[i]) * F32FromBF16(pb[i]);
    }
    return sum0 + sum1;
  }
```

There is nothing wrong or seemingly suboptimal about this code. But on GCC, the actual codegen of F32FromBF16 is not what one would expect (well, certainly not what I would expect).
Example:
```c++
float hwy_bf16_to_float(hwy::bfloat16_t x) noexcept {
  return hwy::F32FromBF16(x);
}
```

Code generated by GCC 14.2 (`-std=c++23 -O3 -march=icelake-server`):
```asm
hwy_bf16_to_float(hwy::bfloat16_t):
        sub     rsp, 8
        vmovd   xmm0, edi
        call    __extendbfsf2
        add     rsp, 8
        ret
```

Note the `call __extendbfsf2` instruction. This is seemingly done to handle signalling NaNs at conversion time.
This means that all F32FromBF16 calls result in actual call instructions, massively slowing down the scalar code path.
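For illustration, a plain bit-cast-and-shift conversion avoids the library call entirely. The sketch below is my own, not Highway code; unlike `__extendbfsf2`, it passes SNaN bit patterns through unquieted, which is presumably exactly why GCC declines to emit it by default:

```c++
#include <cstdint>
#include <cstring>

// Hypothetical workaround, not part of Highway: a bf16 value is the upper
// 16 bits of the corresponding binary32, so widening is just a left shift.
// Note: unlike __extendbfsf2, this forwards SNaNs without quieting them.
float bf16_bits_to_float(uint16_t bits) noexcept {
  const uint32_t wide = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &wide, sizeof(f));  // Well-defined bit cast.
  return f;
}
```

GCC compiles this to the expected shift and register move, with no libcall.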
hwy::bfloat16_t wraps __bf16 when available, and using it directly has the same result (modulo some calling convention details):
```c++
float my_bf16_to_float(__bf16 x) noexcept {
  return static_cast<float>(x);
}
```

```asm
my_bf16_to_float(std::bfloat16_t):
        sub     rsp, 8
        call    __extendbfsf2
        add     rsp, 8
        ret
```

From checking the available GCC versions on Godbolt, this seems to be the case on every version from 13.1 (inclusive) up to and including trunk.
For comparison, here's what Clang 21.1 generates (same compiler settings as GCC):
```asm
hwy_bf16_to_float(hwy::bfloat16_t):
        shl     edi, 16
        vmovd   xmm0, edi
        ret

my_bf16_to_float(std::bfloat16_t):
        vpextrw eax, xmm0, 0
        shl     eax, 16
        vmovd   xmm0, eax
        ret
```

This is basically what you'd expect: zero-extension to 32 bits and a left shift by 16.
Godbolt link for x64/arm64, including dot product kernel codegen. Note the long chain of __extendbfsf2 calls.
The ARM64 GCC output also contains such library calls, but interestingly the measured overhead there is an order of magnitude lower than on x64...
F32FromBF16 is currently defined as follows:

```c++
HWY_API HWY_BF16_CONSTEXPR float F32FromBF16(bfloat16_t bf) {
#if HWY_HAVE_SCALAR_BF16_OPERATORS
  return static_cast<float>(bf);
#else
  return BitCastScalar<float>(static_cast<uint32_t>(
      static_cast<uint32_t>(BitCastScalar<uint16_t>(bf)) << 16));
#endif
}
```

Should Highway always default to the explicit bit-cast-and-shift path on GCC instead of going via `static_cast<float>`?
Note that GCC will use a left shift instead of a library call when compiling with -ffast-math, but that is not a flag I'm comfortable using in the general case...!
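For concreteness, one possible shape of such a change (just a sketch on my part, untested; HWY_COMPILER_GCC_ACTUAL is Highway's existing "GCC proper, not Clang" macro):

```c++
// Sketch only (untested): take the shift-based path whenever the compiler
// is GCC proper, even if native __bf16 operators are available, trading
// strict SNaN conversion semantics for the move+shift codegen Clang emits.
HWY_API HWY_BF16_CONSTEXPR float F32FromBF16(bfloat16_t bf) {
#if HWY_HAVE_SCALAR_BF16_OPERATORS && !HWY_COMPILER_GCC_ACTUAL
  return static_cast<float>(bf);
#else
  return BitCastScalar<float>(static_cast<uint32_t>(
      static_cast<uint32_t>(BitCastScalar<uint16_t>(bf)) << 16));
#endif
}
```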