Scalar BFloat16 to float conversion has very suboptimal codegen on GCC #2699

@vekterli

Hi, and first of all thanks for a great library!

While benchmarking the dot product kernels included in Highway (contrib/dot/dot-inl.h), compiled with GCC 14.2 for x64 targets, I noticed some rather interesting performance characteristics of the BFloat16 overload.

Here's a graph showing input length (in bf16 elements) vs latency in nanoseconds:

[Image: graph of input length (bf16 elements) vs. latency (ns)]

The latency increases linearly (and sharply) until the length is at least that of a single AVX2/3 vector register.

Short inputs with fewer elements than the lane count of a full vector register are handled by a scalar fallback loop:

    // Won't be able to do a full vector load without padding => scalar loop.
    if (!kIsAtLeastOneVector && !kIsMultipleOfVector && !kIsPaddedToVector &&
        HWY_UNLIKELY(num_elements < N)) {
      float sum0 = 0.0f;  // Only 2x unroll to avoid excessive code size for..
      float sum1 = 0.0f;  // this unlikely(?) case.
      for (; i + 2 <= num_elements; i += 2) {
        sum0 += F32FromBF16(pa[i + 0]) * F32FromBF16(pb[i + 0]);
        sum1 += F32FromBF16(pa[i + 1]) * F32FromBF16(pb[i + 1]);
      }
      if (i < num_elements) {
        sum1 += F32FromBF16(pa[i]) * F32FromBF16(pb[i]);
      }
      return sum0 + sum1;
    }

There is nothing wrong or even seemingly suboptimal with this code. On GCC, however, the actual codegen of F32FromBF16 is not what one would expect (well, certainly not what I would expect).

Example:

float hwy_bf16_to_float(hwy::bfloat16_t x) noexcept {
    return hwy::F32FromBF16(x);
}

Code generated by GCC 14.2 (-std=c++23 -O3 -march=icelake-server):

hwy_bf16_to_float(hwy::bfloat16_t):
        sub     rsp, 8
        vmovd   xmm0, edi
        call    __extendbfsf2
        add     rsp, 8
        ret

Note the call __extendbfsf2 instruction. This is seemingly done to handle signalling NaNs at conversion time.
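To illustrate where a plain shift and an IEEE 754 style conversion can differ, here is a minimal sketch that works on raw bit patterns only. The helper names are hypothetical and not part of Highway or libgcc, and it assumes the library routine follows the usual IEEE 754 rule that converting a signalling NaN yields a quiet NaN. A shift alone passes the signalling NaN through unchanged:

```cpp
#include <cstdint>

// Hypothetical helpers for illustration; not part of Highway or libgcc.

// Widens raw bfloat16 bits to raw binary32 bits using only a shift.
constexpr uint32_t WidenBF16Bits(uint16_t bf16_bits) {
  return static_cast<uint32_t>(bf16_bits) << 16;
}

// True if the binary32 bit pattern encodes a signalling NaN:
// exponent all ones, mantissa nonzero, quiet bit (bit 22) clear.
constexpr bool IsSignalingNaNBits(uint32_t f32_bits) {
  const uint32_t exponent = (f32_bits >> 23) & 0xFFu;
  const uint32_t mantissa = f32_bits & 0x007FFFFFu;
  const bool quiet_bit = (f32_bits & 0x00400000u) != 0;
  return exponent == 0xFFu && mantissa != 0 && !quiet_bit;
}

// 0x7F81 is a bfloat16 signalling NaN (quiet bit clear, payload 1).
static_assert(WidenBF16Bits(0x7F81) == 0x7F810000u, "shift keeps payload");
static_assert(IsSignalingNaNBits(WidenBF16Bits(0x7F81)), "still signalling");
```

For every non-NaN input, the shifted bit pattern is the exact binary32 value, since bfloat16 is the upper half of a binary32.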

This means that all F32FromBF16 calls result in actual call instructions, massively slowing down the scalar code path.

hwy::bfloat16_t wraps __bf16 when available, and using it directly has the same result (modulo some calling convention details):

float my_bf16_to_float(__bf16 x) noexcept {
    return static_cast<float>(x);
}

my_bf16_to_float(std::bfloat16_t):
        sub     rsp, 8
        call    __extendbfsf2
        add     rsp, 8
        ret

Checking the GCC versions available on Godbolt, this seems to be the case for every version from 13.1 up to and including HEAD.

For comparison, here's what Clang 21.1 generates (same compiler settings as GCC):

hwy_bf16_to_float(hwy::bfloat16_t):
        shl     edi, 16
        vmovd   xmm0, edi
        ret

my_bf16_to_float(std::bfloat16_t):
        vpextrw eax, xmm0, 0
        shl     eax, 16
        vmovd   xmm0, eax
        ret

This is basically what you'd expect; zero-extension to 32 bits and a left shift by 16.
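For reference, the same zero-extend-and-shift operation can be sketched in portable C++ on the raw 16 bits. The function name is hypothetical, and it takes a uint16_t rather than hwy::bfloat16_t so that it compiles without __bf16 support:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper for illustration; takes the raw bits of a bfloat16.
inline float F32FromBF16Bits(uint16_t bf16_bits) {
  // Zero-extend to 32 bits, then move the 16 bits into the upper half.
  const uint32_t f32_bits = static_cast<uint32_t>(bf16_bits) << 16;
  // memcpy is the portable way to reinterpret the bits as a float.
  float result;
  std::memcpy(&result, &f32_bits, sizeof(result));
  return result;
}
```

For example, F32FromBF16Bits(0x3F80) yields 1.0f and F32FromBF16Bits(0xC000) yields -2.0f.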

Godbolt link for x64/arm64, including dot product kernel codegen. Note the long chain of __extendbfsf2 calls.

The ARM64 GCC output also has such library calls but interestingly the actual measured overhead from this is an order of magnitude lower than on x64...

F32FromBF16 is currently defined as follows:

HWY_API HWY_BF16_CONSTEXPR float F32FromBF16(bfloat16_t bf) {
#if HWY_HAVE_SCALAR_BF16_OPERATORS
  return static_cast<float>(bf);
#else
  return BitCastScalar<float>(static_cast<uint32_t>(
      static_cast<uint32_t>(BitCastScalar<uint16_t>(bf)) << 16));
#endif
}

Should Highway always default to explicitly bit-casting and shifting on GCC instead of going via static_cast<float>?

Note that GCC will use a left shift instead of a library call if compiled with -ffast-math, but this is not a flag I'm comfortable using in the general case!
