Hi, and first of all thanks for a great library!
While benchmarking the dot product kernels included in Highway (contrib/dot/dot-inl.h), compiled under GCC 14.2 for x64 targets, I noticed some rather interesting performance characteristics of the BFloat16 overload.
Here's a graph showing input length (in bf16 elements) vs latency in nanoseconds:
The latency increases linearly (and sharply) until the length reaches the lane count of a single AVX2/AVX3 vector register.
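For reference, the shape of that curve can be reproduced with a stripped-down harness along these lines (my own scaffolding, not the actual benchmark; it times a plain scalar bf16 dot product rather than Highway's full Dot kernel):

```c++
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

#include "hwy/base.h"  // hwy::bfloat16_t, F32FromBF16, BF16FromF32

int main() {
  for (size_t n = 1; n <= 64; ++n) {
    std::vector<hwy::bfloat16_t> a(n, hwy::BF16FromF32(1.0f));
    std::vector<hwy::bfloat16_t> b(n, hwy::BF16FromF32(2.0f));
    constexpr int kReps = 1'000'000;
    float sink = 0.0f;  // Accumulated so the loop is not optimized away.
    const auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < kReps; ++rep) {
      float sum = 0.0f;
      for (size_t i = 0; i < n; ++i) {
        sum += hwy::F32FromBF16(a[i]) * hwy::F32FromBF16(b[i]);
      }
      sink += sum;
    }
    const auto t1 = std::chrono::steady_clock::now();
    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    std::printf("n=%2zu: %6.1f ns/dot (sink=%g)\n", n, ns / kReps, sink);
  }
  return 0;
}
```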
Inputs with fewer elements than a full vector's lane count are deferred to a scalar fallback loop:
```c++
  // Won't be able to do a full vector load without padding => scalar loop.
  if (!kIsAtLeastOneVector && !kIsMultipleOfVector && !kIsPaddedToVector &&
      HWY_UNLIKELY(num_elements < N)) {
    float sum0 = 0.0f;  // Only 2x unroll to avoid excessive code size for..
    float sum1 = 0.0f;  // this unlikely(?) case.
    for (; i + 2 <= num_elements; i += 2) {
      sum0 += F32FromBF16(pa[i + 0]) * F32FromBF16(pb[i + 0]);
      sum1 += F32FromBF16(pa[i + 1]) * F32FromBF16(pb[i + 1]);
    }
    if (i < num_elements) {
      sum1 += F32FromBF16(pa[i]) * F32FromBF16(pb[i]);
    }
    return sum0 + sum1;
  }
```

There is nothing wrong or seemingly suboptimal about this code. But on GCC, the actual codegen of F32FromBF16 is not what one would expect (well, certainly not what I would expect).
Example:
```c++
float hwy_bf16_to_float(hwy::bfloat16_t x) noexcept {
  return hwy::F32FromBF16(x);
}
```

Code generated by GCC 14.2 (`-std=c++23 -O3 -march=icelake-server`):
```asm
hwy_bf16_to_float(hwy::bfloat16_t):
        sub     rsp, 8
        vmovd   xmm0, edi
        call    __extendbfsf2
        add     rsp, 8
        ret
```

Note the `call __extendbfsf2` instruction. This is seemingly done to handle signalling NaNs at conversion time.
This means that all F32FromBF16 calls result in actual call instructions, massively slowing down the scalar code path.
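For illustration, a plain bit-cast-and-shift conversion avoids the library call entirely. The sketch below is my own, not Highway code; unlike `__extendbfsf2`, it passes SNaN bit patterns through unquieted, which is presumably exactly why GCC declines to emit it by default:

```c++
#include <cstdint>
#include <cstring>

// Hypothetical workaround, not part of Highway: a bf16 value is the upper
// 16 bits of the corresponding binary32, so widening is just a left shift.
// Note: unlike __extendbfsf2, this forwards SNaNs without quieting them.
float bf16_bits_to_float(uint16_t bits) noexcept {
  const uint32_t wide = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &wide, sizeof(f));  // Well-defined bit cast.
  return f;
}
```

GCC compiles this to the expected shift and register move, with no libcall.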
hwy::bfloat16_t wraps __bf16 when available, and using it directly has the same result (modulo some calling convention details):
```c++
float my_bf16_to_float(__bf16 x) noexcept {
  return static_cast<float>(x);
}
```

```asm
my_bf16_to_float(std::bfloat16_t):
        sub     rsp, 8
        call    __extendbfsf2
        add     rsp, 8
        ret
```

From checking the available GCC versions on Godbolt, this seems to be the case on every version from 13.1 (inclusive) up to and including trunk.
For comparison, here's what Clang 21.1 generates (same compiler settings as GCC):
```asm
hwy_bf16_to_float(hwy::bfloat16_t):
        shl     edi, 16
        vmovd   xmm0, edi
        ret

my_bf16_to_float(std::bfloat16_t):
        vpextrw eax, xmm0, 0
        shl     eax, 16
        vmovd   xmm0, eax
        ret
```

This is basically what you'd expect: zero-extension to 32 bits and a left shift by 16.
Godbolt link for x64/arm64, including dot product kernel codegen. Note the long chain of __extendbfsf2 calls.
The ARM64 GCC output also contains such library calls, but interestingly the measured overhead there is an order of magnitude lower than on x64...
F32FromBF16 is currently defined as follows:

```c++
HWY_API HWY_BF16_CONSTEXPR float F32FromBF16(bfloat16_t bf) {
#if HWY_HAVE_SCALAR_BF16_OPERATORS
  return static_cast<float>(bf);
#else
  return BitCastScalar<float>(static_cast<uint32_t>(
      static_cast<uint32_t>(BitCastScalar<uint16_t>(bf)) << 16));
#endif
}
```

Should Highway always default to the explicit bit-cast-and-shift path on GCC instead of going via `static_cast<float>`?
Note that GCC will use a left shift instead of a library call when compiling with -ffast-math, but that is not a flag I'm comfortable using in the general case...!
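For concreteness, one possible shape of such a change (just a sketch on my part, untested; HWY_COMPILER_GCC_ACTUAL is Highway's existing "GCC proper, not Clang" macro):

```c++
// Sketch only (untested): take the shift-based path whenever the compiler
// is GCC proper, even if native __bf16 operators are available, trading
// strict SNaN conversion semantics for the move+shift codegen Clang emits.
HWY_API HWY_BF16_CONSTEXPR float F32FromBF16(bfloat16_t bf) {
#if HWY_HAVE_SCALAR_BF16_OPERATORS && !HWY_COMPILER_GCC_ACTUAL
  return static_cast<float>(bf);
#else
  return BitCastScalar<float>(static_cast<uint32_t>(
      static_cast<uint32_t>(BitCastScalar<uint16_t>(bf)) << 16));
#endif
}
```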