f16 performance is abysmal, u8, u16, f32 and casting f16 to f32 performance is excellent #19550

@adworacz

Zig Version

0.12.0-dev.3518+d2be725e4

Steps to Reproduce and Observed Behavior

I'm working on an open source video processing library that supports video stored in various sample formats, specifically u8, u16, f16, and f32.

While writing code, I found a massive performance disparity between f16 and all of the other formats. What's even more interesting is that I get much better performance if I manually cast my f16 data to f32, perform the operation, and then manually cast the result back to f16.

I'd expect that the compiler would effectively do this conversion/casting for me.

The following code shows a good example of what I'm talking about:

```zig
//const T = u8;
//const T = u16;
//const T = f16;
//const T = f32;
//const T = @Vector(32, u8);
//const T = @Vector(16, u16);
const T = @Vector(16, f16);
//const T = @Vector(16, f32);

export fn clamp(c: T, a1: T, a2: T, a3: T, a4: T, a5: T, a6: T, a7: T, a8: T) T {
    const min = @min(a1, a2, a3, a4, a5, a6, a7, a8);
    const max = @max(a1, a2, a3, a4, a5, a6, a7, a8);

    return @max(min, @min(c, max));
}
```

The vector typed versions best demonstrate the problem.

Using Godbolt (with ReleaseFast), the same code compiles to the following number of instructions for each type:

  • Vector u8: 25
  • Vector u16: 25
  • Vector f16: 2304
  • Vector f32: 56

That's a massive difference in the number of instructions for f16, and it seems like the optimizer is really missing out on the ability to convert f16 to f32 once, run an operation, and then convert back.
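
For illustration, the manual workaround described above (widen to f32, operate, narrow back to f16) might be sketched as follows. The names `F16x16`, `F32x16`, `widen`, and `clampViaF32` are hypothetical and only used for this example, and the sketch assumes a Zig version in which `@floatCast` infers its result type and accepts vectors (0.11 or later). Because `@min` and `@max` only select one of their inputs, the final narrowing back to f16 should be lossless here.

```zig
const F16x16 = @Vector(16, f16);
const F32x16 = @Vector(16, f32);

// Hypothetical helper: widen every lane from f16 to f32.
// Every f16 value is exactly representable as an f32.
inline fn widen(v: F16x16) F32x16 {
    return @floatCast(v);
}

// Same clamp as above, but all comparisons happen in f32.
export fn clampViaF32(
    c: F16x16,
    a1: F16x16,
    a2: F16x16,
    a3: F16x16,
    a4: F16x16,
    a5: F16x16,
    a6: F16x16,
    a7: F16x16,
    a8: F16x16,
) F16x16 {
    const w1 = widen(a1);
    const w2 = widen(a2);
    const w3 = widen(a3);
    const w4 = widen(a4);
    const w5 = widen(a5);
    const w6 = widen(a6);
    const w7 = widen(a7);
    const w8 = widen(a8);

    const min = @min(w1, w2, w3, w4, w5, w6, w7, w8);
    const max = @max(w1, w2, w3, w4, w5, w6, w7, w8);
    const clamped = @max(min, @min(widen(c), max));

    // Narrow back to f16 once, at the very end.
    return @floatCast(clamped);
}
```

This is a sketch of the workaround only, not a fix for the underlying code generation. Whether it is faster than the direct f16 version depends on the target and its CPU features.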

Full Godbolt link: https://zig.godbolt.org/z/baP7P6s8o

I have lots of other examples of poor f16 behavior in different video processing algorithms; this is just a nice, succinct example.

As a point of reference, this same code processes 1920x1080 video at several hundred frames per second on a cheap laptop for u8, u16, and f32, but f16 drops that to 5-10 frames per second.

Expected Behavior

f16 performance should meet or exceed f32 performance.

Labels: enhancement, optimization