Adding f16 as Dtype #696
Conversation
FYI @opfromthestart @nkoppel: if you guys want to work on this, you can open PRs into this feature branch. |
* Begin working on f16 cuda kernels
* Prevent compilation for incompatible test features
I'm wondering if we should enforce f16/bf16 tests passing. There are particular tests where I just don't think f16/bf16 will have the accuracy required to pass, or we need a way to relax the tolerance even further. Here are the current failing tests for f16: |
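For context on the accuracy concern: an f16 value carries only 10 mantissa bits, so its machine epsilon is 2^-10 (about 9.8e-4) and the representable grid gets coarser as magnitudes grow, which is why exact-match or tight-tolerance tests tend to fail. A minimal host-side sketch, assuming a CUDA toolkit recent enough that the cuda_fp16.h conversions are host-callable, showing the round-trip error:

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_fp16.h>

int main() {
    // Round-tripping floats through __half shows the precision loss that
    // makes tight f16 test tolerances unrealistic (epsilon is 2^-10).
    const float samples[] = {0.1f, 3.14159f, 1000.7f, 33000.0f};
    for (float x : samples) {
        float back = __half2float(__float2half(x));
        std::printf("%12.4f -> %12.4f (abs err %g)\n", x, back, std::fabs(x - back));
    }
    return 0;
}
```

Above 2048, neighbouring f16 values are already 2 apart, so sums and reductions over many elements can easily drift past a tolerance tuned for f32.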
I made a few more tests pass with some fixes: 254 passed; 103 failed. EDIT: Now up to 255 passed; 102 failed. |
On compute_cap 86 I'm only getting 8 failures |
They are likely failing because sum is failing. I'm running |
In that case I'll just implement the atomicAdd directly instead of fiddling around with atomicCAS. It does mean that compatibility code will leak into the min_to and max_to files, since they also use atomicCAS on shorts. |
Okay, this is where I'm at now:

```cuda
__device__ __half atomicAdd(__half* address, __half val) {
    size_t align = reinterpret_cast<size_t>(address) & 2;
    unsigned int *address_as_u32 = reinterpret_cast<unsigned int *>(reinterpret_cast<char *>(address) - align);
    unsigned int old = *address_as_u32;
    unsigned int assumed;
    do {
        assumed = old;
        __half sum16 = __ushort_as_half(align ? (old >> 16) : (old & 0xffff)) + val;
        unsigned int sum32 = (unsigned int) __half_as_ushort(sum16);
        old = align ? ((sum32 << 16) | (old & 0xffff)) : ((old & 0xffff0000) | sum32);
        old = atomicCAS(address_as_u32, assumed, old);
    } while (assumed != old);
    return __ushort_as_half(align ? (old >> 16) : (old & 0xffff));
}
```

375 passed, 18 failed. It seems like this doesn't handle inf properly, as some of the errors I'm still getting are:

```
---- tensor_ops::max_to::tests::test_max_axis_0_2d stdout ----
thread 'tensor_ops::max_to::tests::test_max_axis_0_2d' panicked at 'lhs != rhs | -inf != 3', src/tensor_ops/max_to/mod.rs:97:9

---- tensor_ops::max_to::tests::test_max_axis_1_2d stdout ----
thread 'tensor_ops::max_to::tests::test_max_axis_1_2d' panicked at 'lhs != rhs | -inf != 2', src/tensor_ops/max_to/mod.rs:112:9

---- tensor_ops::max_to::tests::test_max_negative_zero stdout ----
thread 'tensor_ops::max_to::tests::test_max_negative_zero' panicked at 'lhs != rhs | -inf != 0', src/tensor_ops/max_to/mod.rs:136:9

---- tensor_ops::min_to::tests::test_min_axis_0_2d stdout ----
thread 'tensor_ops::min_to::tests::test_min_axis_0_2d' panicked at 'lhs != rhs | inf != 1', src/tensor_ops/min_to/mod.rs:97:9

---- tensor_ops::min_to::tests::test_min_axis_1_2d stdout ----
thread 'tensor_ops::min_to::tests::test_min_axis_1_2d' panicked at 'lhs != rhs | inf != 1', src/tensor_ops/min_to/mod.rs:112:9

---- tensor_ops::min_to::tests::test_min_negative_zero stdout ----
thread 'tensor_ops::min_to::tests::test_min_negative_zero' panicked at 'lhs != rhs | inf != -0', src/tensor_ops/min_to/mod.rs:136:9
```
|
All of those failing tests are min and max, which use the probably-broken atomicCAS path. I'm almost done with my attempt. |
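For the min_to/max_to compatibility code, a CAS loop of the same shape as the atomicAdd above would be one way to do it. A hypothetical sketch of an f16 atomic max (the name `atomicMaxf16` and the float-space comparison are illustrative choices, not the code from the PR):

```cuda
#include <cuda_fp16.h>

// Hypothetical CAS-based atomic max for __half, reusing the alignment
// trick from the atomicAdd above: operate on the 32-bit word that
// contains the 16-bit slot, then splice the updated half back in.
__device__ __half atomicMaxf16(__half* address, __half val) {
    size_t align = reinterpret_cast<size_t>(address) & 2;
    unsigned int *address_as_u32 =
        reinterpret_cast<unsigned int *>(reinterpret_cast<char *>(address) - align);
    unsigned int old = *address_as_u32;
    unsigned int assumed;
    do {
        assumed = old;
        __half cur = __ushort_as_half(align ? (old >> 16) : (old & 0xffff));
        // Compare in float to avoid the compute capability >= 8.0 requirement of __hmax.
        __half next = (__half2float(val) > __half2float(cur)) ? val : cur;
        unsigned int next32 = (unsigned int) __half_as_ushort(next);
        unsigned int replacement = align ? ((next32 << 16) | (old & 0xffff))
                                         : ((old & 0xffff0000) | next32);
        old = atomicCAS(address_as_u32, assumed, replacement);
    } while (assumed != old);
    // Return the previous value in the slot, matching atomic* semantics.
    return __ushort_as_half(align ? (old >> 16) : (old & 0xffff));
}
```

Handling -0.0 and NaN the way the tests expect would still need explicit care, since a plain greater-than comparison does not distinguish -0.0 from 0.0 and never selects NaN.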
PR #742 is up |
@ViliamVadocz nice work, all the tests pass for me now! 🚀 (other than the ones I broke by reverting the optimizer kernels) |
Resolves #423