Adding f16 as Dtype #696
Conversation
FYI @opfromthestart @nkoppel: if you guys want to work on this, you can open PRs into this feature branch. |
* Begin working on f16 cuda kernels
* Prevent compilation for incompatible test features
I'm wondering if we should enforce f16/bf16 tests passing. There are particular tests where I just don't think f16/bf16 will have the accuracy required to pass, or we need a way to relax the tolerance even further. Here are the current failing tests for f16: |
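For context on the accuracy concern: an f16 value carries only 10 mantissa bits, so its machine epsilon is 2^-10 (about 9.8e-4) and the representable grid gets coarser as magnitudes grow, which is why exact-match or tight-tolerance tests tend to fail. A minimal host-side sketch, assuming a CUDA toolkit recent enough that the cuda_fp16.h conversions are host-callable, showing the round-trip error:

```cuda
#include <cmath>
#include <cstdio>
#include <cuda_fp16.h>

int main() {
    // Round-tripping floats through __half shows the precision loss that
    // makes tight f16 test tolerances unrealistic (epsilon is 2^-10).
    const float samples[] = {0.1f, 3.14159f, 1000.7f, 33000.0f};
    for (float x : samples) {
        float back = __half2float(__float2half(x));
        std::printf("%12.4f -> %12.4f (abs err %g)\n", x, back, std::fabs(x - back));
    }
    return 0;
}
```

Above 2048, neighbouring f16 values are already 2 apart, so sums and reductions over many elements can easily drift past a tolerance tuned for f32.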
I made a few more tests pass with some fixes: 254 passed; 103 failed. EDIT: Now up to 255 passed; 102 failed. |
On compute_cap 86 I'm only getting 8 failures |
They are likely failing because sum is failing. I'm running |
In that case I'll just implement the atomicAdd directly instead of fiddling around with atomicCAS. It does mean that compatibility code will leak into the min_to and max_to files, since they also use atomicCAS on shorts. |
Okay, this is where I'm at now:

```cuda
__device__ __half atomicAdd(__half* address, __half val) {
    size_t align = reinterpret_cast<size_t>(address) & 2;
    unsigned int *address_as_u32 = reinterpret_cast<unsigned int *>(reinterpret_cast<char *>(address) - align);
    unsigned int old = *address_as_u32;
    unsigned int assumed;
    do {
        assumed = old;
        __half sum16 = __ushort_as_half(align ? (old >> 16) : (old & 0xffff)) + val;
        unsigned int sum32 = (unsigned int) __half_as_ushort(sum16);
        old = align ? ((sum32 << 16) | (old & 0xffff)) : ((old & 0xffff0000) | sum32);
        old = atomicCAS(address_as_u32, assumed, old);
    } while (assumed != old);
    return __ushort_as_half(align ? (old >> 16) : (old & 0xffff));
}
```

375 passed, 18 failed. It seems like this doesn't handle inf properly, as some of the errors I'm still getting are:

```
---- tensor_ops::max_to::tests::test_max_axis_0_2d stdout ----
thread 'tensor_ops::max_to::tests::test_max_axis_0_2d' panicked at 'lhs != rhs | -inf != 3', src/tensor_ops/max_to/mod.rs:97:9

---- tensor_ops::max_to::tests::test_max_axis_1_2d stdout ----
thread 'tensor_ops::max_to::tests::test_max_axis_1_2d' panicked at 'lhs != rhs | -inf != 2', src/tensor_ops/max_to/mod.rs:112:9

---- tensor_ops::max_to::tests::test_max_negative_zero stdout ----
thread 'tensor_ops::max_to::tests::test_max_negative_zero' panicked at 'lhs != rhs | -inf != 0', src/tensor_ops/max_to/mod.rs:136:9

---- tensor_ops::min_to::tests::test_min_axis_0_2d stdout ----
thread 'tensor_ops::min_to::tests::test_min_axis_0_2d' panicked at 'lhs != rhs | inf != 1', src/tensor_ops/min_to/mod.rs:97:9

---- tensor_ops::min_to::tests::test_min_axis_1_2d stdout ----
thread 'tensor_ops::min_to::tests::test_min_axis_1_2d' panicked at 'lhs != rhs | inf != 1', src/tensor_ops/min_to/mod.rs:112:9

---- tensor_ops::min_to::tests::test_min_negative_zero stdout ----
thread 'tensor_ops::min_to::tests::test_min_negative_zero' panicked at 'lhs != rhs | inf != -0', src/tensor_ops/min_to/mod.rs:136:9
```
|
All of those failing tests are min and max, which use the probably-broken atomicCAS path. I'm almost done with my attempt. |
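For the min_to/max_to compatibility code, a CAS loop of the same shape as the atomicAdd above would be one way to do it. A hypothetical sketch of an f16 atomic max (the name `atomicMaxf16` and the float-space comparison are illustrative choices, not the code from the PR):

```cuda
#include <cuda_fp16.h>

// Hypothetical CAS-based atomic max for __half, reusing the alignment
// trick from the atomicAdd above: operate on the 32-bit word that
// contains the 16-bit slot, then splice the updated half back in.
__device__ __half atomicMaxf16(__half* address, __half val) {
    size_t align = reinterpret_cast<size_t>(address) & 2;
    unsigned int *address_as_u32 =
        reinterpret_cast<unsigned int *>(reinterpret_cast<char *>(address) - align);
    unsigned int old = *address_as_u32;
    unsigned int assumed;
    do {
        assumed = old;
        __half cur = __ushort_as_half(align ? (old >> 16) : (old & 0xffff));
        // Compare in float to avoid the compute capability >= 8.0 requirement of __hmax.
        __half next = (__half2float(val) > __half2float(cur)) ? val : cur;
        unsigned int next32 = (unsigned int) __half_as_ushort(next);
        unsigned int replacement = align ? ((next32 << 16) | (old & 0xffff))
                                         : ((old & 0xffff0000) | next32);
        old = atomicCAS(address_as_u32, assumed, replacement);
    } while (assumed != old);
    // Return the previous value in the slot, matching atomic* semantics.
    return __ushort_as_half(align ? (old >> 16) : (old & 0xffff));
}
```

Handling -0.0 and NaN the way the tests expect would still need explicit care, since a plain greater-than comparison does not distinguish -0.0 from 0.0 and never selects NaN.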
PR #742 is up |
@ViliamVadocz nice work, all the tests pass for me now! 🚀 (other than the ones I broke by reverting the optimizer kernels) |
Resolves #423