
Work on compatibility #735

Merged · 4 commits · Apr 25, 2023

Conversation

ViliamVadocz
Contributor

  • A few unrelated fixes to optimizers that I needed to compile and run the code.
  • Compatibility header with a few functions:
    • Pay special attention to atomicCAS, since it's quite unsafe and complicated. It's almost 3 AM, so I worry I might have messed it up.
    • I implemented __hmax_nan and __hmin_nan so that we can have the nice NaN propagation even on older GPUs.
    • Someone should check where the compute-capability boundaries for these functions actually lie; for now I just guessed 700.
  • I added an example Windows include path so that you have some idea of what it looks like for me.

I still get runtime errors like this:

---- tensor_ops::reshape_to::tests::test_1d_reshape_non_contiguous stdout ----
thread 'tensor_ops::reshape_to::tests::test_1d_reshape_non_contiguous' panicked at 'called `Result::unwrap()` on an `Err` value: CompileError { nvrtc: NvrtcError(NVRTC_ERROR_COMPILATION), options: ["--include-path=/usr/include", "--include-path=C:\\Program Files\\NVIDIA GPU Computing Toolkit\\CUDA\\v12.0\\include", "--gpu-architecture=sm_61"], log: "default_program(46): error: no instance of overloaded function \"atomicAdd\" matches the argument list\n            argument types are: (__half *, const __half)\n\n1 error detected in the compilation of \"default_program\".\n" }', src\tensor_ops\reshape_to\cuda_kernel.rs:79:56

which are very strange.


To be honest, I think this PR breaks more things than it fixes. 🥴
For reference, running cargo test --features cuda,test-f16 I get: 245 passed; 112 failed.

// We assume that atomicCAS for shorts has the same compute capability restrictions.
#if __CUDA_ARCH__ < 700

// Inspired by https://github.com/torch/cutorch/blob/master/lib/THC/THCAtomics.cuh#L96-L119
Owner

It seems like this is doing atomicAdd with unsigned int instead of unsigned short. Should we be doing that as well? Then we could basically just copy the function from the link without having to redefine atomicCAS.

Contributor (Author)

The reason I chose to create atomicCAS instead is that the implementations of atomicMin and atomicMax use it. We could also rewrite those two to work like this instead if we want, but that would mean putting compatibility code inside max_to.cu and min_to.cu, or moving those functions here.

@coreylowman (Owner)

The runtime errors you are getting are from the JIT-compiled kernels using nvrtc. They don't have tensor_ops/utilities as an include directory (I'm not sure that path even exists alongside compiled binaries), so you can't use anything from cuda_utils.cuh or your new compatibility header.

@coreylowman (Owner) left a comment
Going to merge this for now; feel free to make changes and open another PR.

@coreylowman coreylowman merged commit 3fe083d into coreylowman:f16 Apr 25, 2023