Set of CUDA kernel functions with various optimizations accompanied by simple C and PyTorch APIs as well as unit and performance tests with api usage examples.
kernels/: Main cuda kernels directory with kernel's source codekernelsAPIs/: Available kernels APIs for convenient usagecAPI/: C/C++ APIs for each kerneltorchAPI/: PyTorch C/C++ APIs for each kernelpython/: PyBindings for PyTorch C/C++ kernels APIs
tests/: Unit and performance tests for each kernel. Uses python extensioncpp_samples/: Sample use cases for each kernel using C/C++ APIpython_samples/: Sample use cases for each kernel using python extension
- Download and install the CUDA Toolkit for your platform.
- Download and install CMake.
- To be able to compile pyTorchApi and run test and python samples following packages are needed:
- pyTorch with cuda support (same version as your installed cuda toolkit)
- pybind11
- pytest
- scikit-image
- matplotlib
Note: If you are using cuda 12.6 feel free to install requirements, otherwise install pyTorch with needed cuda version support.
-
Clone the repository:
git clone https://github.com/arrzhev/cuda-kernels.git -
Navigate to the root of the cloned repository and create a build directory:
mkdir build && cd build -
Configure the project with CMake:
cmake ..If you want to build python samples, use the following flags:
- BUILD_PYTHON
- PYTHON_MODULE_INSTALL_DIR
Example:
cmake -DBUILD_PYTHON=ON -DPYTHON_MODULE_INSTALL_DIR="path/to/your/python/lib/site-packages/" .. -
Build the project:
cmake --build .
-
Test each kernel via C API:
Example of how to run VectorAdd kernel from C API on Linux
build/cpp_samples/vectorAdd -
Test kernels via pytest and PyTorch API:
pytest -s
Examples of achieved performance running kernels on NVIDIA Tesla T4 (Turing).
- Matmul shape 2048x2048x2048, all matrices have row-major layout and fp32 data type:
| Kernel | TFLOPs/s | Performance compared to PyTorch |
|---|---|---|
| 0: PyTorch | 3.57 |
100% |
| 1: Naive | 0.06 |
1.57% |
| 2: Coalescing | 0.47 |
16.13% |
| 3: Block tiles | 0.79 |
22.69% |
| 4: 1D Thread tiles | 2.05 |
62.30% |
| 5: 2D Thread tiles | 2.90 |
87.97% |
| 7: 2D Thread tiles vectorized | 3.39 |
94.72% |
| 8: 2D Thread tiles vectorized double buffer | 3.36 |
94.40% |
- Single optimized kernel performance on various matrices shapes, all matrices have row-major layout and fp32 data type:
| Id | M | N | K | My kernel TFLOPs/s | PyTorch TFLOPs/s | Performance compared to PyTorch |
|---|---|---|---|---|---|---|
| 1 | 128 | 128 | 128 | 0.13 |
0.20 |
62.99% |
| 2 | 256 | 256 | 256 | 0.96 |
1.11 |
87.68% |
| 3 | 512 | 512 | 512 | 2.19 |
3.81 |
57.57% |
| 4 | 1024 | 1024 | 1024 | 3.29 |
3.80 |
86.65% |
| 5 | 2048 | 2048 | 2048 | 3.38 |
3.65 |
92.51% |
| 6 | 512 | 4068 | 7168 | 3.18 |
4.30 |
73.97% |
| 7 | 512 | 7168 | 2304 | 3.44 |
4.33 |
79.53% |
| 8 | 512 | 512 | 1024 | 2.10 |
3.80 |
55.20% |
| 9 | 512 | 512 | 2048 | 2.66 |
3.95 |
67.34% |
| 10 | 512 | 512 | 4096 | 2.90 |
4.05 |
71.65% |
| 11 | 512 | 512 | 8192 | 2.98 |
3.90 |
76.41% |