Welcome to the Cute-Learning repository! This project collects example implementations built with CuTe, the layout and tensor abstraction library in NVIDIA CUTLASS for writing high-performance CUDA kernels.
This repository includes implementations for:
- GEMM (General Matrix Multiply)
- GEMV (General Matrix-Vector Multiply)
- Flash-Decoding
- Data Copy
- LDSM (ldmatrix instruction)
- Tensor Dequant
- TODO... (More features to come!)
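For readers new to the first two items in the list above, here is a minimal host-side sketch of what GEMM and GEMV compute. It is not code from this repository and does not use CuTe. The function names `reference_gemm` and `reference_gemv` are hypothetical, and row-major `float` storage is an assumption. A plain CPU reference like this is a common way to check the output of an optimized GPU kernel.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; names are not from this repository.

// Reference GEMM: C (M x N) = A (M x K) * B (K x N), all matrices row-major.
std::vector<float> reference_gemm(const std::vector<float>& A,
                                  const std::vector<float>& B,
                                  std::size_t M, std::size_t N, std::size_t K) {
    assert(A.size() == M * K && B.size() == K * N);
    std::vector<float> C(M * N, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t k = 0; k < K; ++k) {
            const float a = A[m * K + k];  // reuse A[m][k] across the whole row of B
            for (std::size_t n = 0; n < N; ++n) {
                C[m * N + n] += a * B[k * N + n];
            }
        }
    }
    return C;
}

// Reference GEMV: y (M) = A (M x K) * x (K), with A row-major.
std::vector<float> reference_gemv(const std::vector<float>& A,
                                  const std::vector<float>& x,
                                  std::size_t M, std::size_t K) {
    assert(A.size() == M * K && x.size() == K);
    std::vector<float> y(M, 0.0f);
    for (std::size_t m = 0; m < M; ++m) {
        for (std::size_t k = 0; k < K; ++k) {
            y[m] += A[m * K + k] * x[k];
        }
    }
    return y;
}
```

Comparing a GPU kernel's result against such a reference, element by element with a small tolerance for floating-point rounding, is one simple way to validate it.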
The GEMM implementation is optimized for performance. Below is a performance graph showcasing its efficiency:
Refer to the following blog:
We hope you find this repository useful for your learning and development needs. Contributions and feedback are welcome!