For self-learning purposes ~
- Implementation
  - PyTorch
  - CUDA
  - CuTe DSL
  - Triton
  - TileLang
- To-do kernels
  - Reduction
  - Prefix Sum
  - Top K Selection
  - K-Means Clustering
  - Elementwise
    - ???
  - GEMM
    - GEMM
    - SGEMM
  - Attention
    - flash-attention v1
    - flash-attention v2
    - flash-attention v3
    - flash-attention v4
    - Multi-Head Attention
  - Multi-Agent Simulation
  - LDPC
  - FFT
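To pin down what the listed kernels are expected to compute, a plain-Python CPU reference for a few of the to-do items can serve as ground truth when validating the GPU implementations (a sketch only; the function names here are illustrative, not from this repo):

```python
# CPU reference semantics for three of the listed kernels (illustrative only;
# GPU kernels would be checked against references like these on random inputs).
import heapq
from itertools import accumulate


def reduce_sum(xs):
    # Reduction: fold all elements into a single value (sum here).
    total = 0.0
    for x in xs:
        total += x
    return total


def prefix_sum(xs):
    # Prefix Sum (inclusive scan): out[i] = xs[0] + ... + xs[i].
    return list(accumulate(xs))


def top_k(xs, k):
    # Top K Selection: the k largest values, in descending order.
    return heapq.nlargest(k, xs)
```

A GPU kernel's output can then be compared elementwise against these references when benchmarking.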
- Done kernels
## environment setup
mamba create --name kernel_bench python=3.11
## cuda toolkit and dsl
# cuda
mamba install cuda-nvcc
# torch & triton
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
# Cute DSL
pip install nvidia-cutlass-dsl
## else
mamba install colorama
mamba install loguru plotly pandas click
## install LazyGPU
cd utils
pip install -e .
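After the install steps above, a quick import check can confirm the environment is usable (a hypothetical helper, not part of this repo; the default module names follow the packages installed above):

```python
# Report which of the expected packages import cleanly in the current
# environment (hypothetical helper; "cutlass" is the module shipped by
# nvidia-cutlass-dsl, an assumption based on the install steps above).
import importlib


def check_env(mods=("torch", "triton", "cutlass")):
    status = {}
    for m in mods:
        try:
            importlib.import_module(m)
            status[m] = True
        except ImportError:
            status[m] = False
    return status
```

Running `check_env()` inside the `kernel_bench` environment should report every module as importable.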
- NVIDIA GeForce RTX 4090
- NVIDIA A100-SXM4-40GB
- NVIDIA GeForce RTX 5080
The implementation of this benchmark has benefited from the following sources: