For self-learning purposes ~
- Implementation
  - PyTorch
  - CUDA
  - CuTe DSL
  - Triton
  - TileLang
- To-do kernels
  - Reduction
  - Prefix Sum
  - Top K Selection
  - K-Means Clustering
  - Elementwise
    - ???
  - GEMM
    - GEMM
    - SGEMM
  - Attention
    - flash-attention v1
    - flash-attention v2
    - flash-attention v3
    - flash-attention v4
    - Multi-Head Attention
  - Multi-Agent Simulation
  - LDPC
  - FFT
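To pin down what the listed kernels are expected to compute, a plain-Python CPU reference for a few of the to-do items can serve as ground truth when validating the GPU implementations (a sketch only; the function names here are illustrative, not from this repo):

```python
# CPU reference semantics for three of the listed kernels (illustrative only;
# GPU kernels would be checked against references like these on random inputs).
import heapq
from itertools import accumulate


def reduce_sum(xs):
    # Reduction: fold all elements into a single value (sum here).
    total = 0.0
    for x in xs:
        total += x
    return total


def prefix_sum(xs):
    # Prefix Sum (inclusive scan): out[i] = xs[0] + ... + xs[i].
    return list(accumulate(xs))


def top_k(xs, k):
    # Top K Selection: the k largest values, in descending order.
    return heapq.nlargest(k, xs)
```

A GPU kernel's output can then be compared elementwise against these references when benchmarking.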
- Done kernels
## environment setup
mamba create --name kernel_bench python=3.11
## cuda toolkit and dsl
# cuda
mamba install cuda-nvcc
# torch & triton
pip3 install torch torchvision --index-url https://download.pytorch.org/whl/cu130
# Cute DSL
pip install nvidia-cutlass-dsl
## else
mamba install colorama
mamba install loguru plotly pandas click
## install LazyGPU
cd utils
pip install -e .
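After the install steps above, a quick import check can confirm the environment is usable (a hypothetical helper, not part of this repo; the default module names follow the packages installed above):

```python
# Report which of the expected packages import cleanly in the current
# environment (hypothetical helper; "cutlass" is the module shipped by
# nvidia-cutlass-dsl, an assumption based on the install steps above).
import importlib


def check_env(mods=("torch", "triton", "cutlass")):
    status = {}
    for m in mods:
        try:
            importlib.import_module(m)
            status[m] = True
        except ImportError:
            status[m] = False
    return status
```

Running `check_env()` inside the `kernel_bench` environment should report every module as importable.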
- NVIDIA GeForce RTX 4090
- NVIDIA A100-SXM4-40GB
- NVIDIA GeForce RTX 5080
The implementation of this benchmark has benefited from the following sources: