This repository provides reference implementations of various attention mechanisms in CUDA, focusing on efficient GPU computation for deep learning models. It includes classic and modern attention variants, aiming to serve as a resource for benchmarking and understanding performance trade-offs in CUDA-based attention layers.
You can set up the environment using the Spack environment files provided in this repository. Alternatively, make sure the CUDA Toolkit is discoverable by CMake. There is no strict CUDA version requirement, but CUDA 12 or newer is recommended because some implementations depend on libraries such as cuBLAS and cuDNN.
```bash
# from the repository root
./scripts/build.sh
./build/cuda_attention
```
The aim is to implement the forward pass alone (GEMM ➔ softmax ➔ GEMM).
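For orientation, below is a minimal, unoptimized sketch of that pipeline for a single head: one naive kernel per stage (scores GEMM, row-wise softmax, output GEMM). The file name, shapes, and kernel structure are illustrative assumptions, not this repository's actual kernels.

```cuda
// attention_forward_sketch.cu -- illustrative single-head forward pass.
// Shapes and names here are assumptions, not this repository's API.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>

// S = Q K^T / sqrt(d).  Q, K: [seq, d] row-major; S: [seq, seq].
__global__ void scores_kernel(const float* Q, const float* K, float* S,
                              int seq, int d) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // query index
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // key index
    if (row >= seq || col >= seq) return;
    float acc = 0.0f;
    for (int k = 0; k < d; ++k) acc += Q[row * d + k] * K[col * d + k];
    S[row * seq + col] = acc * rsqrtf((float)d);
}

// Numerically stable row-wise softmax over S; one thread per row.
__global__ void softmax_kernel(float* S, int seq) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= seq) return;
    float* r = S + (size_t)row * seq;
    float m = -INFINITY;
    for (int j = 0; j < seq; ++j) m = fmaxf(m, r[j]);
    float sum = 0.0f;
    for (int j = 0; j < seq; ++j) { r[j] = __expf(r[j] - m); sum += r[j]; }
    for (int j = 0; j < seq; ++j) r[j] /= sum;
}

// O = S V.  S: [seq, seq]; V, O: [seq, d].
__global__ void output_kernel(const float* S, const float* V, float* O,
                              int seq, int d) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // query index
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // feature index
    if (row >= seq || col >= d) return;
    float acc = 0.0f;
    for (int j = 0; j < seq; ++j) acc += S[row * seq + j] * V[j * d + col];
    O[row * d + col] = acc;
}

int main() {
    int seq = 128, d = 64;  // hypothetical sizes
    float *Q, *K, *V, *S, *O;
    cudaMallocManaged(&Q, seq * d * sizeof(float));
    cudaMallocManaged(&K, seq * d * sizeof(float));
    cudaMallocManaged(&V, seq * d * sizeof(float));
    cudaMallocManaged(&S, seq * seq * sizeof(float));
    cudaMallocManaged(&O, seq * d * sizeof(float));
    for (int i = 0; i < seq * d; ++i) { Q[i] = 0.01f; K[i] = 0.01f; V[i] = 1.0f; }

    dim3 tile(16, 16);
    scores_kernel<<<dim3((seq + 15) / 16, (seq + 15) / 16), tile>>>(Q, K, S, seq, d);
    softmax_kernel<<<(seq + 127) / 128, 128>>>(S, seq);
    output_kernel<<<dim3((d + 15) / 16, (seq + 15) / 16), tile>>>(S, V, O, seq, d);
    cudaDeviceSynchronize();

    printf("O[0] = %f (expect 1.0 when V is all ones)\n", O[0]);
    cudaFree(Q); cudaFree(K); cudaFree(V); cudaFree(S); cudaFree(O);
    return 0;
}
```

The variants listed below differ mainly in how they restructure or approximate these three stages.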
- (Vanilla) Multi-Head Attention – parallel heads from the Transformer paper.
- Create a test kernel for batched/multi-headed GEMM using plain CUDA (see the plain-CUDA sketch after this list)
- Create a test kernel for batched/multi-headed GEMM using MMA tensor cores
- Create a baseline using cuBLAS/CUTLASS for batched/multi-headed GEMM on tensor cores (see the cuBLAS sketch after this list)
- Sparse / Local Attention – e.g., Longformer or Neighborhood Attention.
- Linformer / Linear Attention – low-rank or kernel-based approximations that reduce the quadratic cost of full attention.
- Performer – FAVOR+ kernel feature maps for linear-time softmax approximation.
- FlashAttention – memory-efficient exact attention via blockwise (online) softmax that avoids materializing the full attention matrix.
- FlashAttention-2 – improved tiling + parallelism for long sequences.
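For the plain-CUDA batched GEMM item above, one possible shape for such a test kernel is sketched below: a tiled shared-memory GEMM where each z-slice of the grid handles one (batch, head) pair. Tile size, shapes, and names are assumptions for illustration, not this repository's kernel.

```cuda
// gemm_batched_sketch.cu -- illustrative tiled, batched GEMM in plain CUDA.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int TILE = 16;

// C[b] = A[b] * B[b] for every (batch, head) slice b.
// A: [batches, M, K], B: [batches, K, N], C: [batches, M, N], row-major.
__global__ void batched_gemm(const float* A, const float* B, float* C,
                             int M, int N, int K) {
    int b   = blockIdx.z;                       // one (batch, head) pair per z-slice
    int row = blockIdx.y * TILE + threadIdx.y;  // output row in C[b]
    int col = blockIdx.x * TILE + threadIdx.x;  // output column in C[b]

    const float* Ab = A + (size_t)b * M * K;
    const float* Bb = B + (size_t)b * K * N;
    float*       Cb = C + (size_t)b * M * N;

    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    float acc = 0.0f;
    for (int t = 0; t < K; t += TILE) {
        // Stage one TILE x TILE block of A and B into shared memory.
        As[threadIdx.y][threadIdx.x] =
            (row < M && t + threadIdx.x < K) ? Ab[row * K + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (t + threadIdx.y < K && col < N) ? Bb[(t + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < M && col < N) Cb[row * N + col] = acc;
}

int main() {
    // Hypothetical attention-like shapes: batch*heads = 8*12, seq = 128, head_dim = 64.
    int batches = 8 * 12, M = 128, N = 128, K = 64;
    size_t szA = (size_t)batches * M * K, szB = (size_t)batches * K * N,
           szC = (size_t)batches * M * N;

    float *A, *B, *C;
    cudaMallocManaged(&A, szA * sizeof(float));
    cudaMallocManaged(&B, szB * sizeof(float));
    cudaMallocManaged(&C, szC * sizeof(float));
    for (size_t i = 0; i < szA; ++i) A[i] = 1.0f;
    for (size_t i = 0; i < szB; ++i) B[i] = 1.0f;

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE, batches);
    batched_gemm<<<grid, block>>>(A, B, C, M, N, K);
    cudaDeviceSynchronize();
    printf("C[0] = %f (expected %d for all-ones inputs)\n", C[0], K);

    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

Comparing a kernel like this against the cuBLAS baseline below gives a first data point for the performance trade-offs mentioned above.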
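For the cuBLAS baseline item, one common starting point is cublasGemmStridedBatchedEx with FP16 inputs and FP32 accumulation, which lets cuBLAS dispatch the batched GEMM to tensor cores on recent GPUs. The sizes and layout below are again illustrative assumptions (link with -lcublas).

```cuda
// cublas_batched_baseline.cu -- sketch of a strided-batched GEMM baseline.
#include <cublas_v2.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    // Hypothetical sizes: 96 (batch*head) slices of an M x K by K x N product.
    int batches = 96, M = 128, N = 128, K = 64;
    size_t szA = (size_t)batches * M * K, szB = (size_t)batches * K * N,
           szC = (size_t)batches * M * N;

    __half *A, *B, *C;
    cudaMalloc(&A, szA * sizeof(__half));
    cudaMalloc(&B, szB * sizeof(__half));
    cudaMalloc(&C, szC * sizeof(__half));
    // (fill A and B with real data in practice)

    cublasHandle_t handle;
    cublasCreate(&handle);

    // cuBLAS is column-major; each slice is an M x K (A) times K x N (B)
    // product into M x N (C), with dense per-slice strides.
    float alpha = 1.0f, beta = 0.0f;
    cublasGemmStridedBatchedEx(
        handle, CUBLAS_OP_N, CUBLAS_OP_N, M, N, K,
        &alpha,
        A, CUDA_R_16F, M, (long long)M * K,
        B, CUDA_R_16F, K, (long long)K * N,
        &beta,
        C, CUDA_R_16F, M, (long long)M * N,
        batches,
        CUBLAS_COMPUTE_32F,   // FP32 accumulation; FP16 inputs may use tensor cores
        CUBLAS_GEMM_DEFAULT);

    cudaDeviceSynchronize();
    printf("batched GEMM launched\n");

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```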