Algorithm Overview

Standard Attention

Transformer attention computes:

Attention(Q, K, V) = softmax(QKᵀ / √d) V

Here d is the per-head dimension of the query and key vectors. For a sequence of length N, standard attention builds an N × N score matrix for each head. Memory use therefore grows quadratically with sequence length, which becomes expensive for long sequences.
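To make the quadratic cost concrete, here is a minimal NumPy sketch of standard attention for a single head, plus a helper that sizes the score matrix. The names `reference_attention` and `score_matrix_bytes` are illustrative and are not taken from this project's code.

```python
import numpy as np

def reference_attention(q, k, v, causal=False):
    """Naive single-head attention; q, k, v have shape (N, d)."""
    n, d = q.shape
    scores = (q @ k.T) / np.sqrt(d)  # full (N, N) score matrix
    if causal:
        # Hide keys whose index is greater than the query index.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    # Subtract the row maximum before exponentiating for stability.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def score_matrix_bytes(n, bytes_per_element=2):
    """Bytes needed for one (N, N) score matrix (2 bytes per element assumes fp16)."""
    return n * n * bytes_per_element
```

Under the fp16 assumption, one score matrix takes 32 MiB at N = 4096 and 128 MiB at N = 8192 for a single head, before multiplying by batch size and head count.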

FlashAttention Idea

FlashAttention avoids materializing the full attention matrix in GPU high-bandwidth memory (HBM). Instead, it processes blocks of queries, keys, and values, and keeps intermediate values in faster on-chip memory when possible. An online softmax update tracks a running row maximum and a running normalizer, so the softmax can be computed block by block while remaining exact and numerically stable.

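This reduces traffic to and from HBM. Because the N × N matrix is never stored, the extra memory needed grows linearly rather than quadratically with sequence length, which lets attention scale to longer sequences.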
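The online softmax update can be illustrated on a single row of scores. The sketch below is a NumPy illustration under assumed names (`online_softmax_stats`, `block_size`), not the project's kernel code. It visits the row one block at a time and rescales the running sum whenever a larger maximum appears.

```python
import numpy as np

def online_softmax_stats(x, block_size):
    """One pass over x in blocks; returns (m, l) with m = max(x), l = sum(exp(x - m))."""
    m = -np.inf  # running maximum
    l = 0.0      # running sum of exp(x - m)
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        m_new = max(m, block.max())
        # Rescale the old sum to the new maximum, then add this block's terms.
        l = l * np.exp(m - m_new) + np.exp(block - m_new).sum()
        m = m_new
    return m, l
```

The softmax of the row is then `np.exp(x - m) / l`, which matches the result of a two-pass computation without ever needing the whole row at once.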

What This Project Implements

This project implements a Triton-based FlashAttention kernel with:

Block-wise forward attention computation
Causal and non-causal masking
Numerically stable softmax updates
Custom PyTorch autograd integration
Backward kernels for gradients
CPU reference attention for correctness testing

Why Tiling Matters

Naive attention materializes the full score matrix before applying softmax and multiplying by V.

The Triton implementation processes attention in blocks:

Load a block of Q.
Iterate over blocks of K and V.
Compute partial attention scores.
Apply causal masking when needed.
Update running softmax statistics.
Accumulate the output block.

This approach reduces the amount of intermediate data written to and read from GPU global memory.
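The block-wise loop described above can be sketched in NumPy as follows. This is an illustration of the control flow only, not the project's Triton kernel: the function name `tiled_attention` and the block sizes `block_m` and `block_n` are assumptions, and a real kernel would run the query blocks in parallel rather than in a Python loop.

```python
import numpy as np

def tiled_attention(q, k, v, block_m=4, block_n=4, causal=False):
    """Single-head attention computed block by block; q, k, v have shape (N, d)."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.empty((n, v.shape[1]))
    for qs in range(0, n, block_m):
        q_blk = q[qs:qs + block_m]                # load a block of Q
        rows = np.arange(qs, qs + q_blk.shape[0])
        m = np.full(q_blk.shape[0], -np.inf)      # running row maxima
        l = np.zeros(q_blk.shape[0])              # running normalizers
        acc = np.zeros((q_blk.shape[0], v.shape[1]))
        # With causal masking, key blocks past the last query row contribute nothing.
        k_end = min(n, qs + block_m) if causal else n
        for ks in range(0, k_end, block_n):       # iterate over blocks of K and V
            k_blk = k[ks:ks + block_n]
            v_blk = v[ks:ks + block_n]
            s = (q_blk @ k_blk.T) * scale         # partial attention scores
            if causal:                            # mask keys that lie in the future
                cols = np.arange(ks, ks + k_blk.shape[0])
                s = np.where(cols[None, :] > rows[:, None], -np.inf, s)
            m_new = np.maximum(m, s.max(axis=1))  # update running softmax statistics
            alpha = np.exp(m - m_new)             # rescale factor for old partial results
            p = np.exp(s - m_new[:, None])
            l = l * alpha + p.sum(axis=1)
            acc = acc * alpha[:, None] + p @ v_blk  # accumulate the output block
            m = m_new
        out[qs:qs + q_blk.shape[0]] = acc / l[:, None]
    return out
```

Only a `block_m × block_n` tile of scores exists at any time, so the full N × N matrix is never formed. The result should agree with a naive reference up to floating-point rounding.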
