A curated list of resources for learning and exploring Triton, OpenAI's programming language for writing efficient GPU code.
Gain deeper insights into Triton through these detailed articles:
- Understanding the Triton Tutorials Part 1 and Part 2
- Softmax in OpenAI Triton: a more detailed, step-by-step explanation of the Fused Softmax Triton example
- Accelerating AI with Triton: A Deep Dive into Writing High-Performance GPU Code
- Accelerating Triton Dequantization Kernels for GPTQ
- Triton Tutorial #2
- Triton: OpenAI’s Innovative Programming Language for Custom Deep-Learning Primitives
- Triton Kernel Compilation Stages
- Deep Dive into Triton Internals Part 1, Part 2 and Part 3
- Exploring Triton GPU programming for neural networks in Java
- Using User-Defined Triton Kernels with torch.compile
- Mamba: The Hard Way
- FP8: Accelerating 2D Dynamic Block Quantized Float8 GEMMs in Triton
- FP8: Deep Dive on CUTLASS Ping-Pong GEMM Kernel
- FP8: Deep Dive on the Hopper TMA Unit for FP8 GEMMs
- Technical Review on PyTorch2.0 and Triton
- Towards Agile Development of Efficient Deep Learning Operators
- Developing Triton Kernels on AMD GPUs
Explore the academic foundation of Triton:
Learn by watching these informative videos:
- Lecture 14: Practitioners Guide to Triton and notebook
- Lecture 29: Triton Internals
- Intro to Triton: Coding Softmax in PyTorch
- Triton Vector Addition Kernel, part 1: Making the Shift to Parallel Programming
- Tiled Matrix Multiplication in Triton - part 1
- Flash Attention derived and coded from first principles with Triton (Python)
Watch the Triton community meetups to stay up to date on recent Triton topics.
Challenge yourself with these engaging puzzles:
Enhance your Triton development workflow with these tools:
- Triton Deja-vu: a framework to reduce the autotune overhead of triton-lang to zero for well-known deployments. This small framework is based on the Triton autotuner and contributes two features to the Triton community: (1) storing and safely restoring autotuner states using JSON files, and (2) ConfigSpaces to explore a defined space exhaustively. Additionally, it allows heuristics to be used in combination with the autotuner.
- Triton Profiler and a video explaining how to use it ("Dev Tools: Proton/Interpreter")
- Triton-Viz: A Visualization Toolkit for Programming with Triton
- Make Triton easier - Triton-util provides simple higher-level abstractions for frequent but repetitive tasks. This allows you to write code that is closer to how you actually think.
- TritonBench is a collection of PyTorch operators used to evaluate the performance of Triton and its integration with PyTorch.
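The idea of storing and restoring autotuner results in JSON, mentioned in the Triton Deja-vu entry above, can be illustrated with a minimal sketch in plain Python. This is not Triton Deja-vu's API: the helpers `save_best_config` and `load_best_config`, the cache key format, and the config fields are all hypothetical and chosen only for illustration.

```python
import json
import os
import tempfile


def save_best_config(path, key, config):
    """Persist the best-known config for `key` in a JSON file (hypothetical helper)."""
    cache = {}
    if os.path.exists(path):
        with open(path) as f:
            cache = json.load(f)
    cache[key] = config
    with open(path, "w") as f:
        json.dump(cache, f, indent=2)


def load_best_config(path, key):
    """Return the stored config for `key`, or None if nothing was stored."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        return json.load(f).get(key)


# Hypothetical key: kernel name plus the problem shape it was tuned for.
cache_path = os.path.join(tempfile.mkdtemp(), "autotune_cache.json")
key = "matmul:M=1024,N=1024,K=1024"

# Pretend a tuning run found this config; the values are made up.
save_best_config(cache_path, key, {"BLOCK_M": 128, "BLOCK_N": 64, "num_warps": 4})

# A later process can skip tuning when a stored config exists for the same key.
restored = load_best_config(cache_path, key)
```

A cache miss returns `None`, which a caller could treat as the signal to fall back to a normal autotuning run.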
Catch up on the latest advancements from Triton Conferences:
Explore practical implementations with these sample kernels:
- attorch is a subset of PyTorch's nn module, written purely in Python using OpenAI's Triton
- FlagGems is a high-performance general operator library implemented in OpenAI Triton. It aims to provide a suite of kernel functions to accelerate LLM training and inference.
- Kernl lets you run PyTorch transformer models several times faster on GPU with a single line of code, and is designed to be easily hackable.
- Linger-Kernel
- Triton Kernels for Efficient Low-Bit Matrix Multiplication
- Unsloth Kernels
- An attempt at implementing a Triton kernel for GPTQ inference. The code is based on the GPTQ-for-LLaMa codebase, which is itself based on the GPTQ codebase.
- triton-index - Catalog openly available Triton kernels
- Triton-based implementation of Sparse Mixture-of-Experts (SMoE) on GPUs
- Variety of Triton and CUDA kernels for training and inference
- EquiTriton is a project that seeks to implement high-performance kernels for commonly used building blocks in equivariant neural networks, enabling compute-efficient training and inference
- Expanded collection of Neural Network activation functions and other function kernels in Triton by OpenAI.
- Fused kernels
- Triton activations only feed forward
- LightLLM is a Python-based LLM (Large Language Model) inference and serving framework, notable for its lightweight design, easy scalability, and high-speed performance
- Bitsandbytes - a lightweight Python wrapper around CUDA custom functions, in particular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8-bit and 4-bit quantization functions
Kernel | Description | Resource |
---|---|---|
VectorAdd | A simple kernel that performs element-wise addition of two vectors. Useful for understanding the basics of GPU programming in Triton. | 1 2 |
Matmul | An optimized kernel for matrix multiplication, achieving high performance by leveraging memory hierarchy and parallelism. | 1 2 Grouped GEMM |
Softmax | A kernel for efficient computation of the softmax function, commonly used in machine learning models like transformers. | 1 2 3 |
Dropout | A kernel for implementing low-memory dropout, a regularization technique to prevent overfitting in neural networks. | 1 2 |
Layer Normalization | A kernel for layer normalization, which normalizes activations within a layer to improve training stability in deep learning models. | 1 2 3 |
Fused Attention | A kernel that efficiently implements attention mechanisms by combining multiple operations, key to transformers and similar architectures. | 1 2 |
Conv1d | A kernel for 1D convolution, often used in processing sequential data like time series or audio signals. | 1 |
Conv2d | A kernel for 2D convolution, a fundamental operation in computer vision tasks such as image classification or object detection. | 1 |
MultiheadAttention | A kernel for multi-head attention, a crucial component in transformer-based models for capturing complex relationships in data. | 1 |
Hardsigmoid | A kernel for the Hardsigmoid activation function, an efficient approximation of the sigmoid function used in certain neural network layers. | 1 |
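GeLU | A kernel for the GELU (Gaussian Error Linear Unit) activation function, widely used in transformer feed-forward layers. | 1 |
GeGLU | A kernel for GeGLU, a gated linear unit variant that uses GELU as the gating activation. | 1 |
RMSNorm | A kernel for RMSNorm (root mean square normalization), which rescales activations by their root mean square without subtracting the mean. | 1 |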
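To give a feel for what the VectorAdd entry in the table teaches, here is a minimal sketch of the block-based decomposition that such a kernel relies on. It is a plain-Python emulation, not Triton code: each call to the hypothetical `add_block` plays the role of one program instance (identified in Triton by `tl.program_id`), and the bounds check stands in for the `mask` argument passed to `tl.load` and `tl.store`. The block size of 4 is an arbitrary assumption.

```python
def cdiv(a, b):
    """Ceiling division: number of blocks needed to cover `a` elements."""
    return (a + b - 1) // b


def add_block(x, y, out, n_elements, pid, block_size):
    """Emulate one program instance handling one block of the vectors."""
    block_start = pid * block_size
    for i in range(block_size):
        offset = block_start + i
        # Mask: the last block may extend past the end of the data.
        if offset < n_elements:
            out[offset] = x[offset] + y[offset]


def vector_add(x, y, block_size=4):
    """Launch one emulated program per block (sequentially here, in parallel on a GPU)."""
    n_elements = len(x)
    out = [0] * n_elements
    for pid in range(cdiv(n_elements, block_size)):
        add_block(x, y, out, n_elements, pid, block_size)
    return out


# 6 elements with block_size=4 -> 2 blocks; the second block is partially masked.
result = vector_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60])
```

On a GPU the blocks run concurrently rather than in a loop, which is why each instance must compute its own offsets and mask independently of the others.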
Feel free to contribute more resources or suggest updates by opening a pull request or issue in this repository.
This resource list is open-sourced under the MIT license.