
Commit 0304281

New README (#392)
* New README * yolo * yolo * Update README.md * Update README.md * Trigger CI * Trigger CI * Trigger CI * push * push * push * push * push * push * push * push * push * push * push
1 parent 6b0ca2d commit 0304281

File tree

1 file changed: +96 −110 lines changed

README.md

Lines changed: 96 additions & 110 deletions
@@ -2,147 +2,133 @@

[![](https://dcbadge.vercel.app/api/server/cudamode?style=flat)](https://discord.gg/cudamode)

## Introduction

`torchao` is a PyTorch library for quantization and sparsity: it lets you create and integrate high-performance custom data types, layouts, and kernels into your PyTorch workflows, with up to **2x speedups** and **65%** less VRAM for [inference](#inference), and support for [training](#training).

All with no intrusive code changes and minimal accuracy degradation.

## Benchmarks

### Inference

#### Without intrusive code changes

Quantizing your models is a one-liner that should work on any model with `nn.Linear`, including your favorite Hugging Face model. You can find more comprehensive usage instructions [here](torchao/quantization/) and a Hugging Face inference example [here](scripts/hf_eval.py).

```python
from torchao.quantization.quant_api import quantize
m = quantize(m, "int4wo")
```
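
For a slightly fuller picture, here is a minimal end-to-end sketch using the `quantize` API exactly as shown above; the toy model, shapes, and the use of `torch.compile` are illustrative assumptions, and a CUDA GPU with bfloat16 support is assumed.

```python
import torch
from torchao.quantization.quant_api import quantize

# Toy stand-in for "any model with nn.Linear"; swap in your own model here
m = torch.nn.Sequential(torch.nn.Linear(1024, 1024)).cuda().to(torch.bfloat16)

# Quantize the weights to int4 (weight-only), then compile for fast inference
m = quantize(m, "int4wo")
m = torch.compile(m, mode="max-autotune")

x = torch.randn(16, 1024, dtype=torch.bfloat16, device="cuda")
with torch.no_grad():
    y = m(x)
```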

Benchmarks are run on a machine with a single A100 GPU, using the script in `_models/llama`, which generates text in a latency-optimized way (batch size 1).

The models used were `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Meta-Llama-3-8B`.

| Model | Technique | wikitext-perplexity | Tokens/Second | Memory Bandwidth (GB/s) | Peak Memory (GB) | Model Size (GB) |
| ----------- | ------------------ | ------------------- | ------------- | ----------------------- | ---------------- | --------------- |
| Llama-2-7B | Base (bfloat16) | 12.212 | 105.02 | 1387.78 | 13.21 | 13.90 |
| | int8dq | 12.262 | 9.40 | 62.26 | 6.62 | 8.61 |
| | int8wo | 12.204 | 147.03 | 973.54 | 6.62 | 8.95 |
| | int4wo-64 | 12.843 | 199.81 | 746.45 | 3.74 | 4.75 |
| | int4wo-64-GPTQ | 12.489 | 199.81 | 746.45 | 3.74 | 4.75 |
| Llama-3-8B | Base (bfloat16) | | 94.91 | 1424.58 | 15.01 | 16.43 |
| | int8dq | | 8.41 | 63.23 | 7.52 | 9.24 |
| | int8wo | | 136.75 | 1028.38 | 7.52 | 10.42 |
| | int4wo-64 | | 179.41 | 757.45 | 4.22 | 6.88 |

Note: int8 dynamic quantization works best on compute-bound models, as opposed to memory-bound ones. A relatable example is [SAM](https://github.com/pytorch-labs/segment-anything-fast), which is compute bound, versus Llama at batch size 1, which is memory bound.

For int4 we make heavy use of [tinygemm](https://github.com/pytorch/ao/blob/cb3bd8c674f2123af232a0231b5e38ddafa756a8/torchao/dtypes/aqt.py#L526), i.e. `torch.ops.aten._weight_int4pack_mm`, to bitpack weights into a layout optimized for tensor cores.

And a quick crash course on inference quantization to help parse the table above. "Int4 quantization" is an ambiguous term because there is the dtype in which a layer's weights are stored and the dtype in which the computation is done. For example, with weight-only (wo) int4 quantization the weight is upcast to a larger dtype like fp16 at matmul time, so the int4 matrix multiplication is effectively `F.linear(input, weight.to(input.dtype))`. Dynamic quantization (dq) primarily targets activations, enabling on-the-fly quantization from higher-precision formats like bf16 to lower-precision formats such as int8. When supported by hardware, this allows the computation to be done directly in the lower precision, e.g. `F.linear(input, weight)`. Naive quantization algorithms are also notoriously sensitive to outliers, so we typically set a group size that applies a scale factor per group of 64 elements in the case of `int4wo-64`.
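
To make the distinction concrete, here is a small illustrative sketch of the two compute paths described above, written in plain PyTorch; it is not torchao's implementation, and the int8 matmul is emulated in float since a true low-precision matmul depends on hardware support.

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 64, dtype=torch.bfloat16)                   # bf16 activations
w_int8 = torch.randint(-128, 128, (32, 64), dtype=torch.int8)  # quantized weight
w_scale = torch.full((32, 1), 0.01, dtype=torch.bfloat16)      # per-channel weight scale

# Weight-only (wo): dequantize the weight back to the activation dtype, then matmul in bf16
y_wo = F.linear(x, w_int8.to(torch.bfloat16) * w_scale)

# Dynamic quantization (dq): also quantize the activations on the fly so that,
# with hardware support, the matmul itself can run in int8 (emulated in fp32 here)
x_scale = x.abs().amax(dim=-1, keepdim=True).float() / 127
x_int8 = torch.clamp((x.float() / x_scale).round(), -128, 127).to(torch.int8)
y_dq = (x_int8.float() @ w_int8.float().T) * (x_scale * w_scale.float().T)
```

For `int4wo-64`, the weight scale above would additionally be applied per group of 64 elements along the input dimension rather than per output channel.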

#### With intrusive code changes

In some cases we rewrote popular GenAI models in native PyTorch (no C++/CUDA) to achieve what was, at the time, SOTA inference performance. These rewrites involve more intrusive code changes.

* 8x speedups for image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* 10x speedups for language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* 3x speedups for diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)

### Training

We've added support for semi-structured 2:4 sparsity, with 6% end-to-end speedups on ViT-L.

The code change is a one-liner, with the full example available [here](torchao/sparsity/training/):

```python
# The import below is an assumption based on the linked torchao/sparsity/training/ module
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})
```
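
For context, here is a minimal sketch of how that swap might sit inside a training step; the toy model, shapes, dummy loss, and optimizer are illustrative assumptions, the import path is inferred from the linked `torchao/sparsity/training/` module, and an NVIDIA GPU with 2:4 sparse tensor core support is assumed.

```python
import torch
import torch.nn as nn
from torchao.sparsity.training import SemiSparseLinear, swap_linear_with_semi_sparse_linear

class ToyFFN(nn.Module):
    def __init__(self):
        super().__init__()
        # "seq.0" below refers to this first Linear by its fully qualified name
        self.seq = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))

    def forward(self, x):
        return self.seq(x)

model = ToyFFN().cuda().to(torch.bfloat16)

# Swap the targeted nn.Linear for its runtime semi-sparse counterpart
swap_linear_with_semi_sparse_linear(model, {"seq.0": SemiSparseLinear})

optimizer = torch.optim.AdamW(model.parameters())
x = torch.randn(8, 1024, dtype=torch.bfloat16, device="cuda")
loss = model(x).pow(2).mean()  # dummy loss, just to exercise forward + backward
loss.backward()
optimizer.step()
```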

## Newer dtypes

* [MX](torchao/prototype/mx_formats) implements training and inference support for tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise-scaled float8/float6/float4/int8, with the scales constrained to powers of two. This work is a prototype, as native hardware support is not available yet.
* [nf4](torchao/dtypes/nf4tensor.py), which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst), one of the most popular finetuning algorithms, without writing custom Triton or CUDA code. Accessible talk [here](https://x.com/HamelHusain/status/1800315287574847701)
* [fp6](torchao/prototype/fp6_llm/) for 2x faster inference over fp16, with an easy-to-use wrapper API: `convert_fp6_llm(model)`

## Composability

A key design principle for us is composability: any new dtype or layout we provide needs to work with `torch.compile()` and with `FSDP`. It shouldn't matter whether the kernels are written in pure PyTorch, CUDA, C++, or Triton - things should just work! Here is our current strategy:

1. Write the dtype, layout, or bit-packing logic in pure PyTorch and code-generate efficient kernels with torch.compile. You can inspect those kernels with `TORCH_LOGS="output_code" python your_code.py` and check whether a single kernel is being generated and whether any unnecessary buffers are being created (see the sketch after this list).
2. However, once you have a kernel, how do you know how good it is? The best way is to benchmark the code-generated kernel against the best kernel on the market. Packaging custom C++/CUDA kernels that work on multiple devices is tedious, but we've abstracted that tedium away with our [custom ops support](./torchao/csrc/), so if you love writing kernels but hate packaging, we'd love to accept contributions for your custom ops. One key benefit is that a kernel written as a custom op will just work, with no graph breaks, with `torch.compile()`. Compilers are great at optimizations like fusion and overhead reduction, but it's challenging for a compiler to rewrite the math of an algorithm so that it's faster yet still numerically stable, so we're betting on both compilers and custom ops.
3. Finally, while historically most quantization has been done for inference, there is now a thriving area of research combining lower dtypes and sharding. One popular example is [NF4](torchao/dtypes/nf4tensor.py), which is used to implement the QLoRA algorithm; you can define the semantics for how such custom tensors should be sharded over multiple devices. We gave an accessible talk on [how to do this](https://x.com/HamelHusain/status/1800315287574847701).
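
As a concrete (and deliberately tiny) example of step 1, here is a sketch of pure-PyTorch bit-unpacking logic compiled with torch.compile; the uint4 packing scheme shown is an illustrative assumption, not torchao's actual layout.

```python
import torch

def unpack_uint4(packed: torch.Tensor) -> torch.Tensor:
    # Each uint8 holds two 4-bit values: recover the low and high nibbles
    low = packed & 0x0F
    high = (packed >> 4) & 0x0F
    return torch.stack([low, high], dim=-1).flatten(-2)

unpack_uint4_c = torch.compile(unpack_uint4)

device = "cuda" if torch.cuda.is_available() else "cpu"
packed = torch.randint(0, 256, (1024,), dtype=torch.uint8, device=device)
print(unpack_uint4_c(packed).shape)  # torch.Size([2048])
```

Running this with `TORCH_LOGS="output_code" python your_code.py` prints the kernel Inductor generates, which is exactly what you'd inspect for fusion and unnecessary buffers.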

## Get Started

### Installation

`torchao` makes liberal use of several new features in PyTorch; it's recommended to use it with the current nightly or the latest stable version of PyTorch.

Stable Release
```Shell
pip install torchao --extra-index-url https://download.pytorch.org/whl/test/cu121 # full options are cpu/cu118/cu121/cu124
```

Nightly Release
```Shell
pip install --pre torchao-nightly --index-url https://download.pytorch.org/whl/nightly/cu121 # full options are cpu/cu118/cu121/cu124
```

## Community Contributions

* [jeromeku](https://github.com/jeromeku) has implemented
  * [GaLore](torchao/prototype/galore/), a drop-in for the Adam optimizer that allows you to finetune Llama 7B on a single 4090 card with up to 70% speedups relative to eager PyTorch
  * [DoRA](torchao/prototype/dora), a newer replacement for QLoRA with more promising convergence characteristics
  * [Fused int4/fp16 Quant Matmul](torchao/prototype/hqq), which is particularly useful for compute-bound kernels, showing 4x speedups over tinygemm for larger batch sizes such as 512
* [gau-nernst](https://github.com/gau-nernst) implemented fp6 kernels that are 4x faster than fp16: [torchao/prototype/fp6_llm](torchao/prototype/fp6_llm)
* [vayuda](https://github.com/vayuda) contributed generic bitpacking kernels that were code-generated using pure PyTorch: [prototype/common](torchao/prototype/common)
* [andreaskopf](https://github.com/andreaskoepf) and [melvinebenezer](https://github.com/melvinebenezer) contributed [1-bit LLMs](torchao/prototype/dtypes): BitNet 1.58 bitpacked into uint2 and fully code-generated with torch.compile

## How to contribute

This repository is currently under heavy development.

* If you have suggestions on the API or use cases you'd like to be covered, please open an [issue](https://github.com/pytorch/ao/issues)
* If you'd like to co-develop the library with us, please join us in #torchao on [discord.gg/cudamode](https://discord.gg/cudamode) - there are a lot of dtypes out there and we could use a lot more hands to make them go brrr

Installation instructions:

```Shell
git clone https://github.com/pytorch/ao
cd ao
python setup.py install
```

If you're contributing a feature to `ao`, run:
```Shell
pip install -r dev-requirements.txt
python setup.py develop
```

For *most* developers, you probably want to skip building custom C++/CUDA extensions for faster iteration:

```shell
USE_CPP=0 python setup.py install
```

### Quantization

```python
import torch
import torchao

# inductor settings which improve torch.compile performance for quantized modules
torch._inductor.config.force_fuse_int_mm_with_mul = True
torch._inductor.config.use_mixed_mm = True

# Plug in your model and example input
model = torch.nn.Sequential(torch.nn.Linear(32, 64)).cuda().to(torch.bfloat16)
input = torch.randn(32, 32, dtype=torch.bfloat16, device='cuda')

# perform autoquantization and compilation
q_model = torchao.autoquant(torch.compile(model, mode='max-autotune'))
q_model(input)
```

### Sparsity

```python
import torch
from torch.sparse import to_sparse_semi_structured, SparseSemiStructuredTensor
from torch.ao.pruning import WeightNormSparsifier

# bfloat16 CUDA model
model = torch.nn.Sequential(torch.nn.Linear(64, 64)).cuda().to(torch.bfloat16)

# Accuracy: Finding a sparse subnetwork
sparse_config = []
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        sparse_config.append({"tensor_fqn": f"{name}.weight"})

sparsifier = WeightNormSparsifier(sparsity_level=1.0,
                                  sparse_block_shape=(1, 4),
                                  zeros_per_block=2)

# attach FakeSparsity
sparsifier.prepare(model, sparse_config)
sparsifier.step()
sparsifier.squash_mask()
# now we have a dense model with sparse weights

# Performance: Accelerated sparse inference
for name, mod in model.named_modules():
    if isinstance(mod, torch.nn.Linear):
        mod.weight = torch.nn.Parameter(to_sparse_semi_structured(mod.weight))
```

To learn more, try out our APIs; you can check out API examples in
* [quantization](./torchao/quantization)
* [sparsity](./torchao/sparsity)
* [dtypes](./torchao/dtypes)

## Supported Features

1. [Quantization algorithms](./torchao/quantization)
   - [Int8 weight-only](https://github.com/pytorch/ao/blob/main/torchao/quantization/weight_only.py) quantization
   - [Int4 weight-only](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/cuda/int4mm.cu) quantization
   - [GPTQ](https://github.com/pytorch/ao/blob/main/torchao/quantization/GPTQ.py) and [Smoothquant](https://github.com/pytorch/ao/blob/main/torchao/quantization/smoothquant.py) for low latency inference
   - High level [torchao.autoquant API](https://github.com/pytorch/ao/blob/main/torchao/quantization/autoquant.py) and [kernel autotuner](https://github.com/pytorch/ao/blob/main/torchao/kernel/autotuner.py) targeting SOTA performance across varying model shapes on consumer and enterprise GPUs
2. [Sparsity algorithms](./torchao/sparsity) such as Wanda that help improve accuracy of sparse networks
3. Support for lower precision [dtypes](./torchao/dtypes) such as
   - [nf4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py) which was used to [implement QLoRA](https://github.com/pytorch/torchtune/blob/main/docs/source/tutorials/qlora_finetune.rst) without writing custom Triton or CUDA code
   - [uint4](https://github.com/pytorch/ao/blob/main/torchao/dtypes/uint4.py)
   - [MX](https://github.com/pytorch/ao/blob/main/torchao/prototype/mx_formats) implementing training and inference support with tensors using the [OCP MX spec](https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf) data types, which can be described as groupwise scaled float8/float6/float4/int8, with the scales being constrained to powers of two. This work is a prototype as the hardware support is not available yet.
4. [Bleeding Edge Kernels](./torchao/prototype/) for experimental kernels without backwards compatibility guarantees
   - [GaLore](https://github.com/pytorch/ao/tree/main/torchao/prototype/galore) for memory efficient finetuning
   - [fused HQQ Gemm Kernel](https://github.com/pytorch/ao/tree/main/torchao/prototype/hqq) for compute bound workloads
   - [FP6-LLM](torchao/prototype/fp6_llm) mixed matmul FP16 x FP6 kernel for io bound workloads

## Our Goals

* Composability with `torch.compile`: We rely heavily on `torch.compile` to write pure PyTorch code and codegen efficient kernels. There are, however, limits to what a compiler can do, so we don't shy away from writing custom CUDA/Triton kernels.
* Composability with `FSDP`: The new support for FSDP per-parameter sharding means engineers and researchers alike can experiment with different quantization and distributed strategies concurrently.
* Performance: We measure our performance on every commit using an A10G. We also regularly run performance benchmarks on the [torchbench](https://github.com/pytorch/benchmark) suite.
* Heterogeneous Hardware: Efficient kernels that can run on CPU/GPU based servers (w/ torch.compile) and mobile backends (w/ ExecuTorch).
* Packaging kernels should be easy: We support custom [CUDA and Triton extensions](./torchao/csrc/) so you can focus on writing your kernels and we'll ensure that they work on most operating systems and devices.

## Integrations

torchao has been integrated with other libraries, including:

* [torchtune](https://github.com/pytorch/torchtune/blob/main/recipes/quantization.md) leverages our 8- and 4-bit weight-only quantization techniques with optional support for GPTQ
* [Executorch](https://github.com/pytorch/executorch/tree/main/examples/models/llama2#quantization) leverages our GPTQ implementation for both 8da4w (int8 dynamic activation with int4 weight) and int4 weight-only quantization
* [HQQ](https://github.com/mobiusml/hqq/blob/master/hqq/backends/torchao.py) leverages our int4mm kernel for low-latency inference

## Success stories

Our kernels have been used to achieve SOTA inference performance on:

* Image segmentation models with [sam-fast](https://pytorch.org/blog/accelerating-generative-ai)
* Language models with [gpt-fast](https://pytorch.org/blog/accelerating-generative-ai-2)
* Diffusion models with [sd-fast](https://pytorch.org/blog/accelerating-generative-ai-3)

## License

`torchao` is released under the [BSD 3](https://github.com/pytorch-labs/ao/blob/main/LICENSE) license.
