Commit 1adf995

address comment
1 parent 3c65b0f commit 1adf995


torchao/float8/README.md

Lines changed: 9 additions & 9 deletions
@@ -215,15 +215,15 @@ and tensorwise scaling. The training benchmarks were all run using:
 - `torch.compile`
 - FSDP2
 
-| Model | Scaling | Activation checkpointing | Median tokens/second | Peak Memory (GB) |
-| ------------- | ------------ | ------------------------ | ------------------------- | ---------------- |
-| Llama3-8b | none (bf16) | per op SAC | 6019 | 47.65 |
-| Llama3-8b | tensorwise | per op SAC | 7190 | 47.77 |
-| Llama3-8b | rowwise | per op SAC | 6649 | 47.79 |
-
-In these benchmarks tensorwise scaling achieved ~8% higher tokens/second over rowwise scaling, and ~19.5% higher than the bf16 baseline.
-However, it is important to note that rowwise scaling has been shown to yield improvments in training loss/accuracy due to reduced quantization error, particularly
-when training large models for many steps.
+| Model | Scaling | Activation checkpointing | Peak Memory (GB) | Median tokens/second | Speedup over baseline |
+| ------------- | ------------ | ------------------------ | ---------------- | -------------------- | --------------------- |
+| Llama3-8b | none (bf16) | per op SAC | 47.65 | 6019 | - |
+| Llama3-8b | tensorwise | per op SAC | 47.77 | 7190 | 19.45% |
+| Llama3-8b | rowwise | per op SAC | 47.79 | 6649 | 10.47% |
+
+**Important notes**:
+- Speedups increase as M, K, N (GEMM dimensions) increase. Speedups as high as 1.5x have been measured with larger shapes ([example](https://pytorch.org/blog/training-using-float8-fsdp2/)).
+- Rowwise scaling is better at handling outliers than tensorwise scaling, so these recipes are different points on the accuracy vs performance curve.
 
 **Reproducing training benchmarks**
 To reproduce these benchmarks, you can follow these steps:
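
As a quick check on the new "Speedup over baseline" column: the percentages follow directly from the median tokens/second figures in the table. A small sketch, using only the numbers shown above:

```python
# Derive the "Speedup over baseline" column from median tokens/second.
baseline_tps = 6019  # bf16 row
for recipe, tps in [("tensorwise", 7190), ("rowwise", 6649)]:
    print(f"{recipe}: {tps / baseline_tps - 1:.2%}")
# tensorwise: 19.46% (the table's 19.45% truncates rather than rounds)
# rowwise: 10.47%
```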
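
For readers comparing the tensorwise and rowwise rows, a minimal sketch of how a model is converted to either recipe, assuming the torchao float8 training entry points documented elsewhere in this README (`convert_to_float8_training` and `Float8LinearConfig.from_recipe_name`); the model definition is illustrative, not the benchmarked Llama3-8b:

```python
import torch
import torch.nn as nn

from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Illustrative model; the benchmarks above used Llama3-8b.
model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

# "tensorwise": one scale per tensor -- the faster recipe in the table above.
# "rowwise": per-row scales -- more robust to activation outliers, so a
# different point on the accuracy vs performance curve.
config = Float8LinearConfig.from_recipe_name("rowwise")
convert_to_float8_training(model, config=config)

# The benchmarks above were also run with torch.compile (and FSDP2 for sharding).
model = torch.compile(model)
```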
