🖥 Benchmarking transformers w/ HF Trainer on a single A100 40GB
These benchmarks use a dedicated benchmarking tool that does all the work for us: #14934
This is the index post; the specific benchmarks are in their own posts below:
- fp16 vs bf16 vs tf32 vs fp32
- gradient accumulation steps
- batch size
- gradient checkpointing
- optimizers
- combining winning strategies (~3x speed improvement!)
- RTX-3090 vs A100
Note that each benchmark was run only once, so repeating the runs and averaging would probably give slightly different results. The purpose here, though, is to show rough relative differences between setups, not to report exact numbers.
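For anyone who wants to repeat a benchmark several times, here is a minimal sketch of averaging the runs and reporting a relative difference against a baseline. The helper names (`summarize`, `relative_diff_percent`) and the throughput values are hypothetical placeholders for illustration; they are not part of the benchmarking tool and not measured results.

```python
from statistics import mean, stdev


def summarize(runs):
    """Return (mean, sample standard deviation) of repeated measurements."""
    # A single run has no spread to report, so fall back to 0.0.
    return mean(runs), (stdev(runs) if len(runs) > 1 else 0.0)


def relative_diff_percent(baseline, variant):
    """Percent change of variant relative to baseline (positive = higher)."""
    return (variant - baseline) / baseline * 100.0


# Placeholder throughput values (e.g. samples per second), NOT real results.
baseline_runs = [100.0, 102.0, 98.0]
variant_runs = [150.0, 147.0, 153.0]

base_mean, base_sd = summarize(baseline_runs)
var_mean, var_sd = summarize(variant_runs)
diff = relative_diff_percent(base_mean, var_mean)

print(f"baseline: {base_mean:.1f} +/- {base_sd:.1f}")
print(f"variant:  {var_mean:.1f} +/- {var_sd:.1f}")
print(f"relative difference: {diff:+.1f}%")
```

If the run-to-run spread is comparable to the difference between two setups, that comparison should be treated as inconclusive.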
See also the same benchmarks for RTX-3090