Description
The recently added optimizer CPU offload in torchao can be useful for low-memory single-GPU configs.
https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload
In my brief testing (main...gau-nernst:torchtune:optim_offload), there is a ~25% increase in tok/s. Wandb project: https://wandb.ai/gau-nernst/torchtune. My system: 4070Ti SUPER (16GB VRAM), Ryzen 5600, DDR4.

There is also a difference in how gradient memory is handled:
- For CPU offload, I use `offload_gradients=True` in `CPUOffloadOptimizer`, which frees each gradient once its device-to-host transfer finishes (see the sketch below).
- For paged Adam, it is done via `optimizer_in_bwd=True`.
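For reference, a minimal sketch of the CPU offload setup (the toy model and hyperparameters here are placeholders; the exact `CPUOffloadOptimizer` options are documented in the torchao README linked above):

```python
import torch
import torch.nn as nn
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

# Toy stand-in for the fine-tuned model; params must live on CUDA.
model = nn.Linear(1024, 1024, device="cuda")

# Optimizer states live in CPU RAM and the optim step runs on the CPU.
# offload_gradients=True frees each GPU gradient once its device-to-host
# copy finishes, so gradients don't pile up on the GPU
# (note: this is not compatible with gradient accumulation).
optimizer = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,   # base optimizer that runs on the CPU copies
    offload_gradients=True,
    lr=2e-5,             # extra kwargs are forwarded to the base optimizer
)

# Used like a regular optimizer in the training loop.
loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
```

The paged Adam comparison point is torchtune's existing low-memory setup, i.e. bitsandbytes `PagedAdamW8bit` together with `optimizer_in_bwd: True` in the recipe config.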
Regarding memory usage, it's a bit strange: in nvidia-smi, the paged Adam run also occupies a lot of memory (near 16GB). Perhaps bnb manages its own unified memory, so PyTorch doesn't report it? Also, for RAM usage, htop reports 55.5GB for paged Adam and 64.1GB for offload Adam.
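To check whether the nvidia-smi number for paged Adam is really invisible to PyTorch, we could compare the caching-allocator stats against the driver-level numbers, roughly like this (just a sketch):

```python
import torch

# Peak memory the PyTorch caching allocator knows about. Memory that bnb
# allocates outside the allocator (e.g. unified-memory pages for paged
# optimizer states) would not show up here.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")

# Driver-level view of the device, closer to what nvidia-smi reports.
free, total = torch.cuda.mem_get_info()
print(f"device in use:  {(total - free) / 1e9:.2f} GB")
```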
We probably need more testing. In particular:
- Different system configurations. CPU offload Adam may depend on RAM and CPU speed, since the optim step is done on the CPU. Paged Adam might be faster when there is more spare GPU memory, since it does the optim step on the GPU. The optimal batch size (to maximize tok/s) might also differ for each config.
- Memory spike behavior. For CPU offload Adam, I had to add `expandable_segments:True` to prevent OOM in the middle of training (see the snippet after this list). Memory spike behavior might be unpredictable with CPU offload Adam, since it is not well tested; the spike might come from gradient offloading (ref: Optimizer CPU offload for single GPU training ao#584 (comment), not 100% sure). I haven't tested paged Adam without `expandable_segments:True` yet.
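For reference, that allocator flag goes through the standard `PYTORCH_CUDA_ALLOC_CONF` environment variable, typically exported in the shell before launching the recipe, or set at the top of the script before CUDA is initialized:

```python
import os

# Must be set before the CUDA caching allocator is initialized,
# i.e. before the first CUDA allocation in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402
```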
Regardless, I think adding an extra option for low-memory single-GPU training is beneficial, even if it is not well-tested yet.
cc @msaroufim