Description
The recently added optimizer CPU offload in torchao can be useful for low-memory single-GPU configs.
https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload
In my brief testing (main...gau-nernst:torchtune:optim_offload), there is a ~25% increase in tok/s. Wandb project: https://wandb.ai/gau-nernst/torchtune. My system: 4070Ti SUPER (16GB VRAM), Ryzen 5600, DDR4.

There is also a difference in how gradient memory is handled:
- For CPU offload, I use `offload_gradients=True` in `CPUOffloadOptimizer`, which frees each gradient once its device-to-host transfer finishes (see the sketch below).
- For paged Adam, it is done via `optimizer_in_bwd=True`.
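For reference, a minimal sketch of the CPU offload setup (the toy model and hyperparameters here are placeholders; the exact `CPUOffloadOptimizer` options are documented in the torchao README linked above):

```python
import torch
import torch.nn as nn
from torchao.prototype.low_bit_optim import CPUOffloadOptimizer

# Toy stand-in for the fine-tuned model; params must live on CUDA.
model = nn.Linear(1024, 1024, device="cuda")

# Optimizer states live in CPU RAM and the optim step runs on the CPU.
# offload_gradients=True frees each GPU gradient once its device-to-host
# copy finishes, so gradients don't pile up on the GPU
# (note: this is not compatible with gradient accumulation).
optimizer = CPUOffloadOptimizer(
    model.parameters(),
    torch.optim.AdamW,   # base optimizer that runs on the CPU copies
    offload_gradients=True,
    lr=2e-5,             # extra kwargs are forwarded to the base optimizer
)

# Used like a regular optimizer in the training loop.
loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
```

The paged Adam comparison point is torchtune's existing low-memory setup, i.e. bitsandbytes `PagedAdamW8bit` together with `optimizer_in_bwd: True` in the recipe config.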
Regarding memory usage, it's a bit strange: in nvidia-smi, the paged Adam run also occupies a lot of memory (near 16GB). Perhaps bnb manages its own unified memory, so PyTorch doesn't report it? Also, for RAM usage, htop reports 55.5GB for paged Adam and 64.1GB for offload Adam.
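To check whether the nvidia-smi number for paged Adam is really invisible to PyTorch, we could compare the caching-allocator stats against the driver-level numbers, roughly like this (just a sketch):

```python
import torch

# Peak memory the PyTorch caching allocator knows about. Memory that bnb
# allocates outside the allocator (e.g. unified-memory pages for paged
# optimizer states) would not show up here.
print(f"peak allocated: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
print(f"peak reserved:  {torch.cuda.max_memory_reserved() / 1e9:.2f} GB")

# Driver-level view of the device, closer to what nvidia-smi reports.
free, total = torch.cuda.mem_get_info()
print(f"device in use:  {(total - free) / 1e9:.2f} GB")
```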
We probably need more testing. In particular:
- Different system configurations. CPU offload Adam may depend on RAM and CPU speed, since the optim step is done on the CPU. Paged Adam might be faster when there is more spare GPU memory, since it does the optim step on the GPU. The optimal batch size (to maximize tok/s) might also differ for each config.
- Memory spike behavior. For CPU offload Adam, I had to add `expandable_segments:True` to prevent OOM in the middle of training (see the snippet after this list). Memory spike behavior might be unpredictable with CPU offload Adam, since it is not well tested; the spike might come from gradient offloading (ref: Optimizer CPU offload for single GPU training ao#584 (comment), not 100% sure). I haven't tested paged Adam without `expandable_segments:True` yet.
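For reference, that allocator flag goes through the standard `PYTORCH_CUDA_ALLOC_CONF` environment variable, typically exported in the shell before launching the recipe, or set at the top of the script before CUDA is initialized:

```python
import os

# Must be set before the CUDA caching allocator is initialized,
# i.e. before the first CUDA allocation in the process.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402
```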
Regardless, I think adding an extra option for low-memory single-GPU training is beneficial, even if it is not well-tested yet.
cc @msaroufim