[RFC] Optimizer CPU offload from torchao for single GPU low memory config #1278

Closed
@gau-nernst

The recent addition of optimizer CPU offload in torchao can be useful for single GPU low memory config.

https://github.com/pytorch/ao/tree/main/torchao/prototype/low_bit_optim#optimizer-cpu-offload
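For readers unfamiliar with the API, here is a minimal sketch of how the offload optimizer could be wired in. The helper name `build_optimizer` and the fallback to plain AdamW are assumptions made for illustration, not part of torchao or torchtune. Only `CPUOffloadOptimizer` and its `offload_gradients` flag come from torchao, and the import path is the prototype one linked above, which may change between releases.

```python
import torch
import torch.nn as nn


def build_optimizer(model: nn.Module, lr: float = 2e-5):
    """Hypothetical helper: use CPU offload when possible, else plain AdamW."""
    try:
        # Prototype import path from the link above; it may move in later releases.
        from torchao.prototype.low_bit_optim import CPUOffloadOptimizer
    except ImportError:
        CPUOffloadOptimizer = None

    params = list(model.parameters())
    # CPU offload only makes sense when the parameters live on a CUDA device.
    on_cuda = len(params) > 0 and all(p.is_cuda for p in params)

    if CPUOffloadOptimizer is not None and on_cuda:
        # Optimizer state is kept on the CPU; offload_gradients=True also frees
        # each GPU gradient once its device-to-host copy has finished.
        return CPUOffloadOptimizer(
            params, torch.optim.AdamW, offload_gradients=True, lr=lr
        )

    # Fallback (assumption for this sketch): regular AdamW on the model's device.
    return torch.optim.AdamW(params, lr=lr)


# Usage: the training step looks the same for either optimizer.
model = nn.Linear(8, 8)
optim = build_optimizer(model)
loss = model(torch.randn(4, 8)).sum()
loss.backward()
optim.step()
optim.zero_grad()
```

The fallback branch is only there so the sketch also runs on a machine without a GPU or without torchao installed.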

In my brief testing (main...gau-nernst:torchtune:optim_offload), there is a ~25% increase in tok/s. Wandb project: https://wandb.ai/gau-nernst/torchtune. My system: 4070Ti SUPER (16GB VRAM), Ryzen 5600, DDR4.

There is also a difference in how gradient memory is handled.

  • For CPU offload, I use offload_gradients=True in CPUOffloadOptimizer, which frees each gradient once its device-to-host transfer finishes.
  • For paged Adam, it is done via optimizer_in_bwd=True.

Regarding memory usage, it's pretty strange: in nvidia-smi, the paged Adam run also occupies a lot of GPU memory (near 16GB). Perhaps this is because bnb manages its own unified memory, so PyTorch doesn't report it? As for RAM usage, htop reports 55.5GB for paged Adam and 64.1GB for offload Adam.

We probably need more testing. In particular:

  • Different system configurations. CPU offload Adam speed likely depends on RAM and CPU speed, since the optim step is done on CPU. Paged Adam might be faster when there is more spare GPU memory, since it does the optim step on GPU. The optimal batch size (to maximize tok/s) for each config might be different too.
  • Memory spike behavior. For CPU offload Adam, I had to add expandable_segments:True to prevent OOM in the middle of training. Memory spike behavior might be unpredictable with CPU offload Adam, since it is not well tested. The spike might come from gradient offloading (ref: a comment in "Optimizer CPU offload for single GPU training", ao#584; not 100% sure). I haven't tested paged Adam without expandable_segments:True yet.
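For anyone reproducing this, expandable_segments:True is an option of the PyTorch CUDA caching allocator, set through the `PYTORCH_CUDA_ALLOC_CONF` environment variable. Below is a small sketch of setting it from Python while keeping any options that are already present. The helper name `with_expandable_segments` is hypothetical.

```python
import os


def with_expandable_segments(current: str = "") -> str:
    """Return an allocator config string with expandable_segments:True set."""
    # Options are comma-separated "key:value" pairs; drop any existing
    # expandable_segments entry so that ours takes effect.
    options = [
        opt.strip()
        for opt in current.split(",")
        if opt.strip() and not opt.strip().startswith("expandable_segments:")
    ]
    options.append("expandable_segments:True")
    return ",".join(options)


# This must happen before the first CUDA allocation; the safest place is
# before importing torch at all.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_expandable_segments(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF", "")
)
```

The same setting can be passed on the command line by prefixing the training command with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`.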

Regardless, I think adding an extra option for low memory single GPU training is beneficial, even if it is not well-tested yet.

cc @msaroufim
