Describe the bug
Following the DreamBooth example featured in the repo, training goes out of memory while DeepSpeed is initializing the optimizer states. Tested on a 3080 10GB with 64GB of RAM, both under WSL2 and on native Linux.
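For a rough sense of why 10 GB of VRAM plus CPU offload still runs out, here is a back-of-envelope sketch (my own arithmetic, assuming the single ~859M-parameter partition reported in the logs below and the usual ZeRO stage 2 + CPU-offload layout of fp16 weights on the GPU, with fp32 master weights, Adam states, and a pinned gradient buffer on the CPU):

```python
# Back-of-envelope memory estimate (assumption-based sketch, not taken from
# DeepSpeed internals). The partition size comes from the log line
# "Rank: 0 partition count [1] and sizes[(859520964, False)]".
n_params = 859_520_964
gib = 1024 ** 3

gpu_fp16_weights = n_params * 2 / gib     # fp16 model weights on the GPU (~1.6 GiB, roughly matches "MA 1.66 GB")
cpu_fp32_master = n_params * 4 / gib      # fp32 master weights offloaded to CPU (~3.2 GiB)
cpu_adam_states = 2 * n_params * 4 / gib  # Adam exp_avg + exp_avg_sq on CPU (~6.4 GiB)
cpu_pinned_grads = n_params * 4 / gib     # pinned fp32 gradient partition (~3.2 GiB), where the traceback fails

print(f"GPU fp16 weights:          ~{gpu_fp16_weights:.1f} GiB")
print(f"CPU fp32 master weights:   ~{cpu_fp32_master:.1f} GiB")
print(f"CPU Adam optimizer states: ~{cpu_adam_states:.1f} GiB")
print(f"CPU pinned grad partition: ~{cpu_pinned_grads:.1f} GiB")
```

Note that the traceback below fails inside `pin_memory()`, so the "CUDA error: out of memory" appears to come from allocating the pinned (page-locked) host buffer rather than from an allocation on the GPU itself.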
Reproduction
Follow the pastebin for the setup steps (on WSL2), or just try it yourself: https://pastebin.com/0NHA5YTP
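If it helps narrow this down, below is a minimal, hypothetical isolation snippet (not part of the original repro) that only exercises the `pin_memory()` call on a CPU buffer of the same size as the failing gradient partition:

```python
import torch

# Hypothetical isolation sketch: pin a CPU tensor of the same size as the
# gradient partition reported in the logs (859,520,964 fp32 elements, ~3.2 GiB).
# DeepSpeed's stage_1_and_2.py pins a buffer like this when CPU offload is on.
n_elements = 859_520_964

grad_partition = torch.zeros(n_elements, dtype=torch.float32)
pinned = grad_partition.pin_memory()  # the traceback in the Logs section fails on the equivalent call
print("pinned:", pinned.is_pinned())
```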
Logs
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `8` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2022-10-11 17:16:38,700] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2022-10-11 17:16:48,338] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.3, git-hash=unknown, git-branch=unknown
[2022-10-11 17:16:50,220] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-10-11 17:16:50,221] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2022-10-11 17:16:50,221] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2022-10-11 17:16:50,271] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = {basic_optimizer.__class__.__name__}
[2022-10-11 17:16:50,272] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-10-11 17:16:50,272] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:134:__init__] Reduce bucket size 500000000
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:135:__init__] Allgather bucket size 500000000
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:136:__init__] CPU Offload: True
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:137:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.15820693969726562 seconds
Rank: 0 partition count [1] and sizes[(859520964, False)]
[2022-10-11 17:16:52,613] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-10-11 17:16:52,614] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB Max_MA 1.66 GB CA 3.27 GB Max_CA 3 GB
[2022-10-11 17:16:52,614] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 7.68 GB, percent = 16.3%
Traceback (most recent call last):
File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 598, in <module>
main()
File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 478, in main
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 679, in prepare
result = self._prepare_deepspeed(*args)
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 890, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/__init__.py", line 124, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 320, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1144, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1395, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 512, in __init__
self.initialize_optimizer_states()
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 599, in initialize_optimizer_states
i].grad = single_grad_partition.pin_memory(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 59) of binary: /root/anaconda3/envs/diffusers-ttl/bin/python
System Info
RTX 3080 10GB, 64GB RAM; reproduced under both WSL2 and native Linux.