Describe the bug
Following the DreamBooth example featured in the repo, training goes out of memory while DeepSpeed is initializing the optimizer states. Tested on a 3080 10GB with 64GB of RAM, both under WSL2 and on native Linux.
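For a rough sense of why 10 GB of VRAM plus CPU offload still runs out, here is a back-of-envelope sketch (my own arithmetic, assuming the single ~859M-parameter partition reported in the logs below and the usual ZeRO stage 2 + CPU-offload layout of fp16 weights on the GPU, with fp32 master weights, Adam states, and a pinned gradient buffer on the CPU):

```python
# Back-of-envelope memory estimate (assumption-based sketch, not taken from
# DeepSpeed internals). The partition size comes from the log line
# "Rank: 0 partition count [1] and sizes[(859520964, False)]".
n_params = 859_520_964
gib = 1024 ** 3

gpu_fp16_weights = n_params * 2 / gib     # fp16 model weights on the GPU (~1.6 GiB, roughly matches "MA 1.66 GB")
cpu_fp32_master = n_params * 4 / gib      # fp32 master weights offloaded to CPU (~3.2 GiB)
cpu_adam_states = 2 * n_params * 4 / gib  # Adam exp_avg + exp_avg_sq on CPU (~6.4 GiB)
cpu_pinned_grads = n_params * 4 / gib     # pinned fp32 gradient partition (~3.2 GiB), where the traceback fails

print(f"GPU fp16 weights:          ~{gpu_fp16_weights:.1f} GiB")
print(f"CPU fp32 master weights:   ~{cpu_fp32_master:.1f} GiB")
print(f"CPU Adam optimizer states: ~{cpu_adam_states:.1f} GiB")
print(f"CPU pinned grad partition: ~{cpu_pinned_grads:.1f} GiB")
```

Note that the traceback below fails inside `pin_memory()`, so the "CUDA error: out of memory" appears to come from allocating the pinned (page-locked) host buffer rather than from an allocation on the GPU itself.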
Reproduction
Follow the pastebin for the setup steps (on WSL2), or just try it yourself: https://pastebin.com/0NHA5YTP
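If it helps narrow this down, below is a minimal, hypothetical isolation snippet (not part of the original repro) that only exercises the `pin_memory()` call on a CPU buffer of the same size as the failing gradient partition:

```python
import torch

# Hypothetical isolation sketch: pin a CPU tensor of the same size as the
# gradient partition reported in the logs (859,520,964 fp32 elements, ~3.2 GiB).
# DeepSpeed's stage_1_and_2.py pins a buffer like this when CPU offload is on.
n_elements = 859_520_964

grad_partition = torch.zeros(n_elements, dtype=torch.float32)
pinned = grad_partition.pin_memory()  # the traceback in the Logs section fails on the equivalent call
print("pinned:", pinned.is_pinned())
```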
Logs
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_cpu_threads_per_process` was set to `8` to improve out-of-box performance
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
[2022-10-11 17:16:38,700] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
[2022-10-11 17:16:48,338] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed info: version=0.7.3, git-hash=unknown, git-branch=unknown
[2022-10-11 17:16:50,220] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Flops Profiler Enabled: False
[2022-10-11 17:16:50,221] [INFO] [logging.py:68:log_dist] [Rank 0] Removing param_group that has no 'params' in the client Optimizer
[2022-10-11 17:16:50,221] [INFO] [logging.py:68:log_dist] [Rank 0] Using client Optimizer as basic optimizer
[2022-10-11 17:16:50,271] [INFO] [logging.py:68:log_dist] [Rank 0] DeepSpeed Basic Optimizer = {basic_optimizer.__class__.__name__}
[2022-10-11 17:16:50,272] [INFO] [utils.py:52:is_zero_supported_optimizer] Checking ZeRO support for optimizer=AdamW type=<class 'torch.optim.adamw.AdamW'>
[2022-10-11 17:16:50,272] [INFO] [logging.py:68:log_dist] [Rank 0] Creating fp16 ZeRO stage 2 optimizer
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:134:__init__] Reduce bucket size 500000000
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:135:__init__] Allgather bucket size 500000000
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:136:__init__] CPU Offload: True
[2022-10-11 17:16:50,272] [INFO] [stage_1_and_2.py:137:__init__] Round robin gradient partitioning: False
Using /root/.cache/torch_extensions/py39_cu113 as PyTorch extensions root...
Emitting ninja build file /root/.cache/torch_extensions/py39_cu113/utils/build.ninja...
Building extension module utils...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.15820693969726562 seconds
Rank: 0 partition count [1] and sizes[(859520964, False)]
[2022-10-11 17:16:52,613] [INFO] [utils.py:827:see_memory_usage] Before initializing optimizer states
[2022-10-11 17:16:52,614] [INFO] [utils.py:828:see_memory_usage] MA 1.66 GB Max_MA 1.66 GB CA 3.27 GB Max_CA 3 GB
[2022-10-11 17:16:52,614] [INFO] [utils.py:836:see_memory_usage] CPU Virtual Memory: used = 7.68 GB, percent = 16.3%
Traceback (most recent call last):
File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 598, in <module>
main()
File "/root/github/diffusers-ttl/examples/dreambooth/train_dreambooth.py", line 478, in main
unet, optimizer, train_dataloader, lr_scheduler = accelerator.prepare(
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 679, in prepare
result = self._prepare_deepspeed(*args)
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/accelerate/accelerator.py", line 890, in _prepare_deepspeed
engine, optimizer, _, lr_scheduler = deepspeed.initialize(**kwargs)
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/__init__.py", line 124, in initialize
engine = DeepSpeedEngine(args=args,
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 320, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1144, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1395, in _configure_zero_optimizer
optimizer = DeepSpeedZeroOptimizer(
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 512, in __init__
self.initialize_optimizer_states()
File "/root/anaconda3/envs/diffusers-ttl/lib/python3.9/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 599, in initialize_optimizer_states
i].grad = single_grad_partition.pin_memory(
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 59) of binary: /root/anaconda3/envs/diffusers-ttl/bin/python
System Info
RTX 3080 10GB, 64GB RAM; reproduced under both WSL2 and native Linux.