
CUDA out of memory despite having 13GiB of free memory. #477

Closed
@jamesWalker55

Description

I am training a LoHa on a 3090. When I start training, it fails with the following error:

CUDA SETUP: Loading binary D:\stable-diffusion\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit AdamW optimizer | {}
running training / 学習開始
  num train images * repeats / 学習画像の数×繰り返し回数: 800
  num reg images / 正則化画像の数: 0
  num batches per epoch / 1epochのバッチ数: 50
  num epochs / epoch数: 30
  batch size per device / バッチサイズ: 8
  gradient accumulation steps / 勾配を合計するステップ数 = 1
  total optimization steps / 学習ステップ数: 1500
steps:   0%|                                                                                               | 0/1500 [00:00<?, ?it/s]epoch 1/30
Traceback (most recent call last):
  File "D:\stable-diffusion\kohya_ss\train_network.py", line 699, in <module>
    train(args)
  File "D:\stable-diffusion\kohya_ss\train_network.py", line 538, in train
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\accelerate\utils\operations.py", line 490, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\torch\amp\autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\diffusers\models\unet_2d_condition.py", line 407, in forward
    sample = upsample_block(
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\diffusers\models\unet_2d_blocks.py", line 1203, in forward
    hidden_states = attn(hidden_states, encoder_hidden_states=encoder_hidden_states).sample
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\diffusers\models\attention.py", line 216, in forward
    hidden_states = block(hidden_states, context=encoder_hidden_states, timestep=timestep)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\diffusers\models\attention.py", line 494, in forward
    hidden_states = self.ff(self.norm3(hidden_states)) + hidden_states
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\diffusers\models\attention.py", line 709, in forward
    hidden_states = module(hidden_states)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\torch\nn\modules\module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\diffusers\models\attention.py", line 756, in forward
    return hidden_states * self.gelu(gate)
RuntimeError: CUDA out of memory. Tried to allocate 30.00 MiB (GPU 0; 24.00 GiB total capacity; 8.54 GiB already allocated; 13.28 GiB free; 8.73 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
steps:   0%|                                                                                               | 0/1500 [00:26<?, ?it/s]
Traceback (most recent call last):
  File "C:\Users\James\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Users\James\AppData\Local\Programs\Python\Python310\lib\runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "D:\stable-diffusion\kohya_ss\venv\Scripts\accelerate.exe\__main__.py", line 7, in <module>
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\accelerate\commands\accelerate_cli.py", line 45, in main
    args.func(args)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "D:\stable-diffusion\kohya_ss\venv\lib\site-packages\accelerate\commands\launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['D:\\stable-diffusion\\kohya_ss\\venv\\Scripts\\python.exe', 'train_network.py', '--enable_bucket', '--pretrained_model_name_or_path=D:/stable-diffusion/lib/models/anythingV5Anything_anythingV5PrtRE.safetensors', '--train_data_dir=D:\\stable-diffusion\\training\\2023-03-26 kore zombie ts\\src3', '--resolution=512,512', '--output_dir=D:\\stable-diffusion\\training\\2023-03-26 kore zombie ts\\trn_v3', '--logging_dir=', '--network_alpha=1', '--save_model_as=safetensors', '--network_module=lycoris.kohya', '--network_args', 'conv_dim=1', 'conv_alpha=1', 'algo=loha', '--text_encoder_lr=5e-5', '--unet_lr=0.0001', '--network_dim=8', '--output_name=korets_v3.0', '--lr_scheduler_num_cycles=30', '--learning_rate=0.0001', '--lr_scheduler=cosine', '--lr_warmup_steps=150', '--train_batch_size=8', '--max_train_steps=1500', '--save_every_n_epochs=1', '--mixed_precision=bf16', '--save_precision=fp16', '--caption_extension=.txt', '--cache_latents', '--optimizer_type=AdamW8bit', '--clip_skip=2', '--bucket_reso_steps=64', '--xformers', '--bucket_no_upscale', '--sample_sampler=euler_a', '--sample_prompts=D:\\stable-diffusion\\training\\2023-03-26 kore zombie ts\\trn_v3\\sample\\prompt.txt', '--sample_every_n_epochs=1']' returned non-zero exit status 1.

In particular, it says: Tried to allocate 30.00 MiB (GPU 0; 24.00 GiB total capacity; 8.54 GiB already allocated; 13.28 GiB free; 8.73 GiB reserved in total by PyTorch). I'm not sure whether this is a bug: it reports far more free memory than it is trying to allocate, yet the allocation still fails.
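
For what it's worth, the allocator hint at the end of the error can be applied by setting PYTORCH_CUDA_ALLOC_CONF before the training process allocates any CUDA memory. A minimal sketch, assuming an illustrative split size of 128 MiB (the exact value is not prescribed by the error message):

    import os

    # Hint from the error message: cap the largest split block the caching
    # allocator keeps, to reduce fragmentation. 128 MiB is an arbitrary
    # example value, not a recommendation from the log above.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

    import torch  # import (and any CUDA use) must come after the variable is set

Equivalently, the variable can be set in the shell (e.g. with `set` in Windows cmd) before launching accelerate, which avoids editing train_network.py.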
