Cannot restart training after training tenc 2 AND using fused_backward_pass

If you finetune SDXL base with:
```
--train_text_encoder --learning_rate_te1 1e-10 --learning_rate_te2 1e-10 --fused_backward_pass
```
Training then runs fine. But if you stop and restart training from a saved checkpoint, e.g. the `<whatever>-step00001000.safetensors` file, you get this error message:
```
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1280, 1280]], which is output 0 of AsStridedBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
```
This doesn't happen if you only train text encoder 1 (te1) and the unet. It also only happens when you use `--fused_backward_pass`.
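
For anyone unfamiliar with this class of error, here is a toy, pure-Python model of the check behind the "is at version 1; expected version 0" message. It is not PyTorch code and not taken from sd-scripts: `ToyTensor`, `SavedForBackward`, `forward_mul` and `backward_mul` are hypothetical names made up for illustration. The idea it sketches is that a tensor saved during the forward pass remembers its version, and an in-place change before backward uses it trips the check. Whether `--fused_backward_pass` triggers the error this way (for example by updating a weight in place while backward still needs it) is an assumption, not something confirmed here.

```python
class ToyTensor:
    """Tiny stand-in for a tensor: a float value plus a version counter."""

    def __init__(self, value):
        self.value = value
        self.version = 0  # bumped on every in-place modification

    def add_(self, delta):
        """In-place add, mimicking an in-place weight update."""
        self.value += delta
        self.version += 1
        return self


class SavedForBackward:
    """Remembers a tensor and the version it had when forward ran."""

    def __init__(self, tensor):
        self.tensor = tensor
        self.expected_version = tensor.version

    def unpack(self):
        # Refuse to hand back a tensor that was modified after being saved.
        if self.tensor.version != self.expected_version:
            raise RuntimeError(
                f"saved tensor is at version {self.tensor.version}; "
                f"expected version {self.expected_version} instead"
            )
        return self.tensor.value


def forward_mul(x, w):
    """y = x * w. The gradient w.r.t. x needs w, so w is saved."""
    return x * w.value, SavedForBackward(w)


def backward_mul(grad_out, saved_w):
    """dy/dx = w, read back from the saved tensor."""
    return grad_out * saved_w.unpack()


w = ToyTensor(3.0)
y, saved = forward_mul(2.0, w)
print(backward_mul(1.0, saved))  # w untouched, so the gradient is computed

w.add_(-0.1)  # in-place update before backward is done with w
try:
    backward_mul(1.0, saved)
except RuntimeError as err:
    print("RuntimeError:", err)
```

In real PyTorch, the hint in the error message (`torch.autograd.set_detect_anomaly(True)`) is the way to find which forward operation saved the tensor that was later modified.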


Full call stack:
```
Traceback (most recent call last):
  File "/home/ara/m.2/Dev/sdxl/sd-scripts/./sdxl_train.py", line 944, in <module>
    train(args)
  File "/home/ara/m.2/Dev/sdxl/sd-scripts/./sdxl_train.py", line 733, in train
    accelerator.backward(loss)
  File "/home/ara/m.2/Dev/sdxl/sd-scripts/venv/lib/python3.10/site-packages/accelerate/accelerator.py", line 1905, in backward
    loss.backward(**kwargs)
  File "/home/ara/m.2/Dev/sdxl/sd-scripts/venv/lib/python3.10/site-packages/torch/_tensor.py", line 522, in backward
    torch.autograd.backward(
  File "/home/ara/m.2/Dev/sdxl/sd-scripts/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/home/ara/m.2/Dev/sdxl/sd-scripts/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 289, in apply
    return user_fn(self, *args)
  File "/home/ara/m.2/Dev/sdxl/sd-scripts/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 319, in backward
    torch.autograd.backward(outputs_with_grad, args_with_grad)
  File "/home/ara/m.2/Dev/sdxl/sd-scripts/venv/lib/python3.10/site-packages/torch/autograd/__init__.py", line 266, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
```
(I also mentioned this bug back when the original pull request was opened: see https://github.com/kohya-ss/sd-scripts/pull/1259)

Cannot restart training after training tenc 2 AND using fused_backward_pass #1369

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Uh oh!

Cannot restart training after training tenc 2 AND using fused_backward_pass #1369

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions