In the second stage of training, I kept the batch to 16, and the Nan issues are not shown again with 8 GPU training.
However, once the epoch reaches 50, which is set for joint_epoch in config, I have run into the errors:

Traceback (most recent call last): File "StyleTTS2/train_second.py", line 789, in <module> main() File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1157, in __call__ return self.main(*args, **kwargs) File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1078, in main rv = self.invoke(ctx) File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 1434, in invoke return ctx.invoke(self.callback, **ctx.params) File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/click/core.py", line 783, in invoke return __callback(*args, **kwargs) File "StyleTTS2/train_second.py", line 497, in main loss_gen_lm.backward() File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/torch/_tensor.py", line 492, in backward torch.autograd.backward( File "/home/ubuntu/miniconda3/envs/styletts2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251, in backward Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [1, 1, 1]] is at version 3; expected version 1 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

The issue happens during the backward for loss_gen_lm. my pytorch version is 2.1.0

g_loss is None in second stage training #11

Description

Activity

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions