
Extremely weird DDP issue for train_second.py #7


Description

@yl4579

So far train_second.py only works with DataParallel (DP) but not DistributedDataParallel (DDP). The major problem is that if we simply translate DP to DDP (code in the comment section), we encounter the following error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 6; expected version 5 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
It is insanely difficult to debug. The tensor has no batch dimension, which suggests it is a parameter of the network. I traced it to the bias term of the last Conv1D layer of predictor_encoder (the prosodic style encoder): https://github.com/yl4579/StyleTTS2/blob/main/models.py#L152. This is extremely weird because the error is not triggered by any Conv1D layer before this one.
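To make the error class concrete: it fires when a tensor that autograd saved during the forward pass is modified in place before backward() reads it. Below is a self-contained toy (not StyleTTS2 code) that reproduces the exact message, together with two debugging aids that can help narrow down which parameter changed: torch.autograd.set_detect_anomaly and the internal ._version counter.

```python
# Toy reproduction of the error class (unrelated to StyleTTS2): a parameter that
# autograd saved for backward is updated in place between forward and backward.
import torch
import torch.nn as nn

# Anomaly mode prints the forward-pass traceback of the op whose backward fails,
# which is how the "backtrace further above" hint in the error becomes useful.
torch.autograd.set_detect_anomaly(True)

lin = nn.Linear(4, 4)
x = torch.randn(2, 4, requires_grad=True)   # input grad forces the weight to be saved
y = lin(x).sum()

print(lin.weight._version)                  # internal counter of in-place modifications
with torch.no_grad():
    lin.weight.add_(1.0)                    # in-place update before backward()
print(lin.weight._version)                  # incremented -> fails the saved-tensor version check

y.backward()  # RuntimeError: one of the variables needed for gradient computation
              # has been modified by an inplace operation ...
```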

More mysteriously, the issue disappears if we add model.predictor_encoder.train() near line 250 of train_second.py. However, this causes the F0 loss to be much higher than without this line. That is true for both DP and DDP, so the higher F0 loss is caused by model.predictor_encoder.train() itself, not by DDP. Unfortunately, predictor_encoder, which is a StyleEncoder, has no module that changes its behavior depending on whether it is in train or eval mode: the output is exactly the same either way.
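One way to check this claim directly is to run the same input through the encoder in both modes and compare the outputs. In the sketch below, the StyleEncoder constructor arguments and the input shape are assumptions and need to be adjusted to the actual config/models.py values.

```python
# Sanity-check sketch: does StyleEncoder behave differently in train vs. eval mode?
# Constructor arguments and input shape are assumptions -- adjust to the real config.
import torch
from models import StyleEncoder  # from the StyleTTS2 repo

enc = StyleEncoder(dim_in=64, style_dim=128, max_conv_dim=512)  # assumed args
mel = torch.randn(1, 1, 80, 192)                                # assumed (B, 1, n_mels, T) input

with torch.no_grad():
    enc.train()
    out_train = enc(mel)
    enc.eval()
    out_eval = enc(mel)

# True here would confirm that no component of StyleEncoder depends on the mode,
# matching the observation above.
print(torch.allclose(out_train, out_eval))
```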

TLDR: There are three issues with train_second.py:

  1. DDP does not work because of the in-place operation error. The error disappears if model.predictor_encoder.train() is called before training.
  2. However, model.predictor_encoder.train() causes the F0 loss to be much higher after convergence. This issue is independent of whether DP or DDP is used.
  3. model.predictor_encoder is an instance of StyleEncoder, which has no components that change the output depending on train or eval mode.

This problem has bugged me for more than a month, and I can't find a solution. Any insight into how to fix it would be greatly appreciated. I have pasted the broken DDP code with accelerator in the comments below.
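For context, the general shape of such a DP-to-DDP translation with Hugging Face accelerate is sketched here with a dummy model; this only illustrates the prepare()/backward() pattern and is not the actual training code from the comments.

```python
# Minimal sketch of the DP -> DDP pattern with Hugging Face `accelerate`.
# The nn.Linear model, optimizer, and dataloader are dummy stand-ins; the real
# train_second.py would pass each StyleTTS2 submodule and optimizer through
# accelerator.prepare.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = nn.Linear(80, 1)                     # dummy stand-in for the StyleTTS2 modules
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(32, 80), torch.randn(32, 1)), batch_size=8)

# accelerate wraps the model in DDP (when launched with multiple processes)
# and shards the dataloader across processes.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)               # replaces loss.backward() under DDP
    optimizer.step()
```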
