Hi,
first of all thanks for the great work on the AutoVC system.
I have tried to replicate the system, but could not achieve nearly the same quality as the pre-trained model.
I use the same pre-processing for the mel-spectrograms as discussed in issue #4
and have trained the system with the same 20 VCTK speakers used in the experiment of the paper (additionally with 8 speakers from the VCC data set; however, results were similar when they were omitted).
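For completeness, here is a sketch of the mel extraction I use. The constants (16 kHz, 1024-point FFT, hop length 256, 80 mel bands between 90 and 7600 Hz, amplitude-dB normalized to [0, 1]) are my reading of #4, and I use plain librosa/scipy calls instead of the repo's own STFT helper, so treat the details as assumptions rather than an exact match:

```python
import numpy as np
import librosa
from scipy.signal import butter, filtfilt

def wav_to_mel(path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    # Load and high-pass filter the waveform (cutoff around 30 Hz)
    wav, _ = librosa.load(path, sr=sr)
    b, a = butter(5, 30 / (sr / 2), btype="high")
    wav = filtfilt(b, a, wav).astype(np.float32)
    # Linear-magnitude STFT -> mel filterbank (fmin=90, fmax=7600)
    stft = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=90, fmax=7600)
    mel = np.dot(mel_basis, stft)
    # Amplitude-dB with a -100 dB floor and -16 dB reference, normalized to [0, 1]
    mel_db = 20 * np.log10(np.maximum(1e-5, mel)) - 16
    return np.clip((mel_db + 100) / 100, 0, 1).T  # (num_frames, num_mels)
```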
I additionally used one-hot encodings instead of speaker embeddings from a speaker encoder.
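Concretely, the one-hot "embeddings" are just the rows of an identity matrix, one per training speaker (a minimal sketch; `num_speakers` and the matrix name are mine):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_speakers = 28  # 20 VCTK speakers + 8 VCC speakers in my setup
# Row i is the one-hot "embedding" of speaker i; kept on `device` so it can be
# indexed with the speaker indices inside the training loop further below
speaker_embedding_mat = torch.eye(num_speakers, dtype=torch.float32, device=device)
```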
I trained for about 300,000 steps using Adam with default parameters and a learning rate of 0.0001; the train loss is about 6.67e-3 and the validation loss is about 0.01 and rising. I have also tried other learning rates (0.001, 0.0005) with no improvement in quality. The converted mel-spectrograms are still blurry and produce a low-quality, robotic voice.
In comparison, the converted mel-spectrograms of the supplied AutoVC model are much sharper and produce a more natural voice, even when used with Griffin-Lim.
Here are the mel-spectrograms of my retrained model and of the model supplied in the repo:
| Retrained model | Supplied model |
| --- | --- |
| ![]() | ![]() |
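For these quick listening comparisons I invert the mels with Griffin-Lim, roughly as sketched below; undoing the dB normalization with the constants from my pre-processing is an assumption, so adjust it to your own pipeline:

```python
import librosa

def mel_to_wav_griffin_lim(mel, sr=16000, n_fft=1024, hop_length=256):
    # mel: (num_frames, num_mels) in [0, 1]; undo the dB normalization first
    mel_db = mel.T * 100 - 100 + 16
    mel_amp = librosa.db_to_amplitude(mel_db)  # back to linear magnitude
    # Invert the mel filterbank and reconstruct phase with Griffin-Lim
    return librosa.feature.inverse.mel_to_audio(
        mel_amp, sr=sr, n_fft=n_fft, hop_length=hop_length,
        power=1.0, fmin=90, fmax=7600, n_iter=100)
```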
Here is a minimal example of the loss and training loop I use.
I can also provide more of my code if that would help.
import torch
import torch.nn.functional as F

def train_step(mel_spec_batch, embeddings_batch, generator, optimizer,
               weight_mu_zero_rec: float, weight_lambda_content: float):
    optimizer.zero_grad()
    # Target with an extra channel dim: (batch_size=2, 1, num_frames=128, num_mels=80)
    mel_spec_batch_exp = mel_spec_batch.unsqueeze(1)
    mel_outputs, mel_outputs_postnet, content_codes_mel_input = generator(mel_spec_batch,
                                                                          embeddings_batch,
                                                                          embeddings_batch)
    # Returns content codes with self.encoder without running the decoder and postnet a second time
    content_codes_gen_output = generator.get_content_codes(mel_outputs_postnet, embeddings_batch)
    # Reconstruction loss after the postnet, before the postnet, and the content-code loss
    rec_loss = F.mse_loss(input=mel_outputs_postnet, target=mel_spec_batch_exp, reduction="mean")
    rec_0_loss = F.mse_loss(input=mel_outputs, target=mel_spec_batch_exp, reduction="mean")
    content_loss = F.l1_loss(input=content_codes_gen_output, target=content_codes_mel_input, reduction="mean")
    total_loss = rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss
    total_loss.backward()
    optimizer.step()
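For context, `get_content_codes` is just a thin wrapper around the content encoder (similar in spirit to calling the original Generator's forward without a target embedding). A sketch of my version; the exact shapes and the concatenation are specific to my own Generator class, not the repo's API:

```python
import torch
from torch import nn

class Generator(nn.Module):
    ...  # content encoder, decoder and postnet as in the paper

    def get_content_codes(self, mel_spec_batch, speaker_embeddings):
        # Drop the channel dim of the postnet output: (B, 1, T, n_mels) -> (B, T, n_mels)
        codes = self.encoder(mel_spec_batch.squeeze(1), speaker_embeddings)
        # The content encoder returns a list of down-sampled code segments;
        # concatenate them the same way the forward pass does
        return torch.cat(codes, dim=-1)
```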
# Training loop
for epoch in range(start_epoch + 1, args[FLAGS.MAX_NUM_EPOCHS] + 1):
    generator.train()
    # Iterate over mel-spectrogram slices and the indices of their speakers
    for step_idx, (mel_spec_batch, speaker_idx_batch) in enumerate(train_set_loader):
        # Load the speaker embeddings of the speakers of the mel-spectrograms
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        train_step(mel_spec_batch.to(device), spkr_embeddings, generator, optim,
                   weight_mu_zero_rec=args[FLAGS.AUTO_VC_MU_REC_LOSS_BEFORE_POSTNET],  # == 1.0
                   weight_lambda_content=args[FLAGS.AUTO_VC_LAMBDA_CONTENT_LOSS])  # == 1.0
# The rest computes the validation loss, resynthesizes utterances, saves the model every n epochs, etc.
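The validation loss uses the same three terms, just without the backward pass; roughly like this (a sketch, with `val_set_loader` and the equal loss weights assumed as above):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_loss(generator, val_set_loader, speaker_embedding_mat, device):
    generator.eval()
    total, num_batches = 0.0, 0
    for mel_spec_batch, speaker_idx_batch in val_set_loader:
        mel_spec_batch = mel_spec_batch.to(device)
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)]
        target = mel_spec_batch.unsqueeze(1)
        mel_outputs, mel_outputs_postnet, codes = generator(mel_spec_batch, spkr_embeddings, spkr_embeddings)
        codes_rec = generator.get_content_codes(mel_outputs_postnet, spkr_embeddings)
        # Same three terms as in train_step, all weighted 1.0
        loss = (F.mse_loss(mel_outputs_postnet, target)
                + F.mse_loss(mel_outputs, target)
                + F.l1_loss(codes_rec, codes))
        total += loss.item()
        num_batches += 1
    return total / max(num_batches, 1)
```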
Does anyone have an idea what might be wrong with my re-implementation, or has anyone else managed to re-implement the system with comparable quality?
Thanks a lot in advance.