
Bad conversion quality after retraining #33

@kvnsq

Hi,
first of all, thanks for the great work on the AutoVC system.
I have tried to replicate the system, but could not achieve nearly the same quality as the pre-trained model.
I use the same pre-processing for the mel-spectrograms as discussed in issue #4.
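A condensed sketch of my pre-processing (the constants reflect my reading of issue #4, so treat this as a sketch rather than the repo's exact script):

import librosa
import numpy as np
from scipy import signal

# 16 kHz audio, 1024-point FFT, 256-sample hop, 80 mel bins (per issue #4)
SR, N_FFT, HOP = 16000, 1024, 256
mel_basis = librosa.filters.mel(sr=SR, n_fft=N_FFT, fmin=90, fmax=7600, n_mels=80).T
min_level = np.exp(-100 / 20 * np.log(10))  # -100 dB floor
b, a = signal.butter(5, 30 / (SR / 2), btype="high")  # 30 Hz high-pass

def wav_to_mel(x: np.ndarray) -> np.ndarray:
    # Waveform at 16 kHz -> mel-spectrogram normalized to [0, 1]
    y = signal.filtfilt(b, a, x)
    stft_mag = np.abs(librosa.stft(y, n_fft=N_FFT, hop_length=HOP)).T
    d_mel = stft_mag @ mel_basis  # (num_frames, 80)
    d_db = 20 * np.log10(np.maximum(min_level, d_mel)) - 16
    return np.clip((d_db + 100) / 100, 0, 1)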
I trained the system with the same 20 VCTK speakers as in the paper's experiment (additionally with 8 speakers from the VCC data set; the results were similar when those were omitted).
I additionally used one-hot encodings instead of the speaker embeddings of an encoder, i.e. something like:
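import torch

# One one-hot row per speaker, used in place of d-vectors from a speaker encoder
# (28 = 20 VCTK + 8 VCC speakers in my setup)
num_speakers = 28
speaker_embedding_mat = torch.eye(num_speakers)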

I trained for about 300,000 steps using Adam with default parameters and a learning rate of 0.0001; the training loss is about 6.67e-3, and the validation loss is about 0.01 and rising. I have also tried other learning rates (0.001 and 0.0005) with no improvement in quality. The converted mel-spectrograms are still blurry and produce a low-quality, robotic voice.
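The optimizer setup is simply (optim is the name used in the training loop below):

import torch

optim = torch.optim.Adam(generator.parameters(), lr=1e-4)  # default betas and eps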
In comparison, the converted mel-spectrograms of the supplied AutoVC model are much sharper and produce a more natural voice, even when vocoded with Griffin-Lim.
Here are the mel-spectrograms from my retrained model and from the model supplied in the repo:

Retrained model: p270-p228-own
Supplied model: p270-p228-paper
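For listening tests, I invert the mel-spectrograms roughly like this (a librosa-based sketch using the same mel parameters as in the pre-processing above):

import librosa
import numpy as np

def mel_to_wav(S: np.ndarray) -> np.ndarray:
    # Invert a [0, 1]-normalized mel-spectrogram with Griffin-Lim
    d_db = S * 100 - 100 + 16  # undo normalization and the -16 dB offset
    d_mel = 10 ** (d_db / 20)  # dB -> linear magnitude
    # mel -> linear magnitude (NNLS), then Griffin-Lim phase reconstruction
    return librosa.feature.inverse.mel_to_audio(
        d_mel.T, sr=16000, n_fft=1024, hop_length=256,
        power=1.0, fmin=90, fmax=7600)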

Here is a minimal example of the loss computation and training loop I use.
I can also provide more of my code if needed.

import torch.nn.functional as F


def train_step(mel_spec_batch, embeddings_batch, generator, optimizer,
               weight_mu_zero_rec: float, weight_lambda_content: float):
    optimizer.zero_grad()

    # Add a channel dimension for the reconstruction targets:
    # (batch_size=2, 1, num_frames=128, num_mels=80)
    mel_spec_batch_exp = mel_spec_batch.unsqueeze(1)
    # Self-reconstruction: source and target embeddings are the same speaker
    mel_outputs, mel_outputs_postnet, content_codes_mel_input = generator(mel_spec_batch,
                                                                          embeddings_batch,
                                                                          embeddings_batch)
    # Re-encode the postnet output into content codes with self.encoder,
    # without running the decoder and postnet a second time
    content_codes_gen_output = generator.get_content_codes(mel_outputs_postnet, embeddings_batch)

    # Reconstruction loss after the postnet
    rec_loss = F.mse_loss(input=mel_outputs_postnet, target=mel_spec_batch_exp, reduction="mean")
    # Reconstruction loss before the postnet
    rec_0_loss = F.mse_loss(input=mel_outputs, target=mel_spec_batch_exp, reduction="mean")
    # Content-code consistency loss
    content_loss = F.l1_loss(input=content_codes_gen_output, target=content_codes_mel_input, reduction="mean")
    total_loss = rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss

    total_loss.backward()
    optimizer.step()

# Training loop
for epoch in range(start_epoch + 1, args[FLAGS.MAX_NUM_EPOCHS] + 1):
    generator.train()
    # Iterate over mel-spectrogram slices and the indices of their speakers
    for step_idx, (mel_spec_batch, speaker_idx_batch) in enumerate(train_set_loader):
        # Look up the speaker embeddings for the speakers in the batch
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        train_step(mel_spec_batch.to(device), spkr_embeddings, generator, optim,
                   weight_mu_zero_rec=args[FLAGS.AUTO_VC_MU_REC_LOSS_BEFORE_POSTNET],  # == 1.0
                   weight_lambda_content=args[FLAGS.AUTO_VC_LAMBDA_CONTENT_LOSS])      # == 1.0
    # The rest computes the validation loss, resynthesizes utterances, saves the model every n epochs, etc.
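For completeness, the validation loss uses the same three terms, just without the backward pass; roughly:

import torch
import torch.nn.functional as F

@torch.no_grad()
def compute_validation_loss(val_set_loader, generator):
    generator.eval()
    losses = []
    for mel_spec_batch, speaker_idx_batch in val_set_loader:
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        mel_spec_batch = mel_spec_batch.to(device)
        mel_outputs, mel_outputs_postnet, content_codes = generator(
            mel_spec_batch, spkr_embeddings, spkr_embeddings)
        content_codes_out = generator.get_content_codes(mel_outputs_postnet, spkr_embeddings)
        target = mel_spec_batch.unsqueeze(1)
        loss = (F.mse_loss(mel_outputs_postnet, target)
                + F.mse_loss(mel_outputs, target)
                + F.l1_loss(content_codes_out, content_codes))
        losses.append(loss.item())
    return sum(losses) / len(losses)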

Does anyone have an idea what is wrong with my re-implementation, or has anyone managed to re-implement the system with good quality?

Thanks a lot in advance.
