Hi,
first of all thanks for the great work on the AutoVC system.
I have tried to replicate the system, but could not achieve nearly the same quality as the pre-trained model.
I use the same pre-processing for the mel-spectrograms as discussed in issue #4
and have trained the system with the same 20 VCTK speakers used in the experiment of the paper (additionally with 8 speakers from the VCC data set; however, results were similar when they were omitted).
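For completeness, here is a sketch of the mel extraction I use. The constants (16 kHz, 1024-point FFT, hop length 256, 80 mel bands between 90 and 7600 Hz, amplitude-dB normalized to [0, 1]) are my reading of #4, and I use plain librosa/scipy calls instead of the repo's own STFT helper, so treat the details as assumptions rather than an exact match:

```python
import numpy as np
import librosa
from scipy.signal import butter, filtfilt

def wav_to_mel(path, sr=16000, n_fft=1024, hop_length=256, n_mels=80):
    # Load and high-pass filter the waveform (cutoff around 30 Hz)
    wav, _ = librosa.load(path, sr=sr)
    b, a = butter(5, 30 / (sr / 2), btype="high")
    wav = filtfilt(b, a, wav).astype(np.float32)
    # Linear-magnitude STFT -> mel filterbank (fmin=90, fmax=7600)
    stft = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop_length))
    mel_basis = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=90, fmax=7600)
    mel = np.dot(mel_basis, stft)
    # Amplitude-dB with a -100 dB floor and -16 dB reference, normalized to [0, 1]
    mel_db = 20 * np.log10(np.maximum(1e-5, mel)) - 16
    return np.clip((mel_db + 100) / 100, 0, 1).T  # (num_frames, num_mels)
```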
I additionally used one-hot encodings instead of speaker embeddings from a speaker encoder.
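Concretely, the one-hot "embeddings" are just the rows of an identity matrix, one per training speaker (a minimal sketch; `num_speakers` and the matrix name are mine):

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

num_speakers = 28  # 20 VCTK speakers + 8 VCC speakers in my setup
# Row i is the one-hot "embedding" of speaker i; kept on `device` so it can be
# indexed with the speaker indices inside the training loop further below
speaker_embedding_mat = torch.eye(num_speakers, dtype=torch.float32, device=device)
```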
I trained for about 300,000 steps using Adam with default parameters and a learning rate of 0.0001; the train loss is about 6.67e-3 and the validation loss is about 0.01 and rising. I have also tried other learning rates (0.001, 0.0005) with no improvement in quality. The converted mel-spectrograms are still blurry and produce a low-quality, robotic voice.
In comparison, the converted mel-spectrograms of the supplied AutoVC model are much sharper and produce a more natural voice, even when used with Griffin-Lim.
Here are the mel-spectrograms of my retrained model and of the model supplied in the repo:
| Retrained model | Supplied model |
| --- | --- |
| ![]() | ![]() |
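For these quick listening comparisons I invert the mels with Griffin-Lim, roughly as sketched below; undoing the dB normalization with the constants from my pre-processing is an assumption, so adjust it to your own pipeline:

```python
import librosa

def mel_to_wav_griffin_lim(mel, sr=16000, n_fft=1024, hop_length=256):
    # mel: (num_frames, num_mels) in [0, 1]; undo the dB normalization first
    mel_db = mel.T * 100 - 100 + 16
    mel_amp = librosa.db_to_amplitude(mel_db)  # back to linear magnitude
    # Invert the mel filterbank and reconstruct phase with Griffin-Lim
    return librosa.feature.inverse.mel_to_audio(
        mel_amp, sr=sr, n_fft=n_fft, hop_length=hop_length,
        power=1.0, fmin=90, fmax=7600, n_iter=100)
```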
Here is a minimal example of the loss and training loop I use.
I can also provide more of my code if that would help.
import torch
import torch.nn.functional as F

def train_step(mel_spec_batch, embeddings_batch, generator, optimizer,
               weight_mu_zero_rec: float, weight_lambda_content: float):
    optimizer.zero_grad()
    # Target with an extra channel dim: (batch_size=2, 1, num_frames=128, num_mels=80)
    mel_spec_batch_exp = mel_spec_batch.unsqueeze(1)
    mel_outputs, mel_outputs_postnet, content_codes_mel_input = generator(mel_spec_batch,
                                                                          embeddings_batch,
                                                                          embeddings_batch)
    # Returns content codes with self.encoder without running the decoder and postnet a second time
    content_codes_gen_output = generator.get_content_codes(mel_outputs_postnet, embeddings_batch)
    # Reconstruction loss after the postnet, before the postnet, and the content-code loss
    rec_loss = F.mse_loss(input=mel_outputs_postnet, target=mel_spec_batch_exp, reduction="mean")
    rec_0_loss = F.mse_loss(input=mel_outputs, target=mel_spec_batch_exp, reduction="mean")
    content_loss = F.l1_loss(input=content_codes_gen_output, target=content_codes_mel_input, reduction="mean")
    total_loss = rec_loss + weight_mu_zero_rec * rec_0_loss + weight_lambda_content * content_loss
    total_loss.backward()
    optimizer.step()
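For context, `get_content_codes` is just a thin wrapper around the content encoder (similar in spirit to calling the original Generator's forward without a target embedding). A sketch of my version; the exact shapes and the concatenation are specific to my own Generator class, not the repo's API:

```python
import torch
from torch import nn

class Generator(nn.Module):
    ...  # content encoder, decoder and postnet as in the paper

    def get_content_codes(self, mel_spec_batch, speaker_embeddings):
        # Drop the channel dim of the postnet output: (B, 1, T, n_mels) -> (B, T, n_mels)
        codes = self.encoder(mel_spec_batch.squeeze(1), speaker_embeddings)
        # The content encoder returns a list of down-sampled code segments;
        # concatenate them the same way the forward pass does
        return torch.cat(codes, dim=-1)
```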
# Training loop
for epoch in range(start_epoch + 1, args[FLAGS.MAX_NUM_EPOCHS] + 1):
    generator.train()
    # Iterate over mel-spectrogram slices and the indices of their speakers
    for step_idx, (mel_spec_batch, speaker_idx_batch) in enumerate(train_set_loader):
        # Load the speaker embeddings of the speakers of the mel-spectrograms
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)].to(device)
        train_step(mel_spec_batch.to(device), spkr_embeddings, generator, optim,
                   weight_mu_zero_rec=args[FLAGS.AUTO_VC_MU_REC_LOSS_BEFORE_POSTNET],  # == 1.0
                   weight_lambda_content=args[FLAGS.AUTO_VC_LAMBDA_CONTENT_LOSS])  # == 1.0
# The rest computes the validation loss, resynthesizes utterances, saves the model every n epochs, etc.
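The validation loss uses the same three terms, just without the backward pass; roughly like this (a sketch, with `val_set_loader` and the equal loss weights assumed as above):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_loss(generator, val_set_loader, speaker_embedding_mat, device):
    generator.eval()
    total, num_batches = 0.0, 0
    for mel_spec_batch, speaker_idx_batch in val_set_loader:
        mel_spec_batch = mel_spec_batch.to(device)
        spkr_embeddings = speaker_embedding_mat[speaker_idx_batch.to(device)]
        target = mel_spec_batch.unsqueeze(1)
        mel_outputs, mel_outputs_postnet, codes = generator(mel_spec_batch, spkr_embeddings, spkr_embeddings)
        codes_rec = generator.get_content_codes(mel_outputs_postnet, spkr_embeddings)
        # Same three terms as in train_step, all weighted 1.0
        loss = (F.mse_loss(mel_outputs_postnet, target)
                + F.mse_loss(mel_outputs, target)
                + F.l1_loss(codes_rec, codes))
        total += loss.item()
        num_batches += 1
    return total / max(num_batches, 1)
```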
Does anyone have an idea what might be wrong with my re-implementation, or has anyone else managed to re-implement the system with comparable quality?
Thanks a lot in advance.