
How to execute denoising? #35

Open
a897456 opened this issue Oct 26, 2024 · 17 comments

Comments


a897456 commented Oct 26, 2024

@bigpon Hi
I'm trying to reproduce the denoising code.
https://github.com/facebookresearch/AudioDec?tab=readme-ov-file#bonus-track-denoising
In this paragraph you say to "prepare the noisy-clean corpus and follow the usage instructions in submit_denoise.sh to run the training and testing", but the example command below it invokes submit_autoencoder.sh. May I ask what should be done?


a897456 commented Oct 26, 2024

```python
self.model["generator"].quantizer.codebook.eval()
```

Is the denoising process the same as the autoencoder's? Does it require training with the metric loss first and then fixing those weights to continue training?


a897456 commented Oct 27, 2024

Hi @bigpon
I completed 20,000 training steps following stage 0 of submit_denoise.sh. However, when I started to execute stage 1, nothing seemed to happen at all.

AudioDec/bin/train.py, lines 106 to 118 at 5ec3ab9:

```python
def run(self):
    try:
        logging.info(f"The current training step: {self.trainer.steps}")
        self.trainer.train_max_steps = self.config["train_max_steps"]
        if not self.trainer._check_train_finish():
            self.trainer.run()
        if self.config.get("adv_train_max_steps", False) and self.config.get("adv_batch_length", False):
            self.batch_length = self.config['adv_batch_length']
            logging.info(f"Reload dataloader for adversarial training.")
            self.initialize_data_loader()
            self.trainer.data_loader = self.data_loader
            self.trainer.train_max_steps = self.config["adv_train_max_steps"]
            self.trainer.run()
```

I suspect the denoising process doesn't use the adversarial parameters such as adv_train_max_steps or adv_batch_length, because I couldn't find them in the configuration file config/denoise/symAD_vctk_48000_hop300.yaml:
```yaml
start_steps:                # Number of steps to start training
    generator: 0
    discriminator: 200000
train_max_steps: 200000     # Number of training steps.
save_interval_steps: 100000 # Interval steps to save checkpoint.
eval_interval_steps: 1000   # Interval steps to evaluate the network.
log_interval_steps: 100     # Interval steps to record the training log.
```
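The behavior in `run()` can be reduced to a few lines: the second call to `trainer.run()` is silently skipped whenever either adversarial key is missing from the config, which matches the "no response" symptom. A minimal sketch (the numeric values below are illustrative, not the repo's defaults):

```python
# Minimal reproduction of the stage-1 gate in bin/train.py run():
# the adversarial stage only starts when BOTH keys exist in the config.
def should_run_adv_stage(config):
    return bool(config.get("adv_train_max_steps", False)) and \
           bool(config.get("adv_batch_length", False))

config = {"train_max_steps": 200000}           # no adv_* keys, as in the yaml above
print(should_run_adv_stage(config))            # False -> stage 1 is skipped

config.update({"adv_train_max_steps": 700000,  # illustrative values
               "adv_batch_length": 9600})
print(should_run_adv_stage(config))            # True -> stage 1 would run
```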


bigpon commented Oct 28, 2024

Hi,
there is a typo.
To run the denoising process, you first have to update the encoder while keeping the codebook and decoder fixed.
I have updated the README; please follow the steps there.
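A minimal PyTorch sketch of what "update the encoder while fixing the codebook and decoder" can look like; the module names here are illustrative stand-ins, not the repo's actual class layout:

```python
import torch.nn as nn

# Toy stand-in for the AudioDec generator (illustrative module names).
class TinyCodec(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(8, 4)
        self.quantizer = nn.Linear(4, 4)  # stands in for projector + codebook
        self.decoder = nn.Linear(4, 8)

model = TinyCodec()

# Freeze the codebook and decoder; only the encoder receives gradients.
for frozen in (model.quantizer, model.decoder):
    frozen.eval()                    # also stops e.g. codebook EMA updates
    for p in frozen.parameters():
        p.requires_grad = False

trainable = sorted({n.split(".")[0] for n, p in model.named_parameters()
                    if p.requires_grad})
print(trainable)  # ['encoder']
```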


a897456 commented Oct 29, 2024

```sh
# stage 0
if echo ${stage} | grep -q 0; then
    echo "Denoising Training"
    config_name="config/${encoder}.yaml"
    echo "Configuration file="$config_name
    python codecTrain.py \
        -c ${config_name} \
        --tag ${encoder} \
        --exp_root ${exp} \
        --disable_cudnn ${disable_cudnn}
fi
```

```yaml
model_type: symAudioDec
train_mode: denoise
initial: exp/autoencoder/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl # for model initialization
```

AudioDec/codecTrain.py, lines 239 to 255 at 9cc4e58:

```python
# MODEL INITIALIZATION
def initialize_model(self):
    initial = self.config.get("initial", "")
    if os.path.exists(self.resume):  # resume from trained model
        self.trainer.load_checkpoint(self.resume)
        logging.info(f"Successfully resumed from {self.resume}.")
    elif os.path.exists(initial):  # initial new model with the pre-trained model
        self.trainer.load_checkpoint(initial, load_only_params=True)
        logging.info(f"Successfully initialize parameters from {initial}.")
    else:
        logging.info("Train from scrach")
    # load the pre-trained encoder for vocoder training
    if self.train_mode in ['vocoder']:
        analyzer_checkpoint = self.config.get("analyzer", "")
        assert os.path.exists(analyzer_checkpoint), f"Analyzer {analyzer_checkpoint} does not exist!"
        analyzer_config = self._load_config(analyzer_checkpoint)
        self._initialize_analyzer(analyzer_config, analyzer_checkpoint)
```

I executed stage 0 according to submit_denoise.sh, but I found that during stage 0 the configuration file loads exp/autoencoder/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl as initial. Do I need to train this checkpoint in advance (for the new dataset)?


a897456 commented Oct 29, 2024

Hi @bigpon
Could you help me check whether my understanding is correct? Thanks.

  1. First, following config/autoencoder/symAD_vctk_48000_hop300.yaml, train the autoencoder on clean speech for 200k steps to obtain exp/autoencoder/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl.
  2. Then, following config/denoise/symAD_vctk_48000_hop300.yaml, use the checkpoint from step 1 as initial and run 200k steps of denoise training on the paired clean and noisy speech to obtain exp/denoise/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl.

At this point the denoising training is complete. The testing process is then:
  1. In codecTest.py, set encoder = decoder = exp/denoise/symAD_vctk_48000_hop300/checkpoint-200000steps.pkl to run the test.


bigpon commented Oct 29, 2024

Hi, in the first step, you have to train the decoder for another 500k iterations with the GAN.

In the final step, you should take the decoder from the one trained with the GAN.


a897456 commented Nov 3, 2024

Hi @bigpon
I carried out the denoising process as you suggested, but when I tested the PESQ score of the output audio it was only 1.6. Listening to it, I also subjectively felt it was just so-so. The screenshot below shows the denoising process. Do you have any suggestions for improving the result? Thank you.
[screenshot attached]


a897456 commented Nov 3, 2024

Hi bigpon,
My idea was to add discriminator training in denoise.py, imitating the method in autoencoder.py. I actually did it this way, but the results still didn't improve.
[screenshot attached]


bigpon commented Nov 4, 2024

Because of the phase misalignment issue (you can check our paper ScoreDec), AudioDec usually achieves a low PESQ score even when the input is clean speech. Using a multi-resolution mel loss can improve the PESQ, but it still cannot reach a very high score.

For perceptual quality, although the PESQ score is low, the quality should be OK.

However, since it is just a simple approach to update only the encoder, it only achieves an OK performance, which still falls behind the SOTA speech enhancement methods.
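For reference, a generic sketch of a multi-resolution spectral loss of the kind mentioned above; the FFT sizes and hop factor here are illustrative choices, not AudioDec's actual training settings:

```python
import torch

def multi_res_stft_loss(x, y, fft_sizes=(512, 1024, 2048)):
    """L1 distance between log-magnitude spectrograms at several resolutions.

    Illustrative sketch only: FFT sizes and the n_fft // 4 hop are assumptions.
    """
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4,
                       window=window, return_complex=True).abs()
        # Small epsilon keeps the log finite for silent bins.
        loss = loss + (torch.log(X + 1e-7) - torch.log(Y + 1e-7)).abs().mean()
    return loss / len(fft_sizes)

x = torch.randn(1, 48000)  # one second of audio at 48 kHz
print(multi_res_stft_loss(x, x).item())  # identical signals -> 0.0
```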


a897456 commented Nov 5, 2024

> Because of the phase misaligned issue (you can check our paper ScoreDec), AudioDec usually achieves low PESQ even when the input is clean speech. Using multi-resolution mel-loss can improve the PESQ but it still cannot achieve a very high PESQ score.

Hi @bigpon
1. When is ScoreDec expected to be open-sourced?
2. Can the phase problem be compensated by setting use_shape_loss=true? I see that this value is always false in the configuration files.


bigpon commented Nov 5, 2024

Hi,

  1. We don't have any plan to release ScoreDec, since people can easily train the post-filter model from this repo: https://github.com/sp-uhh/sgmse . That is, once you prepare AudioDec-coded and natural speech pairs as the noisy and clean pairs, you can train an sgmse-based postfilter. I have also used sgmse for denoising, and it works well. Therefore, I recommend you use your currently trained AudioDec (w/o the GAN training part, i.e. only the 1st stage) to prepare noisy-clean speech pairs, and then train an sgmse model on these pairs. After that, you get a high-quality denoising codec (the phase is also aligned well). The only problem is that inference is very slow because of the sgmse model.

  2. No. The shape loss mostly improves loudness modeling; it cannot improve phase modeling.


a897456 commented Nov 7, 2024

> Therefore, I recommend you use your current trained AudioDec (w/o the GAN training part, i.e. only the 1st stage) to prepare noisy-clean speech pairs, and then train a sgmse model with these pairs.

Hi @bigpon
When preparing the noisy-clean speech pairs, should the new noisy speech obtained by passing the original noisy speech through AudioDec (w/o the GAN training part, i.e. only the 1st stage) be paired with the original clean speech? Or do both the original noisy speech and the original clean speech need to go through AudioDec?


bigpon commented Nov 7, 2024

Hi, in this case, we want the postfilter to do two things.

  1. remove the noise
  2. compensate the codec distortion

Therefore, the target speech is the clean speech without any processing (i.e. the ground truth).
The noisy/input speech can be:
Type I: noisy speech processed by the 1st-stage AudioDec (suffering from both noise and codec distortion)
Type II: clean speech processed by the 1st-stage AudioDec (suffering from only codec distortion)

I have tried training the postfilter with only Type I, and with Type I + II.
On noisy speech, their performances are similar.
On clean speech, the model trained with I + II is better.

Therefore, I suggest you prepare both (Type-I, clean_speech) and (Type-II, clean_speech) pairs to train the postfilter.
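The suggested pairing can be sketched as follows; the directory names are hypothetical placeholders for wherever the ground truth and the 1st-stage AudioDec outputs are stored:

```python
from pathlib import Path

# Hypothetical directory layout (adjust to your corpus):
clean_dir = Path("corpus/clean")           # ground-truth targets
type1_dir = Path("corpus/noisy_audiodec")  # noisy speech -> 1st-stage AudioDec
type2_dir = Path("corpus/clean_audiodec")  # clean speech -> 1st-stage AudioDec

def make_pairs(input_files, target_dir):
    """Pair each processed file with the clean target of the same name."""
    return [(f, target_dir / f.name) for f in input_files]

# Both Type-I and Type-II inputs map to the same unprocessed clean targets.
pairs = (make_pairs(sorted(type1_dir.glob("*.wav")), clean_dir) +
         make_pairs(sorted(type2_dir.glob("*.wav")), clean_dir))
```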


a897456 commented Nov 10, 2024

[screenshot attached]
Hi @bigpon
I reorganized the dataset according to your suggestions and then trained SGMSE with all default settings. The purple PESQ curve is the unprocessed dataset; the green PESQ curve is the dataset processed by AudioDec (both clean and noisy speech). However, the upward trend of the PESQ curve seems to have stalled.
Perhaps SGMSE requires some specific settings, but I have been using the defaults throughout. I will post updated results here; in the meantime, if you can see where the problem lies, please let me know.


a897456 commented Nov 11, 2024

Hi @bigpon
Is SGMSE already obsolete? I see that the PESQ scores of many speech enhancement models have already reached 3.6.
[screenshots attached]


a897456 commented Nov 12, 2024

Hi @bigpon
[screenshot attached]
The PESQ curve is still quite poor. I think there is some problem with my settings, but I haven't managed to find the right ones. Could you please share the parameters you used at the time, including the backbone and SDE settings? I would be extremely grateful.


a897456 commented Nov 12, 2024

Hi @bigpon
[screenshot attached]
Are you using the M6 settings, or something else? Could you share them?
