[Harvest](https://github.com/mmorise/World) (Harvest: A high-performance fundamental frequency estimator from speech signals) is the recommended pitch extractor. It comes from Masanori Morise's WORLD, free software for high-quality speech analysis, manipulation and synthesis. Harvest is a state-of-the-art algorithmic pitch estimator designed for speech, but it has also seen use in singing voice synthesis. It is the slowest of the supported extractors, but it produces very accurate F0 on clean, normal recordings compared to parselmouth.

To use Harvest, simply include the following line in your configuration file:
```yaml
pe: harvest
```
**Note:** It is also recommended to adjust the F0 detection range for Harvest in accordance with your dataset, since these values are hard boundaries for the algorithm and the defaults may not suit every use case. To change the F0 detection range, include or edit this part of the configuration file:

```yaml
f0_min: 65 # Minimum F0 to detect
f0_max: 800 # Maximum F0 to detect
```
## Shallow diffusion
Shallow diffusion is a mechanism that can improve quality and save inference time for diffusion models, first introduced in the original DiffSinger [paper](https://arxiv.org/abs/2105.02446). Instead of starting the diffusion process from pure Gaussian noise as classic diffusion does, shallow diffusion adds shallow Gaussian noise to a low-quality result generated by a simple network (called the auxiliary decoder), thereby skipping many unnecessary steps at the beginning. By combining shallow diffusion with sampling acceleration algorithms, we can get better results at the same inference speed as before, or achieve higher inference speed without quality deterioration.
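For intuition, here is a minimal PyTorch-style sketch of that idea. The names (`aux_decoder`, `denoise_step`, and `alphas_cumprod` as the tensor of cumulative noise-schedule products) are hypothetical placeholders, not this repository's actual API:

```python
import torch

def shallow_diffusion_infer(encoder_out, aux_decoder, denoise_step, k_step, alphas_cumprod):
    """Conceptual sketch of shallow diffusion inference (illustrative names only)."""
    # 1. Let the simple auxiliary decoder produce a rough mel-spectrogram.
    x0 = aux_decoder(encoder_out)

    # 2. Diffuse it forward to step k_step, i.e. add "shallow" Gaussian noise
    #    according to q(x_k | x_0), instead of starting from pure noise.
    noise = torch.randn_like(x0)
    a_k = alphas_cumprod[k_step - 1]
    x = a_k.sqrt() * x0 + (1.0 - a_k).sqrt() * noise

    # 3. Denoise only the last k_step steps instead of the full schedule.
    for t in reversed(range(k_step)):
        x = denoise_step(x, t, encoder_out)
    return x
```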
Currently, acoustic models in this repository support shallow diffusion. The main switch for shallow diffusion is `use_shallow_diffusion` in the configuration file, and most shallow diffusion arguments can be adjusted under `shallow_diffusion_args`. See [Configuration Schemas](ConfigurationSchemas.md) for more details.
### Train full shallow diffusion models from scratch
To train a full shallow diffusion model from scratch, simply introduce the following settings in your configuration file:
```yaml
use_shallow_diffusion: true
K_step: 400 # adjust according to your needs
K_step_infer: 400 # should be <= K_step
```
Please note that when shallow diffusion is enabled, only the last $K$ diffusion steps will be trained. Unlike classic diffusion models, which are trained on all steps, limiting `K_step` can make training more efficient. However, `K_step` should not be set too small, because without enough diffusion depth (steps) the low-quality auxiliary decoder results cannot be refined well. 200 ~ 400 is a proper range for `K_step`.
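As a rough illustration (a hypothetical sketch, not the actual training loop), restricting training to the last `K_step` steps means the diffusion timestep is only sampled from that shallow region, so the model is never asked to denoise from pure noise:

```python
import torch

K_step = 400     # diffusion depth to train, assumed to match the config above
batch_size = 8   # arbitrary example value

# Classic diffusion would sample t from the full schedule [0, total_steps);
# with shallow diffusion, only the last K_step steps are sampled.
t = torch.randint(low=0, high=K_step, size=(batch_size,))
```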
The auxiliary decoder and the diffusion decoder share the same linguistic encoder, which receives gradients from both decoders. In some experiments, it was found that gradients from the auxiliary decoder can cause a mismatch between the encoder and the diffusion decoder, leaving the latter unable to produce reasonable results. To prevent this, a configuration item called `aux_decoder_grad` applies a scale factor to the gradients from the auxiliary decoder during training. To adjust this factor, introduce the following in the configuration file:
```yaml
shallow_diffusion_args:
  aux_decoder_grad: 0.1 # should not be too high
```
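For intuition, such a gradient scale factor is commonly implemented with a detach trick. The following PyTorch-style sketch is illustrative only, not the repository's exact code:

```python
import torch

def scale_grad(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Forward pass returns x unchanged; backward pass multiplies the
    # gradient flowing back through this point by `scale`.
    return x * scale + x.detach() * (1.0 - scale)

# Hypothetical usage: feed the auxiliary decoder a gradient-scaled copy of the
# encoder output, so the encoder is mostly driven by the diffusion decoder:
# aux_mel = aux_decoder(scale_grad(encoder_out, scale=0.1))
```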
### Train auxiliary decoder and diffusion decoder separately
Training a full shallow diffusion model can consume more memory because the auxiliary decoder is also part of the training graph. In resource-limited situations, the two decoders can be trained separately, i.e. one decoder after the other.
**STEP 1: train the diffusion decoder**
In the first stage, the linguistic encoder and the diffusion decoder are trained together, while the auxiliary decoder is left unchanged. Edit your configuration file like this:
```yaml
use_shallow_diffusion: true # make sure the main option is turned on
shallow_diffusion_args:
  train_aux_decoder: false # exclude the auxiliary decoder from the training graph
  train_diffusion: true # train diffusion decoder as normal
  val_gt_start: true # should be true because the auxiliary decoder is not trained yet
```
Start training until `max_updates` is reached, or until you get satisfactory results in TensorBoard.
**STEP 2: train the auxiliary decoder**
In the second stage, only the auxiliary decoder is trained, while the linguistic encoder and the diffusion decoder are left untouched. Edit your configuration file like this:
```yaml
shallow_diffusion_args:
  train_aux_decoder: true
  train_diffusion: false # exclude the diffusion decoder from the training graph
lambda_aux_mel_loss: 1.0 # no more need to limit the auxiliary loss
```
Then you should freeze the encoder to prevent it from being updated. If the encoder changes, it no longer matches the diffusion decoder, which would again leave the latter unable to produce correct results. Edit your configuration file:
```yaml
freezing_enabled: true
frozen_params:
  - model.fs2 # the linguistic encoder
```
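Under the hood, prefix-based freezing of this kind typically just disables gradients for the matching parameters. A minimal sketch of the concept (hypothetical, not the repository's exact implementation):

```python
import torch.nn as nn

def freeze_by_prefix(model: nn.Module, prefixes: list[str]) -> None:
    # Disable gradient updates for every parameter whose dotted name
    # starts with one of the given prefixes (e.g. "model.fs2").
    for name, param in model.named_parameters():
        if any(name.startswith(prefix) for prefix in prefixes):
            param.requires_grad_(False)
```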
You should also manually reset your learning rate scheduler because this is a new training process for the auxiliary decoder. Possible ways are:
1. Rename the latest checkpoint to `model_ckpt_steps_0.ckpt` and remove the other checkpoints from the directory.
2. Increase the initial learning rate (if you use a scheduler that decreases the LR over training steps) so that the auxiliary decoder gets a proper learning rate.
Additionally, `max_updates` should be adjusted to ensure enough training steps for the auxiliary decoder.
Once you have finished the configuration above, you can resume training. The auxiliary decoder normally does not need many steps to train, and you can stop training when you get stable results in TensorBoard. Because this step is much more complicated than the previous one, it is recommended to run some inference after everything is finished to verify that the model is trained properly.
### Add shallow diffusion to classic diffusion models
Actually, all classic DDPMs have the ability to be "shallow". If you want to add shallow diffusion functionality to a previously trained classic diffusion model, the only thing you need to do is train an auxiliary decoder for it.
Before you start, you should edit the configuration file to ensure that you use the same datasets, and that you do not remove or add any of the functionalities of the old model. Then you can configure the old checkpoint in your configuration file:
```yaml
finetune_enabled: true
finetune_ckpt_path: xxx.ckpt # path to your old checkpoint
finetune_ignored_params: [] # do not ignore any parameters
```
Then you can follow the instructions in STEP 2 of the [previous section](#train-auxiliary-decoder-and-diffusion-decoder-separately) to finish your training.
## Performance tuning
This section is about accelerating training and utilizing hardware.