
Commit 006dad3

Finish documentation for v2.2.0 release (#156)
* Finish ConfigurationSchemas.md for v2.2.0 release
* Add detailed instructions for shallow diffusion
* Update link to DiffScope
1 parent 961be9a commit 006dad3

3 files changed: +309 -17 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@ This is a refactored and enhanced version of _DiffSinger: Singing Voice Synthesi
- **Dataset creation pipelines & tools**: See [MakeDiffSinger](https://github.com/openvpi/MakeDiffSinger)
- **Best practices & tutorials**: See [Best Practices](docs/BestPractices.md)
- **Editing configurations**: See [Configuration Schemas](docs/ConfigurationSchemas.md)
-- **Deployment & production**: [OpenUTAU for DiffSinger](https://github.com/xunmengshe/OpenUtau), [DiffScope (under development)](https://github.com/SineStriker/qsynthesis-revenge)
+- **Deployment & production**: [OpenUTAU for DiffSinger](https://github.com/xunmengshe/OpenUtau), [DiffScope (under development)](https://github.com/openvpi/diffscope)
- **Communication groups**: [QQ Group](http://qm.qq.com/cgi-bin/qm/qr?_wv=1027&k=fibG_dxuPW5maUJwe9_ya5-zFcIwaoOR&authKey=ZgLCG5EqQVUGCID1nfKei8tCnlQHAmD9koxebFXv5WfUchhLwWxb52o1pimNai5A&noverify=0&group_code=907879266) (907879266), [Discord server](https://discord.gg/wwbu2JUMjj)

## Progress & Roadmap

docs/BestPractices.md

Lines changed: 89 additions & 2 deletions
@@ -211,16 +211,103 @@ pe_ckpt: checkpoints/rmvpe/model.pt
[Harvest](https://github.com/mmorise/World) (Harvest: A high-performance fundamental frequency estimator from speech signals) is the recommended pitch extractor from Masanori Morise's WORLD, a free software package for high-quality speech analysis, manipulation and synthesis. It is a state-of-the-art algorithmic pitch estimator designed for speech, but it has also seen use in singing voice synthesis. It is the slowest of the available extractors, but it provides very accurate F0 on clean, normal recordings compared to parselmouth.

To use Harvest, simply include the following line in your configuration file:

```yaml
pe: harvest
```

**Note:** It is also recommended to adjust the F0 detection range for Harvest in accordance with your dataset, since these values are hard boundaries for this algorithm and the defaults may not suit every use case. To change the F0 detection range, include or edit this part of the configuration file:

```yaml
f0_min: 65 # Minimum F0 to detect
f0_max: 800 # Maximum F0 to detect
```

## Shallow diffusion

Shallow diffusion is a mechanism, first introduced in the original DiffSinger [paper](https://arxiv.org/abs/2105.02446), that can improve quality and save inference time for diffusion models. Instead of starting the reverse diffusion process from purely Gaussian noise as classic diffusion does, shallow diffusion adds shallow Gaussian noise to a low-quality result generated by a simple network (called the auxiliary decoder), thus skipping many unnecessary steps at the beginning. By combining shallow diffusion with sampling acceleration algorithms, we can get better results at the same inference speed as before, or achieve higher inference speed without quality deterioration.
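
In rough pseudocode, the idea looks like the sketch below. This is only an illustration of the mechanism, assuming a standard DDPM noise schedule with cumulative products `alpha_bar` and a trained denoiser `denoise_fn`; the names and shapes are assumptions, not the repository's actual API.

```python
import torch

def shallow_diffusion_infer(aux_mel, denoise_fn, alpha_bar, K):
    """Illustrative sketch of shallow diffusion inference.

    aux_mel:    low-quality mel-spectrogram from the auxiliary decoder
    denoise_fn: trained denoiser performing one reverse diffusion step
    alpha_bar:  cumulative products of (1 - beta_t) for the noise schedule
    K:          shallow diffusion depth (K_step_infer), K <= total steps
    """
    # Diffuse the auxiliary output forward to step K instead of starting
    # from pure Gaussian noise at the last step of the schedule.
    noise = torch.randn_like(aux_mel)
    x = alpha_bar[K - 1].sqrt() * aux_mel + (1 - alpha_bar[K - 1]).sqrt() * noise

    # Denoise only the last K steps (classic diffusion would run all steps).
    for t in reversed(range(K)):
        x = denoise_fn(x, t)  # one reverse diffusion step (details omitted)
    return x
```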

Currently, acoustic models in this repository support shallow diffusion. The main switch for shallow diffusion is `use_shallow_diffusion` in the configuration file, and most shallow diffusion arguments can be adjusted under `shallow_diffusion_args`. See [Configuration Schemas](ConfigurationSchemas.md) for more details.

### Train full shallow diffusion models from scratch

To train a full shallow diffusion model from scratch, simply introduce the following settings in your configuration file:

```yaml
use_shallow_diffusion: true
K_step: 400 # adjust according to your needs
K_step_infer: 400 # should be <= K_step
```

Please note that when shallow diffusion is enabled, only the last $K$ diffusion steps will be trained. Unlike classic diffusion models, which are trained on all steps, limiting `K_step` can make training more efficient. However, `K_step` should not be set too small, because without enough diffusion depth (steps), the low-quality auxiliary decoder results cannot be refined well. 200 ~ 400 is the proper range for `K_step`.
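
In concrete terms, this limit means the training timestep is sampled only from the last `K_step` steps instead of the whole schedule. The snippet below is a minimal illustration under that assumption; the variable names are not taken from the repository's code.

```python
import torch

T = 1000        # total diffusion steps of the noise schedule
K_step = 400    # shallow diffusion depth used for training
batch_size = 8

# Classic diffusion: sample t uniformly from the whole schedule [0, T).
t_classic = torch.randint(0, T, (batch_size,))

# Shallow diffusion: only the last K_step steps are ever trained,
# so t is sampled from [0, K_step) instead.
t_shallow = torch.randint(0, K_step, (batch_size,))
```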

The auxiliary decoder and the diffusion decoder share the same linguistic encoder, which receives gradients from both decoders. In some experiments, it was found that gradients from the auxiliary decoder can cause a mismatch between the encoder and the diffusion decoder, leaving the latter unable to produce reasonable results. To prevent this, a configuration item called `aux_decoder_grad` applies a scale factor to the gradients from the auxiliary decoder during training. To adjust this factor, introduce the following in the configuration file:

```yaml
shallow_diffusion_args:
  aux_decoder_grad: 0.1 # should not be too high
```
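
One common way to implement such a gradient scale, shown here purely as an illustration of the idea rather than the repository's actual code, is to mix a tensor with its detached copy so the forward value is unchanged while the backward gradient is scaled:

```python
import torch

def scale_gradient(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Forward pass returns x unchanged; backward pass multiplies the
    # incoming gradient by `scale` (0.0 blocks it, 1.0 passes it fully).
    return x * scale + x.detach() * (1.0 - scale)

# Usage sketch (hypothetical names): feed the encoder output to the
# auxiliary decoder through this wrapper so only a fraction of its
# gradients reaches the shared linguistic encoder.
# aux_input = scale_gradient(encoder_output, scale=0.1)
```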

### Train auxiliary decoder and diffusion decoder separately

Training a full shallow diffusion model can consume more memory because the auxiliary decoder is also in the training graph. When memory is limited, the two decoders can be trained separately, i.e. one decoder after the other.

**STEP 1: train the diffusion decoder**

In the first stage, the linguistic encoder and the diffusion decoder are trained together, while the auxiliary decoder is left unchanged. Edit your configuration file like this:

```yaml
use_shallow_diffusion: true # make sure the main option is turned on
shallow_diffusion_args:
  train_aux_decoder: false # exclude the auxiliary decoder from the training graph
  train_diffusion: true # train diffusion decoder as normal
  val_gt_start: true # should be true because the auxiliary decoder is not trained yet
```

Start training until `max_updates` is reached, or until you get satisfactory results in TensorBoard.

**STEP 2: train the auxiliary decoder**

In the second stage, the auxiliary decoder is trained, while the linguistic encoder and the diffusion decoder are left unchanged. Edit your configuration file like this:

```yaml
shallow_diffusion_args:
  train_aux_decoder: true
  train_diffusion: false # exclude the diffusion decoder from the training graph
lambda_aux_mel_loss: 1.0 # no more need to limit the auxiliary loss
```

Then you should freeze the encoder to prevent it from getting updates. This is because if the encoder changes, it no longer matches the diffusion decoder, making the latter unable to produce correct results again. Edit your configuration file:

```yaml
freezing_enabled: true
frozen_params:
  - model.fs2 # the linguistic encoder
```
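
For intuition, freezing by parameter-name prefix typically amounts to something like the following sketch. It is illustrative only; the option names above are what the configuration exposes, while the function below is a hypothetical helper.

```python
import torch.nn as nn

def freeze_params(module: nn.Module, frozen_prefixes: list[str]) -> None:
    # Disable gradients for every parameter whose name starts with one of
    # the given prefixes, e.g. "model.fs2" for the linguistic encoder.
    for name, param in module.named_parameters():
        if any(name.startswith(prefix) for prefix in frozen_prefixes):
            param.requires_grad = False
```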

You should also manually reset your learning rate scheduler, because this is a new training process for the auxiliary decoder. Possible ways are listed below (a sketch of the first option follows this list):

1. Rename the latest checkpoint to `model_ckpt_steps_0.ckpt` and remove the other checkpoints from the directory.
2. Increase the initial learning rate (if you use a scheduler that decreases the LR over training steps) so that the auxiliary decoder gets a proper learning rate.
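
A minimal sketch of the first option, assuming the checkpoint naming scheme shown above and a hypothetical experiment directory (adjust the path to your own setup):

```python
from pathlib import Path

ckpt_dir = Path("checkpoints/my_experiment")  # hypothetical experiment directory

# Find the checkpoint with the highest step count, e.g. model_ckpt_steps_160000.ckpt.
ckpts = sorted(ckpt_dir.glob("model_ckpt_steps_*.ckpt"),
               key=lambda p: int(p.stem.rsplit("_", 1)[-1]))
latest = ckpts[-1]

# Remove the older checkpoints, then rename the latest one to step 0 so the
# resumed training (and its LR scheduler) restarts from the beginning.
for old in ckpts[:-1]:
    old.unlink()
latest.rename(ckpt_dir / "model_ckpt_steps_0.ckpt")
```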

Additionally, `max_updates` should be adjusted to ensure enough training steps for the auxiliary decoder.

Once you have finished the configuration above, you can resume the training. The auxiliary decoder normally does not need many steps to train, and you can stop training once you get stable results in TensorBoard. Because this step is much more complicated than the previous one, it is recommended to run some inference to verify that the model is trained properly after everything is finished.

### Add shallow diffusion to classic diffusion models

Actually, all classic DDPMs have the ability to be "shallow". If you want to add shallow diffusion functionality to a former classic diffusion model, the only thing you need to do is train an auxiliary decoder for it.

Before you start, you should edit the configuration file to ensure that you use the same datasets, and that you do not remove or add any functionality of the old model. Then you can configure the old checkpoint in your configuration file:

```yaml
finetune_enabled: true
finetune_ckpt_path: xxx.ckpt # path to your old checkpoint
finetune_ignored_params: [] # do not ignore any parameters
```

Then you can follow the instructions in STEP 2 of the [previous section](#train-auxiliary-decoder-and-diffusion-decoder-separately) to finish your training.

## Performance tuning

This section is about accelerating training and utilizing hardware.
