update english doc
prophesier committed Dec 11, 2022
1 parent 9698dc9 commit a037277
Showing 2 changed files with 235 additions and 12 deletions.
README.md — 16 additions, 12 deletions

Singing Voice Conversion via diffusion model
2022.10.24: Added support for training on custom datasets and streamlined the code\
2022.10.22: Completed training on the opencpop dataset and created this repository

## Notes
>This project was established for academic exchange purposes and is not intended for production environments. We are not responsible for any copyright issues arising from audio produced by this project's models.\
If you redistribute the code in this repository or publicly publish any results produced with this project (including but not limited to uploads to video-sharing sites), please credit the original author and the source code (this repository).\
If you use this project in any other project, please contact and inform the author of this repository in advance. Thank you very much.

## Inference

>See ./inference.ipynb

## Preprocessing
```
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python preprocessing/binarize.py --config training/config.yaml
```
## Training
```
CUDA_VISIBLE_DEVICES=0 python run.py --config training/config.yaml --exp_name [your project name] --reset
```
For a detailed description of the training process and the various parameters, please refer to the [training and inference guide](./doc/train_and_inference.markdown) (Chinese) or its [English translation](./doc/training_and_inference_EN.markdown). Thanks to @ρoem for providing the translation.
### Trained models
>This project has been trained and tested on many datasets. Some ckpt files, demo audio, and other files needed for inference and training can be downloaded from the QQ channel below. Scan this QR code with QQ to join (if you cannot join, try a different network environment):\
<img src="./ckpt.jpg" width=256/>\
For English support, you can join this Discord server:

[![Discord](https://img.shields.io/discord/1044927142900809739?color=%23738ADB&label=Discord&style=for-the-badge)](https://discord.gg/jvA5c2xzSE)

## Acknowledgements
>This project is based on [diffsinger](https://github.com/MoonInTheRiver/DiffSinger), [diffsinger (openvpi maintained version)](https://github.com/openvpi/DiffSinger), and [soft-vc](https://github.com/bshall/soft-vc).\
We would also like to thank the openvpi members for their help during development and training.\
>Note: this project has no connection with the paper of the same name, [DiffSVC](https://arxiv.org/abs/2105.13871). Please do not confuse the two!
doc/training_and_inference_EN.markdown — 219 additions, 0 deletions (new file)
# Diff-SVC (training and inference by yourself)
## 0. Setting up the environment
>Notice: The requirements files have been updated, and there are now three versions to choose from.
1. `requirements.txt` lists the complete environment used during development and testing, including torch 1.12.1+cu113. You can install it directly with pip, or remove the PyTorch-related packages (torch/torchvision) from it, install the rest with pip, and use your own torch environment.
```
pip install -r requirements.txt
```
2. **(Recommended)**: `requirements_short.txt` is a manually curated version of the above that does not include torch itself. You can install it with the command below:
```
pip install -r requirements_short.txt
```
3. `requirements.png` in the project's root directory is a requirements list compiled by @三千, tested on a particular brand of cloud server. Its torch version is no longer compatible with the latest code, but the versions of the other packages can still be used as a reference.
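
After installing the dependencies, it may be worth confirming that torch sees your GPU before moving on. The snippet below is not part of the repository; it only uses standard torch calls:
```
import torch

print("torch version:", torch.__version__)      # e.g. 1.12.1+cu113
print("CUDA build:", torch.version.cuda)        # e.g. 11.3
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```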

## 1. Inference
>You can use `inference.ipynb` in the project's root directory, or use `infer.py` (written by @IceKyrin and adapted by the author) for inference.\
Edit the parameters below in the first code cell:
```
config_path= 'location of config.yaml in the checkpoints archive'
# E.g.: './checkpoints/nyaru/config.yaml'
# The config and the checkpoint correspond one-to-one. Please do not use a different config file.
project_name='name of the current project'
# E.g.: 'nyaru'
model_path='full path to the ckpt file'
# E.g.: './checkpoints/nyaru/model_ckpt_steps_112000.ckpt'
hubert_gpu=True
# Whether or not to use GPU for HuBERT (a module in the model) during inference. It will not affect any other parts of the model.
# The current version significantly reduces the GPU memory needed to run the HuBERT module during inference. Full inference fits on a 1060 6G GPU, so there is normally no need to turn this off.
# Also, auto-slice of long audio is now supported (both inference.ipynb and infer.py support this). Audio longer than 30 seconds will be automatically sliced at silences, thanks to @IceKyrin's code.
```
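
As a quick illustration of how these paths fit together, here is a hypothetical sanity check (not part of inference.ipynb) that verifies the config and checkpoint exist before running inference:
```
# Hypothetical path check; not code from this repository.
import os

project_name = 'nyaru'                                            # example project
config_path = f'./checkpoints/{project_name}/config.yaml'
model_path = f'./checkpoints/{project_name}/model_ckpt_steps_112000.ckpt'

for path in (config_path, model_path):
    if not os.path.exists(path):
        raise FileNotFoundError(f'missing file: {path}')
# Remember: the config must be the one shipped with this exact checkpoint.
```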
### Adjustable parameters:
```
wav_fn='xxx.wav'
# The path to the input audio. The default path is in the project's root directory.
use_crepe=True
# CREPE is an F0 extraction algorithm. It has good performance but is slow. Changing this to False will use the slightly inferior but much faster Parselmouth algorithm.
thre=0.05
# CREPE's noise filtering threshold. It can be increased if the input audio is clean, but if the input audio is noisy, keep this value or decrease it. This parameter will have no effect if the previous parameter is set to False.
pndm_speedup=20
# Inference acceleration multiplier. The default number of diffusion steps is 1000, so setting this value to 10 means synthesizing in 100 steps. The default, 20, is a moderate value. This value can go up to 50x (synthesizing in 20 steps) without an obvious loss in quality, but anything higher may cause a significant quality loss. Note: if use_gt_mel below is enabled, make sure this value is lower than add_noise_step. The total number of diffusion steps should also be evenly divisible by this value (see the arithmetic sketch after this block).
key=0
# Transpose parameter. The default value is 0 (NOT 1!!). The pitch from the input audio will be shifted by {key} semitones, then synthesized. For example, to change a male voice to a female voice, this value can be set to 8 or 12, etc. (12 is to shift a whole octave up).
use_pe=True
# F0 extraction algorithm for synthesizing audio from the Mel spectrogram. Changing this to False will use the input audio's F0.
# There is a slight difference in results between using True and False. Usually, setting it to True is better, but not always. It has almost no effect on the synthesizing speed.
# (This parameter can be changed freely regardless of the key value and does not affect it.)
# This function is not supported for 44.1kHz models and will be turned off automatically; leaving it on will not cause any errors either.
use_gt_mel=False
# This option is similar to the image-to-image function in AI painting. If set to True, the output audio will be a mix of the input and target speaker's voices, with the mix ratio determined by the parameter below.
# NOTE!!!: If this parameter is set to true, make sure the key parameter is set to 0 since transposing is not supported here.
add_noise_step=500
# Related to the previous parameter, it controls the ratio of the input and target voice. A value of 1 will be entirely the input voice, and a value of 1000 will be entirely the target voice. A value of around 300 will result in a roughly equal mixture of the two. (This value is not linear; if this parameter is set to a very low value, you can lower pndm_speedup for higher synthesis quality)
wav_gen='yyy.wav'
# The path to the output audio. The default is in the project's root directory. The file type can be changed by changing the file extension here.
```
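
The numeric relationships between pndm_speedup, the 1000 diffusion steps, add_noise_step, and key can be summarized as follows; this is only an illustration of the arithmetic described above, not code from the repository:
```
# Illustrative arithmetic only; assumes the default 1000 diffusion steps.
K_step = 1000
pndm_speedup = 20
add_noise_step = 500
key = 8

# The total number of diffusion steps must be evenly divisible by the speedup.
assert K_step % pndm_speedup == 0
print(K_step // pndm_speedup)        # 50 sampling steps at 20x speedup

# Transposing by `key` semitones scales F0 by 2 ** (key / 12).
print(round(2 ** (key / 12), 3))     # ~1.587, i.e. shifted up 8 semitones

# With use_gt_mel=True, keep pndm_speedup below add_noise_step.
assert pndm_speedup < add_noise_step
```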

If using infer.py, the way to change parameters is similar: change the values inside the `if __name__ == '__main__'` block, then run `python infer.py` in the project's root directory.
This method requires putting the input audio under `raw/`; the output will be saved under `results/`.

## 2. Data preparation and training
### 2.1 Data preparation
>Currently, both WAV and Ogg audio formats are supported. The sampling rate should preferably be above 24kHz; the program automatically handles sampling-rate and channel-count issues. The sampling rate must not be below 16kHz (which is rarely a problem). \
The audio should preferably be sliced into segments of 5-15 seconds. There is no strict requirement on segment length, but it is best for segments not to be too long or too short. The audio must be the target speaker's dry vocals, without background music or other voices, and preferably without heavy reverb. If the audio was obtained through vocal extraction, please keep the audio quality as high as possible. \
Currently, only single-speaker training is supported, and the total audio duration should be 3 hours or more. No additional labeling is required; just place the audio files under the raw_data_dir described below. The structure of that directory does not matter, as the program will locate the files by itself.
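
Before pre-processing, it can be handy to estimate the total duration of the dataset and spot files with too low a sampling rate. The sketch below is only an example and assumes the soundfile package (used here for WAV/Ogg reading) is installed:
```
# Rough dataset check; illustrative only, assumes the soundfile package is installed.
import pathlib
import soundfile as sf

raw_data_dir = pathlib.Path('data/raw/nyaru')    # example path, see raw_data_dir in section 2.2
files = [p for p in raw_data_dir.rglob('*') if p.suffix.lower() in ('.wav', '.ogg')]

total_seconds = 0.0
for p in files:
    info = sf.info(str(p))
    total_seconds += info.duration
    if info.samplerate < 16000:
        print(f'warning: {p} is below 16kHz ({info.samplerate} Hz)')

print(f'{len(files)} files, {total_seconds / 3600:.2f} hours total (3h+ recommended)')
```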

### 2.2 Editing hyperparameters
>First, make a backup copy of config.yaml (this file is for the 24kHz vocoder; use config_nsf.yaml for the 44.1kHz vocoder), then edit it: \
The parameters below might be used (using project name `nyaru` as an example):
```
K_step: 1000
# The total number of diffusion steps. Changing this is not recommended.
binary_data_dir: data/binary/nyaru
# The path to the pre-processed data: the last part needs to be changed to the current project name.
config_path: training/config.yaml
# The full path to this config.yaml file itself (the one you are using). Since data is written into this file during pre-processing, it must be the full path to where the yaml file is actually stored.
choose_test_manually: false
# Manually selecting a test set. It is disabled by default, and the program will automatically randomly select 5 audio files as the test set.
# If set to true, enter the prefixes of the filenames of the test files in test_prefixes. The program will use the files starting with the corresponding prefix(es) as the test set.
# This is a list and can contain multiple prefixes, e.g.
test_prefixes:
- test
- aaaa
- 5012
- speaker1024
# IMPORTANT: the test set CAN NOT be empty. To avoid unintended effects, it is recommended to avoid manually selecting the test set.
endless_ds: False
# If your dataset is too small, each epoch will pass very fast. Setting this to True will treat 1000 epochs as a single one.
hubert_path: checkpoints/hubert/hubert.pt
# The path to the HuBERT model; make sure this path is correct. In most cases, decompressing the checkpoints.zip archive puts the model in the right place, so no edits are needed. The torch version is now used for inference.
hubert_gpu: True
# Whether or not to use GPU for HuBERT (a module in the model) during pre-processing. If set to False, the CPU will be used and processing time will increase significantly. Note that whether the GPU is used during inference is controlled separately and is not affected by this. Since HuBERT switched to the torch version, pre-processing and inference on audio under 1 minute can now run on a 1060 6G GPU without exceeding VRAM limits, so it is usually not necessary to set this to False.
lr: 0.0008
# Initial learning rate: this value corresponds to a batch size of 88; if the batch size is smaller, you can lower this value a bit.
decay_steps: 20000
# For every 20,000 steps, the learning rate will decay to half the original. If the batch size is small, please increase this value.
# For a batch size of about 30-40, the recommended values are lr=0.0004, decay_steps=40000 (see the sketch after this block).
max_frames: 42000
max_input_tokens: 6000
max_sentences: 88
max_tokens: 128000
# The batch size is calculated dynamically based on these parameters. If unsure about their exact meaning, you can change the max_sentences parameter only, which sets the maximum limit for the batch size to avoid exceeding VRAM limits.
pe_ckpt: checkpoints/0102_xiaoma_pe/model_ckpt_steps_60000.ckpt
# Path to the pe model. Make sure this file exists. Refer to the inference section for its purpose.
raw_data_dir: data/raw/nyaru
# Path to the directory of the raw data before pre-processing. Please put the raw audio files under this directory. The structure inside does not matter, as the program will automatically parse it.
residual_channels: 384
residual_layers: 20
# A group of parameters that control the core network size. The higher the values, the more parameters the network has and the slower it trains, but this does not necessarily lead to better results. For larger datasets, you can change the first parameter to 512. You can experiment with them on your own. However, it is best to leave them as they are if you are not sure what you are doing.
speaker_id: nyaru
# The name of the target speaker. Currently, only single-speaker is supported. (This parameter is for reference only and has no functional impact)
use_crepe: true
# Use CREPE to extract F0 for pre-processing. Enable it for better results, or disable it for faster processing.
val_check_interval: 2000
# Inference on the test set and save checkpoints every 2000 steps.
vocoder_ckpt: checkpoints/0109_hifigan_bigpopcs_hop128
# For 24kHz models, this should be the path to the directory of the corresponding vocoder. For 44.1kHz models, this should be the path to the corresponding vocoder file itself. Be careful, do not put the wrong one.
work_dir: checkpoints/nyaru
# Change the last part to the project name. (It can also be deleted or left completely empty so that this directory is generated automatically, but do not put an arbitrary name here.)
no_fs2: true
# Simplification of the network encoder. It can reduce the model size and speed up training. No direct evidence of damage to the network performance has been found so far. Enabled by default.
```
> Do not edit the other parameters if you do not know what they do, even if you think you can guess their meaning from their names.
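
Assuming the learning rate simply halves every decay_steps steps as described above, its value over training can be estimated as follows (an illustration only, not the trainer's actual scheduler code):
```
# Illustration of the decay described above; not the trainer's scheduler code.
lr = 0.0008           # initial learning rate (batch size ~88)
decay_steps = 20000

def estimated_lr(step):
    return lr * 0.5 ** (step / decay_steps)

for step in (0, 20000, 40000, 100000):
    print(step, round(estimated_lr(step), 6))
# 0 0.0008 / 20000 0.0004 / 40000 0.0002 / 100000 2.5e-05
```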
### 2.3 Data pre-processing
Run the following commands under the diff-svc directory: \
#windows
```
set PYTHONPATH=.
set CUDA_VISIBLE_DEVICES=0
python preprocessing/binarize.py --config training/config.yaml
```
#linux
```
export PYTHONPATH=.
CUDA_VISIBLE_DEVICES=0 python preprocessing/binarize.py --config training/config.yaml
```
For pre-processing, @IceKyrin has prepared code that extracts HuBERT and other features separately. If your VRAM is insufficient for the normal procedure, you can run `python ./network/hubert/hubert_model.py` first and then run the pre-processing commands; the pipeline will recognize the pre-extracted HuBERT features.

### 2.4 Training
#windows
```
set CUDA_VISIBLE_DEVICES=0
python run.py --config training/config.yaml --exp_name nyaru --reset
```
#linux
```
CUDA_VISIBLE_DEVICES=0 python run.py --config training/config.yaml --exp_name nyaru --reset
```
>You need to change `exp_name` to your project name and edit the config path. Please make sure that the config file used for training is the same as the one used for pre-processing.\
*Important*: After finishing training (on the cloud), if you did not pre-process the data locally, you will need to download the corresponding ckpt file AND the config file for inference. Do not use the one on your local machine since pre-processing writes data into the config file. Make sure the config file used for inference is the same as the one from pre-processing.
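
If you want to double-check that the config you are about to use for inference matches the one produced by pre-processing, a small comparison helper like the one below may help. It is hypothetical, not part of the repository, and assumes PyYAML is installed:
```
# Hypothetical config comparison helper; assumes PyYAML is installed.
import yaml

def diff_configs(path_a, path_b):
    with open(path_a) as fa, open(path_b) as fb:
        a, b = yaml.safe_load(fa), yaml.safe_load(fb)
    for key in sorted(set(a) | set(b)):
        if a.get(key) != b.get(key):
            print(f'{key}: {a.get(key)!r} != {b.get(key)!r}')

# Example: compare the cloud-downloaded config against a local copy.
diff_configs('training/config.yaml', 'checkpoints/nyaru/config.yaml')
```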

### 2.5 Possible issues:

>**2.5.1 'Upsample' object has no attribute 'recompute_scale_factor'**\
This issue was found in the torch version corresponding to CUDA 11.3. If it occurs, locate the `upsampling.py` file in your torch package (for example, in a conda environment it is at conda_dir\envs\environment_dir\Lib\site-packages\torch\nn\modules\upsampling.py) and edit lines 153-154 from
```
return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners,recompute_scale_factor=self.recompute_scale_factor)
```
>to
```
return F.interpolate(input, self.size, self.scale_factor, self.mode, self.align_corners)
# recompute_scale_factor=self.recompute_scale_factor)
```

>**2.5.2 no module named 'utils'**\
Please set the following in your runtime environment (such as a Colab notebook):
```
import os
os.environ['PYTHONPATH']='.'
!CUDA_VISIBLE_DEVICES=0 python preprocessing/binarize.py --config training/config.yaml
```
Note that this must be done in the project's root directory.

>**2.5.3 cannot load library 'libsndfile.so'**\
This is an error that may occur in a Linux environment. Please run the following command:
```
apt-get install libsndfile1 -y
```
>**2.5.4 cannot import name 'consume_prefix_in_state_dict_if_present'**\
The installed torch version is too old. Please upgrade to a newer version of torch.

>**2.5.5 Data pre-processing is too slow**\
Check whether `use_crepe` is enabled in the config; turning it off can significantly increase speed.\
Check whether `hubert_gpu` is enabled in the config.

If there are any other questions, feel free to join the QQ channel or Discord server to ask.
