
# An unofficial reproduction of VALL-E-X

This is an unofficial reproduction of VALL-E-X, built with reference to the repository at https://github.com/Plachtaa/VALL-E-X.

## Checkpoints

Pretrained models can be found on Google Drive.

## Decode with checkpoints

```bash
# First, set model_home in the following script to the location of the downloaded/pretrained models.
# Second, set prompt_txt, prompt_audio, and target_txt, together with the corresponding language id.
bash examples/vallex/scripts/inference.sh
```

## Data preparation

VALL-E-X is trained on a dataset containing discrete speech tokens and text tokens.

- Prepare an `info.tsv` file in the following format (`path \t duration`), containing the path and duration of each speech file (a helper sketch is given after this list):

  ```
  SPEECH_PATH1    DURATION1
  SPEECH_PATH2    DURATION2
  SPEECH_PATH3    DURATION3
  ...
  ```
- Extract codec tokens according to `info.tsv`:

  ```bash
  bash examples/vallex/data_pretreatment/extract_codec.sh
  ```

  This yields 8 files, `codec0.tsv` to `codec7.tsv`; the tokens of the i-th (0~7) codec layer are saved separately into `codec[i].tsv`:

  ```
  304 123 453 255 256 345 124 666 543 ...
  654 662 543 463 674 537 273 473 973 ...
  355 345 766 255 234 768 275 785 102 ...
  ...
  ```
- Prepare the transcript file `trans.tsv`, with each line corresponding to one speech file:

  ```
  Text for SPEECH1
  Text for SPEECH2
  Text for SPEECH3
  ...
  ```

  Next, convert the text into tokens with a tool such as BPE or G2P and save the result as `st.tsv` (a tokenization sketch is given after this list):

  ```
  1521 467 885 2367 242 ...
  2362 3261 356 167 1246 2364 ...
  1246 123 432 134 53 13 ...
  ...
  ```
- Convert the data (`codec[i].tsv` and `st.tsv`) into binary files for fast reading:

  ```bash
  # We use the fairseq preprocess tool for this conversion
  python /home/wangtianrui/codes/fairseq/fairseq_cli/preprocess.py \
      --only-source \
      --trainpref /home/wangtianrui/develop_dataset/st.tsv \
      --destdir /home/wangtianrui/develop_dataset/data_bin \
      --thresholdsrc 0 \
      --srcdict /home/wangtianrui/develop_dataset/dict.st.txt \
      --workers $(nproc)

  outdir=/home/wangtianrui/develop_dataset/data_bin
  for ((i=0;i<=7;i++))
  do
      echo $i
      outname=train.at${i}.zh
      python /home/wangtianrui/codes/fairseq/fairseq_cli/preprocess.py \
          --only-source \
          --trainpref codec${i}.tsv \
          --destdir $outdir \
          --thresholdsrc 0 \
          --srcdict /home/wangtianrui/develop_dataset/dict.at.txt \
          --workers $(nproc)
      # fairseq writes train.bin/train.idx; rename them so each codec layer keeps its own files
      mv $outdir/train.bin $outdir/${outname}.bin
      mv $outdir/train.idx $outdir/${outname}.idx
  done
  ```

  Here, `dict.at.txt` and `dict.st.txt` are simple index-to-index dictionaries for the discrete speech tokens and text tokens, respectively, as shown in examples/vallex/data_pretreatment.
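
As referenced in the first step of the list above, here is a minimal sketch of how `info.tsv` could be generated. All paths are placeholders, and the `soundfile` package is an assumption (any audio library that reports durations works):

```python
# Hypothetical helper: build info.tsv (path \t duration) from a directory of wav files.
import glob
import soundfile as sf  # assumed dependency, not part of this repo

with open("info.tsv", "w") as out:
    for path in sorted(glob.glob("/path/to/wavs/*.wav")):  # placeholder directory
        duration = sf.info(path).duration  # duration in seconds, read from the file header
        out.write(f"{path}\t{duration:.2f}\n")
```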
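
Likewise, a minimal sketch of the text tokenization step. The SentencePiece BPE model here is a stand-in; whichever tokenizer is used, the resulting ids must be consistent with `dict.st.txt`:

```python
# Hypothetical helper: tokenize trans.tsv into st.tsv with a SentencePiece BPE model.
import sentencepiece as spm  # assumed dependency

sp = spm.SentencePieceProcessor(model_file="bpe.model")  # placeholder model path
with open("trans.tsv") as fin, open("st.tsv", "w") as fout:
    for line in fin:
        ids = sp.encode(line.strip(), out_type=int)  # text -> integer token ids
        fout.write(" ".join(map(str, ids)) + "\n")
```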

In this way, VALL-E-X can be trained with `dataset_config.train_data_path` set to the directory containing the binary files. We also release a tiny dataset for reference on Google Drive.
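
For orientation, and assuming the per-layer renaming in the loop above, the binary directory would contain files roughly like the following (exact names depend on the dataset loader):

```
data_bin/
├── train.bin           # binarized st.tsv (text tokens)
├── train.idx
├── train.at0.zh.bin    # binarized codec0.tsv
├── train.at0.zh.idx
├── ...
├── train.at7.zh.bin
└── train.at7.zh.idx
```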

## Train a new AR model

Once the dataset is prepared, modify `train_data_path` in the following script; you can then start training or fine-tuning.

```bash
bash examples/vallex/scripts/vallex.sh
```