
Release Emilia and Emilia-Pipe #227

Merged: 55 commits, Jul 9, 2024
Commits (55)
d32867c
init: Emilia Pipeline
yuantuo666 Jun 22, 2024
35aa804
Update README.md
HarryHe11 Jun 22, 2024
b37d86e
Update README.md
HarryHe11 Jun 22, 2024
a62a796
Update README.md
HarryHe11 Jun 22, 2024
87ccefb
Update README.md
HarryHe11 Jun 22, 2024
cc0470a
Update README.md
HarryHe11 Jun 22, 2024
b6508f1
Update README.md
lixuyuan102 Jul 1, 2024
e71b6d5
Update README.md
HarryHe11 Jul 1, 2024
50ddee9
Update README.md
lixuyuan102 Jul 1, 2024
2c49f03
Merge branch 'main' into main
yuantuo666 Jul 1, 2024
0275fcc
Update README.md
shangqwe123 Jul 1, 2024
a6cf792
Update README.md
shangqwe123 Jul 1, 2024
be09e86
Update README.md
lixuyuan102 Jul 1, 2024
bf68cf0
Update README.md
HarryHe11 Jul 1, 2024
79f4352
Update README.md
RMSnow Jul 1, 2024
43356f7
Update README.md
HarryHe11 Jul 2, 2024
d81ffd7
Update README.md
HarryHe11 Jul 2, 2024
cbab338
Update README.md
HarryHe11 Jul 2, 2024
3b1f1fd
Update env.sh
HarryHe11 Jul 2, 2024
d282fea
Update README.md
HarryHe11 Jul 2, 2024
2d4d87c
fix: LICENSE & TODO
yuantuo666 Jul 2, 2024
7c33a9c
fix: reformat
yuantuo666 Jul 2, 2024
f2366d2
Update README.md
lixuyuan102 Jul 3, 2024
a5eb5b7
Update README.md
lixuyuan102 Jul 3, 2024
fecdbfd
Update README.md
HarryHe11 Jul 3, 2024
beb4858
Update README.md
HarryHe11 Jul 3, 2024
55009e4
Update README.md
HarryHe11 Jul 3, 2024
7d1ed25
Update README.md
HarryHe11 Jul 3, 2024
315d3b7
Update README.md
HarryHe11 Jul 3, 2024
eedeed9
Update README.md
HarryHe11 Jul 4, 2024
970fa7b
Update README.md
HarryHe11 Jul 4, 2024
e446013
Update README.md
HarryHe11 Jul 4, 2024
96586eb
Update main.py
HarryHe11 Jul 5, 2024
96fa2a0
Update main.py
HarryHe11 Jul 5, 2024
080fdd8
Update main.py
HarryHe11 Jul 5, 2024
dbfe7a3
Update TODOs
HarryHe11 Jul 5, 2024
3927074
Update silero_vad.py
HarryHe11 Jul 5, 2024
5f3157d
Add comments on main.py
HarryHe11 Jul 5, 2024
cd550e5
update: todos
yuantuo666 Jul 6, 2024
56d8ab5
Update README.md
lixuyuan102 Jul 7, 2024
cba17f6
Update README.md
lixuyuan102 Jul 7, 2024
8f73a15
Update README.md
lixuyuan102 Jul 7, 2024
ff2fa00
update: test bug fix
yuantuo666 Jul 7, 2024
0a7b330
update: license
yuantuo666 Jul 7, 2024
3d78ce7
update: README
yuantuo666 Jul 7, 2024
7aa066b
Update README.md
HarryHe11 Jul 9, 2024
183466b
Update README.md
HarryHe11 Jul 9, 2024
d1b2946
Update README.md
HarryHe11 Jul 9, 2024
cf17295
Update README.md
yuantuo666 Jul 9, 2024
98e9991
Adding Demo Page link
yuantuo666 Jul 9, 2024
d7e229a
Update README.md
HarryHe11 Jul 9, 2024
c94523d
Align Reference Indent
yuantuo666 Jul 9, 2024
034f32c
Update README.md
RMSnow Jul 9, 2024
1110e63
Update README.md
HarryHe11 Jul 9, 2024
5fea925
Update README.md
RMSnow Jul 9, 2024
13 changes: 5 additions & 8 deletions README.md
@@ -25,15 +25,11 @@
- **TTM**: Text to Music (👨‍💻 developing)
- more…

In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks.

Here is the Amphion v0.1 demo, whose voice, audio effects, and singing voice are generated by our models. Just enjoy it!

[amphion-v0.1-en](https://github.com/open-mmlab/Amphion/assets/24860155/7fcdcea5-3d95-4b31-bd93-4b4da734ef9b)
In addition to the specific generation tasks, Amphion includes several **vocoders** and **evaluation metrics**. A vocoder is an important module for producing high-quality audio signals, while evaluation metrics are critical for ensuring consistent metrics in generation tasks. Moreover, Amphion is dedicated to advancing audio generation in real-world applications, such as building **large-scale datasets** for speech synthesis.

## 🚀 News
- **2024/6/17**: Amphion has a new release for its VALL-E models, it uses Llama as its underlying architecture and has better model performance, faster training speed, and more readable codes compared to our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
- **2024/07/01**: Amphion now releases **Emilia**, the first open-source multilingual in-the-wild dataset for speech generation with over 101k hours of speech data, and the **Emilia-Pipe**, the first open-source preprocessing pipeline designed to transform in-the-wild speech data into high-quality training data with annotations for speech generation! [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2407.05361) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia) [![demo](https://img.shields.io/badge/WebPage-Demo-red)](https://emilia-dataset.github.io/Emilia-Demo-Page/) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](preprocessors/Emilia/README.md)
- **2024/06/17**: Amphion has a new release for its **VALL-E** model! It uses Llama as its underlying architecture and has better model performance, faster training speed, and more readable codes compared to our first version. [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/tts/VALLE_V2/README.md)
- **2024/03/12**: Amphion now supports **NaturalSpeech3 FACodec** and releases pretrained checkpoints. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2403.03100) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-model-yellow)](https://huggingface.co/amphion/naturalspeech3_facodec) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-demo-pink)](https://huggingface.co/spaces/amphion/naturalspeech3_facodec) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](models/codec/ns3_codec/README.md)
- **2024/02/22**: The first Amphion visualization tool, **SingVisio**, is released. [![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2402.12660) [![openxlab](https://cdn-static.openxlab.org.cn/app-center/openxlab_app.svg)](https://openxlab.org.cn/apps/detail/Amphion/SingVisio) [![Video](https://img.shields.io/badge/Video-Demo-orange)](https://github.com/open-mmlab/Amphion/assets/33707885/0a6e39e8-d5f1-4288-b0f8-32da5a2d6e96) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](egs/visualization/SingVisio/README.md)
- **2023/12/18**: Amphion v0.1 release. [![arXiv](https://img.shields.io/badge/arXiv-Paper-<COLOR>.svg)](https://arxiv.org/abs/2312.09911) [![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Amphion-pink)](https://huggingface.co/amphion) [![youtube](https://img.shields.io/badge/YouTube-Demo-red)](https://www.youtube.com/watch?v=1aw0HhcggvQ) [![readme](https://img.shields.io/badge/README-Key%20Features-blue)](https://github.com/open-mmlab/Amphion/pull/39)
@@ -79,7 +75,8 @@ Amphion provides a comprehensive objective evaluation of the generated audio. Th

### Datasets

Amphion unifies the data preprocess of the open-source datasets including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The supported dataset list can be seen [here](egs/datasets/README.md) (updating).
- Amphion unifies the data preprocessing of open-source datasets including [AudioCaps](https://audiocaps.github.io/), [LibriTTS](https://www.openslr.org/60/), [LJSpeech](https://keithito.com/LJ-Speech-Dataset/), [M4Singer](https://github.com/M4Singer/M4Singer), [Opencpop](https://wenet.org.cn/opencpop/), [OpenSinger](https://github.com/Multi-Singer/Multi-Singer.github.io), [SVCC](http://vc-challenge.org/), [VCTK](https://datashare.ed.ac.uk/handle/10283/3443), and more. The supported dataset list can be seen [here](egs/datasets/README.md) (updating).
- Amphion (exclusively) supports the [**Emilia**](preprocessors/Emilia/README.md) dataset and its preprocessing pipeline **Emilia-Pipe** for in-the-wild speech data!

### Visualization

165 changes: 165 additions & 0 deletions preprocessors/Emilia/README.md
@@ -0,0 +1,165 @@
## Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation
[![arXiv](https://img.shields.io/badge/arXiv-Paper-COLOR.svg)](https://arxiv.org/abs/2407.05361)
[![hf](https://img.shields.io/badge/%F0%9F%A4%97%20HuggingFace-Dataset-yellow)](https://huggingface.co/datasets/amphion/Emilia)
[![demo](https://img.shields.io/badge/WebPage-Demo-red)](https://emilia-dataset.github.io/Emilia-Demo-Page/)

This is the official repository 👑 for the **Emilia** dataset and the source code for the **Emilia-Pipe** speech data preprocessing pipeline.

## News 🔥
- **2024/07/08**: Our preprint [paper](https://arxiv.org/abs/2407.05361) is now available! 🔥🔥🔥
- **2024/07/03**: We welcome everyone to check our [homepage](https://emilia-dataset.github.io/Emilia-Demo-Page/) for a brief introduction to the Emilia dataset and our demos!
- **2024/07/01**: Emilia and Emilia-Pipe are released! We welcome everyone to explore them! 🎉🎉🎉

## About ⭐️
🎤 **Emilia** is a comprehensive, multilingual dataset with the following features:
- containing over *101k* hours of speech data;
- covering six different languages: *English (En), Chinese (Zh), German (De), French (Fr), Japanese (Ja), and Korean (Ko)*;
- containing diverse speech data with *various speaking styles*.

A detailed description of the dataset can be found in our paper.

πŸ› οΈ **Emilia-Pipe** is the first open-source preprocessing pipeline designed to transform raw, in-the-wild speech data into high-quality training data with annotations for speech generation. This pipeline can process one hour of raw audio into model-ready data in just a few minutes, requiring only the URLs of the audio or video sources.

*To use the Emilia dataset, you can download the raw audio files from the [provided URL list](https://huggingface.co/datasets/amphion/Emilia) and use our open-source [Emilia-Pipe](https://github.com/open-mmlab/Amphion/tree/main/preprocessors/Emilia) preprocessing pipeline to preprocess the raw data and rebuild the dataset. Please note that Emilia does not hold the copyright to the audio; the copyright remains with the original owners of the videos or audio. Additionally, users can easily use Emilia-Pipe to preprocess their own raw speech data for custom needs.*

By open-sourcing the Emilia-Pipe code, we aim to enable the speech community to collaborate on large-scale speech generation research.

The following sections of this README introduce the installation and usage of Emilia-Pipe.

## Pipeline Overview 👀

The Emilia-Pipe includes the following major steps:

0. Standardization: Audio normalization
1. Source Separation: Long audio -> Long audio without BGM
2. Speaker Diarization: Get medium-length single-speaker speech data
3. Fine-grained Segmentation by VAD: Get 3-30s single-speaker speech segments
4. ASR: Get transcriptions of the speech segments
5. Filtering: Obtain the final processed dataset
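The stage order above can be sketched in plain Python; the function names and data shapes below are illustrative placeholders, not the actual `main.py` API:

```python
# Hypothetical sketch of the Emilia-Pipe stage order. All function names and
# data shapes are placeholders; the real pipeline lives in main.py.

def standardize(audio):          # 0. normalize loudness / sample rate
    return dict(audio, normalized=True)

def separate_source(audio):      # 1. remove background music
    return dict(audio, bgm_removed=True)

def diarize(audio):              # 2. split into single-speaker chunks
    return [dict(audio, speaker=s) for s in ("spk0", "spk1")]  # mock speakers

def segment_vad(chunk):          # 3. cut into 3-30 s speech segments
    return [dict(chunk, start=0.0, end=10.0)]

def transcribe(segment):         # 4. ASR transcription
    return dict(segment, text="...")

def keep(segment):               # 5. filtering (e.g. quality / language checks)
    return True

def run_pipeline(audio):
    audio = separate_source(standardize(audio))
    segments = [transcribe(seg)
                for chunk in diarize(audio)
                for seg in segment_vad(chunk)]
    return [seg for seg in segments if keep(seg)]

result = run_pipeline({"path": "example.wav"})
print(len(result))  # 2: two mock speakers, one segment each
```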

## Setup Steps 👨‍💻

### 0. Prepare Environment

1. Install Python and CUDA.
2. Run the following commands to install the required packages:

```bash
conda create -y -n AudioPipeline python=3.9
conda activate AudioPipeline

bash env.sh
```

3. Download the model files from the third-party repositories.
- Manually download the checkpoints of UVR-MDX-NET-Inst_HQ_3 ([UVR-MDX-NET-Inst_HQ_3.onnx](https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/UVR-MDX-NET-Inst_HQ_3.onnx)) and DNSMOS P.835 ([sig_bak_ovr.onnx](https://github.com/microsoft/DNS-Challenge/blob/master/DNSMOS/DNSMOS/sig_bak_ovr.onnx)), then save their paths for the next step's configuration (i.e., the #2 and #3 TODOs).
- Create an access token for pyannote/speaker-diarization-3.1 following [the guide](https://huggingface.co/pyannote/speaker-diarization-3.1#requirements), then save it for the next step's configuration (i.e., the #4 TODO).
- Make sure you have a stable connection to GitHub and Hugging Face. The checkpoints of Silero and WhisperX-medium will be downloaded automatically on the pipeline's first run.
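A small helper can verify that the two manually fetched checkpoints are in place before the first run. The directory layout under the model root is an assumption mirroring the example paths in `config.json`; the DNSMOS URL is the GitHub page linked above (use its Raw link for the actual file):

```python
from pathlib import Path

# Checkpoint files that must be downloaded manually, relative to a chosen
# model root (layout is an assumption based on the config.json examples).
CHECKPOINTS = {
    "separate_model/UVR-MDX-NET-Inst_HQ_3.onnx":
        "https://github.com/TRvlvr/model_repo/releases/download/all_public_uvr_models/UVR-MDX-NET-Inst_HQ_3.onnx",
    "mos_model/DNSMOS/sig_bak_ovr.onnx":
        "https://github.com/microsoft/DNS-Challenge/blob/master/DNSMOS/DNSMOS/sig_bak_ovr.onnx",
}

def missing_checkpoints(model_root):
    """Return the checkpoint files not yet present under model_root."""
    return [rel for rel in CHECKPOINTS if not (Path(model_root) / rel).exists()]

# Lists both entries until you have downloaded them into the model root.
print(missing_checkpoints("/path/to/model"))
```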


### 1. Modify Config File

Edit the `config.json` file according to the following TODOs.

```json
{
    "language": {
        "multilingual": true,
        "supported": [
            "zh",
            "en",
            "fr",
            "ja",
            "ko",
            "de"
        ]
    },
    "entrypoint": {
        // TODO: Fill in the input_folder_path.
        "input_folder_path": "examples", // #1: Data input folder for processing
        "SAMPLE_RATE": 24000
    },
    "separate": {
        "step1": {
            // TODO: Fill in the source separation model's path.
            "model_path": "/path/to/model/separate_model/UVR-MDX-NET-Inst_HQ_3.onnx", // #2: Model path
            "denoise": true,
            "margin": 44100,
            "chunks": 15,
            "n_fft": 6144,
            "dim_t": 8,
            "dim_f": 3072
        }
    },
    "mos_model": {
        // TODO: Fill in the DNSMOS prediction model's path.
        "primary_model_path": "/path/to/model/mos_model/DNSMOS/sig_bak_ovr.onnx" // #3: Model path
    },
    // TODO: Fill in your huggingface access token for pyannote.
    "huggingface_token": "<HUGGINGFACE_ACCESS_TOKEN>" // #4: Huggingface access token for pyannote
}
```
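Note that the `//` annotations above are not valid JSON, so a plain `json.load` on an annotated copy will fail. A minimal tolerant loader, as a sketch only (the pipeline's actual config loading may differ):

```python
import json
import re

def load_annotated_json(text):
    """Parse JSON after stripping // line comments.

    The regex is naive (it would also strip a '//' inside a string value),
    but it is sufficient for this config, whose strings contain no '//'.
    """
    stripped = re.sub(r'//[^\n]*', '', text)
    return json.loads(stripped)

cfg = load_annotated_json('''{
    "entrypoint": {
        // TODO: Fill in the input_folder_path.
        "input_folder_path": "examples",
        "SAMPLE_RATE": 24000
    }
}''')
print(cfg["entrypoint"]["SAMPLE_RATE"])  # 24000
```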

### 2. Run Script

1. Change the `input_folder_path` in `config.json` to the folder path where the downloaded audio files are stored (i.e. #1 TODO).
2. Run the following command to process the audio files:

```bash
conda activate AudioPipeline
export CUDA_VISIBLE_DEVICES=0 # GPUs used to run the pipeline, separated by commas

python main.py
```

3. The processed audio will be saved into the `<input_folder_path>_processed` folder.
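The output naming described above can be sketched as a tiny helper; this mirrors the documented `<input_folder_path>_processed` convention, and the exact rule in `main.py` may differ:

```python
from pathlib import Path

def processed_folder(input_folder_path):
    """Return the sibling '<input_folder_path>_processed' output folder."""
    p = Path(input_folder_path)
    return p.with_name(p.name + "_processed")

print(processed_folder("examples"))  # examples_processed
```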


### 3. Check the Results

The processed audio files (24 kHz sample rate by default) will be saved into the `<input_folder_path>_processed` folder. The results for each source audio are saved in a folder with the audio's original name and include the following:

1. **MP3 file**: `<original_name>_<idx>.mp3`, where `idx` corresponds to the segment's index in the JSON-encoded array.
2. **JSON file**: `<original_name>.json`

```json
[
    {
        "text": "So, don't worry about that. But, like for instance, like yesterday was very hard for me to say, you know what, I should go to bed.", // Transcription
        "start": 67.18, // Start timestamp, in seconds
        "end": 74.41, // End timestamp, in seconds
        "language": "en", // Language
        "dnsmos": 3.44 // DNSMOS P.835 score
    }
]
```
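A sketch of consuming this per-audio JSON downstream, e.g. keeping only segments with a decent DNSMOS score and the 3-30 s duration range the pipeline targets. The DNSMOS threshold here is an illustrative assumption, not the pipeline's actual cutoff:

```python
import json

def good_segments(json_text, min_dnsmos=3.0):
    """Keep segments scoring at least min_dnsmos and lasting 3-30 seconds."""
    segments = json.loads(json_text)
    return [s for s in segments
            if s["dnsmos"] >= min_dnsmos and 3.0 <= s["end"] - s["start"] <= 30.0]

# One entry shaped like the example output above (7.23 s, DNSMOS 3.44).
example = '''[{"text": "...", "start": 67.18, "end": 74.41,
               "language": "en", "dnsmos": 3.44}]'''
print(len(good_segments(example)))  # 1
```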

## Acknowledgement 🔔
We acknowledge the wonderful work by these excellent developers!
- Source Separation: [UVR-MDX-NET-Inst_HQ_3](https://github.com/TRvlvr/model_repo/releases/tag/all_public_uvr_models)
- VAD: [snakers4/silero-vad](https://github.com/snakers4/silero-vad)
- Speaker Diarization: [pyannote/speaker-diarization-3.1](https://huggingface.co/pyannote/speaker-diarization-3.1)
- ASR: [m-bain/whisperX](https://github.com/m-bain/whisperX)
- DNSMOS Prediction: [DNSMOS P.835](https://github.com/microsoft/DNS-Challenge)


## Reference 📖
If you use the Emilia dataset or the Emilia-Pipe pipeline, please cite the following papers:
```bibtex
@article{emilia,
title={Emilia: An Extensive, Multilingual, and Diverse Speech Dataset for Large-Scale Speech Generation},
author={He, Haorui and Shang, Zengqiang and Wang, Chaoren and Li, Xuyuan and Gu, Yicheng and Hua, Hua and Liu, Liwei and Yang, Chen and Li, Jiaqi and Shi, Peiyang and Wang, Yuancheng and Chen, Kai and Zhang, Pengyuan and Wu, Zhizheng},
journal={arXiv},
  volume={abs/2407.05361},
year={2024}
}
```
```bibtex
@article{amphion,
title={Amphion: An Open-Source Audio, Music and Speech Generation Toolkit},
author={Zhang, Xueyao and Xue, Liumeng and Gu, Yicheng and Wang, Yuancheng and He, Haorui and Wang, Chaoren and Chen, Xi and Fang, Zihao and Chen, Haopeng and Zhang, Junan and Tang, Tze Ying and Zou, Lexiao and Wang, Mingxuan and Han, Jun and Chen, Kai and Li, Haizhou and Wu, Zhizheng},
journal={arXiv},
  volume={abs/2312.09911},
year={2024},
}
```
35 changes: 35 additions & 0 deletions preprocessors/Emilia/config.json
@@ -0,0 +1,35 @@
{
    "language": {
        "multilingual": true,
        "supported": [
            "zh",
            "en",
            "fr",
            "ja",
            "ko",
            "de"
        ]
    },
    "entrypoint": {
        // TODO: Fill in the input_folder_path.
        "input_folder_path": "examples",
        "SAMPLE_RATE": 24000
    },
    "separate": {
        "step1": {
            // TODO: Fill in the source separation model's path.
            "model_path": "/path/to/model/separate_model/UVR-MDX-NET-Inst_HQ_3.onnx",
            "denoise": true,
            "margin": 44100,
            "chunks": 15,
            "n_fft": 6144,
            "dim_t": 8,
            "dim_f": 3072
        }
    },
    "mos_model": {
        // TODO: Fill in the DNSMOS prediction model's path.
        "primary_model_path": "/path/to/model/mos_model/DNSMOS/sig_bak_ovr.onnx"
    },
    "huggingface_token": "<HUGGINGFACE_ACCESS_TOKEN>"
}
10 changes: 10 additions & 0 deletions preprocessors/Emilia/env.sh
@@ -0,0 +1,10 @@
#!/bin/bash
# Copyright (c) 2024 Amphion.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.

conda install ffmpeg -y
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia -y
pip install -r requirements.txt
pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/