NaturalVoices introduces a novel data-sourcing pipeline alongside the release of a new natural speech dataset for voice conversion (VC). The pipeline leverages proven, high-performance techniques to extract detailed information, such as automatic speech recognition (ASR) transcripts, speaker diarization, and signal-to-noise ratio (SNR) estimates, from raw podcast data. Using this pipeline, we create a large-scale, spontaneous, expressive, and emotionally rich speech dataset tailored for VC applications. Objective and subjective evaluations demonstrate the effectiveness of our pipeline in providing natural and expressive data for VC.
The image above illustrates our data-sourcing pipeline with its various modules. To see an overview of the audio segments, visit the Pages website [website].
The audio files are zipped and uploaded in batches. Each zip file can be unzipped individually and is around 40 GB, so please ensure you have sufficient free storage and be patient, as the download may take some time.
The audio files will be saved in the audios_zipped folder in the working directory. To automatically download all the zipped files, please run the following command:
$ bash download_audios.sh
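Each downloaded batch can then be extracted on its own. Below is a minimal extraction sketch in Python; the audios_zipped location matches the download script above, but the audios target folder is a placeholder you can rename:

import glob
import zipfile

# Extract every downloaded batch; each archive expands to roughly 40 GB.
for zip_path in glob.glob('audios_zipped/*.zip'):
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall('audios')  # placeholder output folder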
If you wish to manually download a file, please visit this [website].
The previous download instructions are for 16 kHz audio. If you wish to download the audio at its original sampling rate, please use the following command:
$ bash download_audios_original.sh
If you wish to manually download a file, please visit this [website].
The metadata contains the outputs of Faster-Whisper and PyAnnote (diarization + voice activity detection + speaker overlap), as well as all_data.json, which contains the utterance-level predictions.
To download the metadata, run the following command:
$ bash download_meta.sh
If you wish to manually download a file, please visit this [website].
After downloading all the files, you should have the following file structure:
NaturalVoices
    vad
        MSP-PODCAST_0001
        ...
    pyannote
        MSP-PODCAST_0001
        ...
    faster-whisper
        MSP-PODCAST_0001
        ...
    all_data.json
For an example of how to open and display the metadata, please see the example_code file. In summary: each file inside the directories is a pickle file that can be loaded in Python using the following code:
import pickle

def load_pickle(file_path):
    # Load one pickled metadata file (e.g., vad/MSP-PODCAST_0001).
    with open(file_path, 'rb') as f:
        data = pickle.load(f)
    return data
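As a usage sketch (the directory layout follows the structure shown above; the key names inside the files are not shown here), the per-podcast pickle files and the utterance-level all_data.json can be loaded like this:

import json

# Per-podcast metadata, e.g., the VAD output for one episode.
vad_data = load_pickle('NaturalVoices/vad/MSP-PODCAST_0001')

# Utterance-level predictions for the whole dataset.
with open('NaturalVoices/all_data.json') as f:
    all_data = json.load(f)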
The code used to generate the labels is located in pipeline_code. We used three main steps to generate NaturalVoices.
Before running the pipeline code, please update the config.py file with the correct paths (output_path, vad_output_path, etc.) for each output folder, as well as the "auth_key" for pyannote/huggingface.
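As an illustrative sketch (only output_path, vad_output_path, and auth_key are mentioned above; any other names and all path values are assumptions), config.py might look like this:

# config.py -- placeholder paths; point these at your own folders.
output_path = '/data/NaturalVoices/output'
vad_output_path = '/data/NaturalVoices/vad'

# Hugging Face access token used by PyAnnote; gated models such as
# pyannote/speaker-diarization require accepting their terms on the Hub.
auth_key = 'hf_xxxxxxxxxxxxxxxx'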
- Run the podcast-level code
    - This includes models that predict on the whole audio files
    - faster_whisper, pyannote_diarization.ipynb, vad.ipynb
- Create the utterances (see the sketch after this list)
    - This step uses the segments from Faster-Whisper to define the utterances
    - generate_utt
- Run the utterance-level code
    - This step contains all remaining predictions
    - age_gender, emotional_attributes, emotional_categories, gender, SNR, event_classification, speech_music
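As a minimal sketch of the first two steps (the model size and the output fields are assumptions; faster_whisper and generate_utt in pipeline_code are the authoritative implementations), Faster-Whisper can transcribe a podcast and its segments can then be turned into utterance boundaries like this:

from faster_whisper import WhisperModel

# Podcast-level step: transcribe one full episode.
model = WhisperModel('large-v2', device='cuda', compute_type='float16')
segments, info = model.transcribe('MSP-PODCAST_0001.wav')

# Utterance-creation step: one utterance per Whisper segment.
utterances = [
    {'start': seg.start, 'end': seg.end, 'text': seg.text.strip()}
    for seg in segments
]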
The following checklist tracks the dataset release and remaining tasks:
- Upload 16 kHz raw audio
- Upload ASR output (Faster-Whisper)
- Upload diarization output (PyAnnote)
- Upload voice activity detection output (PyAnnote)
- Upload speaker overlap output (PyAnnote)
- Upload gender and age info
- Upload signal-to-noise ratio (SNR)
- Upload categorical and attribute-based emotion predictions
- Upload sound event predictions
- Upload the pipeline code
- Change from WAV to FLAC to save space
- Upload audios at the original sampling rate
- Upload newly collected audios
- Upload copyright information for the audios
- Overlap sections with MSP-PODCAST and MSP-Conversation
To cite this work, please use the following BibTeX entry:
@InProceedings{Salman_2024,
    author = {A. N. Salman and Z. Du and S. S. Chandra and I. R. Ulgen and C. Busso and B. Sisman},
    title = {Towards Naturalistic Voice Conversion: NaturalVoices Dataset with an Automatic Processing Pipeline},
    booktitle = {Interspeech 2024},
    year = {2024},
    month = {September},
    address = {Kos Island, Greece},
}