This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
- Streamable high-fidelity audio codec for 48 kHz mono speech with 12.8 kbps bitrate.
- Very low decoding latency on GPU (~6 ms) and CPU (~10 ms) with 4 threads.
- Efficient two-stage training (with the pre-trained models, training an encoder for new applications takes only a few hours)
A good audio codec for live applications such as telecommunication is characterized by three key properties: (1) compression, i.e. the bitrate that is required to transmit the signal should be as low as possible; (2) latency, i.e. encoding and decoding the signal needs to be fast enough to enable communication without or with only minimal noticeable delay; and (3) reconstruction quality of the signal. In this work, we propose an open-source, streamable, and real-time neural audio codec that achieves strong performance along all three axes: it can reconstruct highly natural sounding 48 kHz speech signals while operating at only 12 kbps and running with less than 6 ms (GPU)/10 ms (CPU) latency. An efficient training paradigm is also demonstrated for developing such neural audio codecs for real-world scenarios. [paper] [demo]
- AutoEncoder (symmetric AudioDec, symAD)
1-1. Train an AutoEncoder-based codec model from scratch with only metric loss(es) for the first 200k iterations.
1-2. Fix the encoder, projector, quantizer, and codebook, and train the decoder with the discriminators for the following 500k iterations. - AutoEncoder + Vocoder (AD v0,1,2) (recommended!)
2-1. Extract the stats (global mean and variance) of the codes extracted by the trained Encoder.
2-2. Train the vocoder with the trained Encoder and stats for 500k iterations.
- 2024/01/03: Update pre-trained models (issue9 and issue11)
- 2023/05/17: Upload the demo sounds on the demo page
- 2023/05/13: 1st version is released
This repository is tested on Ubuntu 20.04 using a V100 and the following settings.
- Python 3.8+
- Cuda 11.0+
- PyTorch 1.10+
- bin: The folder of training, stats extraction, testing, and streaming templates.
- config: The folder of config files (.yaml).
- dataloader: The source codes of data loading.
- exp: The folder for saving models.
- layers: The source codes of basic neural layers.
- losses: The source codes of losses.
- models: The source codes of models.
- slurmlogs: The folder for saving slurm logs.
- stats: The folder for saving stats.
- trainer: The source codes of trainers.
- utils: The source codes of utils for the demo.
- Please download the whole exp folder and put it in the AudioDec project directory.
- Get the list of all I/O devices
$ python -m sounddevice
- Run the demo
# The LibriTTS model is recommended for arbitrary microphones because of the robustness of microphone channel mismatches.
# Set up the I/O devices according to the list of I/O devices
# w/ GPU
$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model libritts_v1
# w/ CPU
$ python demoStream.py --tx_cuda -1 --rx_cuda -1 --input_device 1 --output_device 4 --model libritts_sym
# The input and out audios will be dumped into input.wav and output.wav
- Please download the whole exp folder and put it in the AudioDec project directory.
- Run the demo
## VCTK 48000Hz models
$ python demoFile.py --model vctk_v1 -i xxx.wav -o ooo.wav
## LibriTTS 24000Hz model
$ python demoFile.py --model libritts_v1 -i xxx.wav -o ooo.wav
- Prepare the training/validation/test utterances and put them in three different folders
ex: corpus/train, corpus/dev, and corpus/test - Modify the paths (ex: /mnt/home/xxx/datasets) in
submit_codec_vctk.sh
config/autoencoder/symAD_vctk_48000_hop300.yaml
config/statistic/symAD_vctk_48000_hop300_clean.yaml
config/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean.yaml - Assign corresponding
analyzer
andstats
in config/statistic/symAD_vctk_48000_hop300_clean.yaml
config/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean.yaml - Follow the usage instructions in submit_codec_vctk.sh to run the training and testing
# stage 0: training autoencoder from scratch
# stage 1: extracting statistics
# stage 2: training vocoder from scratch
# stage 3: testing (symAE)
# stage 4: testing (AE + Vocoder)
# Run stages 0-4
$ bash submit_codec.sh --start 0 --stop 4 \
--autoencoder "autoencoder/symAD_vctk_48000_hop300" \
--statistic "stati/symAD_vctk_48000_hop300_clean" \
--vocoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean"
- Prepare the training/validation/test utterances and modify the paths
- Follow the usage instructions in submit_autoencoder.sh to run the training and testing
# Train AutoEncoder from scratch
$ bash submit_autoencoder.sh --stage 0 \
--tag_name "autoencoder/symAD_vctk_48000_hop300"
# Resume AutoEncoder from previous iterations
$ bash submit_autoencoder.sh --stage 1 \
--tag_name "autoencoder/symAD_vctk_48000_hop300" \
--resumepoint 200000
# Test AutoEncoder
$ bash submit_autoencoder.sh --stage 2 \
--tag_name "autoencoder/symAD_vctk_48000_hop300"
--subset "clean_test"
All pre-trained models can be accessed via exp (only the generators are provided).
AutoEncoder | Corpus | Fs | Bitrate | Path |
---|---|---|---|---|
symAD | VCTK | 48 kHz | 24 kbps | exp/autoencoder/symAD_c16_vctk_48000_hop320 |
symAAD | VCTK | 48 kHz | 12.8 kbps | exp/autoencoder/symAAD_vctk_48000_hop300 |
symAD | VCTK | 48 kHz | 12.8 kbps | exp/autoencoder/symAD_vctk_48000_hop300 |
symAD_univ | VCTK | 48 kHz | 12.8 kbps | exp/autoencoder/symADuniv_vctk_48000_hop300 |
symAD | LibriTTS | 24 kHz | 6.4 kbps | exp/autoencoder/symAD_libritts_24000_hop300 |
Vocoder | Corpus | Fs | Path |
---|---|---|---|
AD v0 | VCTK | 48 kHz | exp/vocoder/AudioDec_v0_symAD_vctk_48000_hop300_clean |
AD v1 | VCTK | 48 kHz | exp/vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean |
AD v2 | VCTK | 48 kHz | exp/vocoder/AudioDec_v2_symAD_vctk_48000_hop300_clean |
AD_univ | VCTK | 48 kHz | exp/vocoder/AudioDec_v3_symADuniv_vctk_48000_hop300_clean |
AD v1 | LibriTTS | 24 kHz | exp/vocoder/AudioDec_v1_symAD_libritts_24000_hop300_clean |
- It is easy to perform denoising by just updating the encoder using noisy-clean pairs (keeping the decoder/vocoder the same).
- Prepare the noisy-clean corpus and follow the usage instructions in submit_denoise.sh to run the training and testing
# Update the Encoder for denoising
$ bash submit_denoise.sh --stage 0 \
--tag_name "denoise/symAD_vctk_48000_hop300"
# Denoise
$ bash submit_denoise.sh --stage 2 \
--encoder "denoise/symAD_vctk_48000_hop300"
--decoder "vocoder/AudioDec_v1_symAD_vctk_48000_hop300_clean"
--encoder_checkpoint 200000
--decoder_checkpoint 500000
--subset "noisy_test"
# Stream demo w/ GPU
$ python demoStream.py --tx_cuda 0 --rx_cuda 0 --input_device 1 --output_device 4 --model vctk_denoise
# Codec demo w/ files
$ python demoFile.py -i xxx.wav -o ooo.wav --model vctk_denoise
If you find the code helpful, please cite the following article.
@INPROCEEDINGS{10096509,
author={Wu, Yi-Chiao and Gebru, Israel D. and Marković, Dejan and Richard, Alexander},
booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={{A}udio{D}ec: An Open-Source Streaming High-Fidelity Neural Audio Codec},
year={2023},
doi={10.1109/ICASSP49357.2023.10096509}}
The AudioDec repository is developed based on the following repositories.
- kan-bayashi/ParallelWaveGAN
- r9y9/wavenet_vocoder
- jik876/hifi-gan
- lucidrains/vector-quantize-pytorch
- chomeyama/SiFiGAN
The majority of "AudioDec: An Open-source Streaming High-fidelity Neural Audio Codec" is licensed under CC-BY-NC, however, portions of the project are available under separate license terms: https://github.com/kan-bayashi/ParallelWaveGAN, https://github.com/lucidrains/vector-quantize-pytorch, https://github.com/jik876/hifi-gan, https://github.com/r9y9/wavenet_vocoder, and https://github.com/chomeyama/SiFiGAN are licensed under the MIT license.
- Have you compared AudioDec with Encodec?
Please refer to the discussion. - Have you compared AudioDec with other non-neural-network codecs such as Opus?
Since this paper focuses on providing a well-developed streamable neural codec implementation with an efficient training paradigm and modularized architecture, we only compared AudioDec with SoundStream. - Can you also release the pre-trained discriminators?
For many applications such as denoising, updating only the encoder achieves almost the same performance as updating the whole model. For applications involving decoder updating such as binaural rending, it might be better to design specific discriminators for that application. Therefore, we release only the generators. - Can AudioDec encode/decode multi-channel signals?
Yes, you can train a MIMO model by changing the input_channels and output_channels in the config. One lesson I learned in training a MIMO model is that although the generator is MIMO, reshaping the generator output signal to mono for the following discriminator will markedly improve the MIMO audio quality.