Stars
Speech enhancement using Wiener filtering and pitch-synchronous STFT phase reconstruction
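As a rough sketch of the spectral Wiener filtering idea this entry names (the function name, STFT settings, and the noise-PSD estimate are illustrative assumptions, not this repo's API — the pitch-synchronous phase reconstruction step is not shown):

```python
import numpy as np
from scipy.signal import stft, istft

def wiener_enhance(noisy, noise_psd, fs=16000, nperseg=512):
    """Single-channel spectral Wiener filter (illustrative sketch).

    noisy: 1-D time-domain signal.
    noise_psd: per-frequency-bin noise power estimate, e.g. averaged
    over a noise-only segment analyzed with the same STFT settings.
    """
    _, _, X = stft(noisy, fs=fs, nperseg=nperseg)
    # Crude speech-power estimate via spectral subtraction, floored to stay positive.
    sig_psd = np.maximum(np.abs(X) ** 2 - noise_psd[:, None], 1e-10)
    gain = sig_psd / (sig_psd + noise_psd[:, None])  # Wiener gain H = S / (S + N)
    _, enhanced = istft(gain * X, fs=fs, nperseg=nperseg)
    return enhanced
```

The gain is applied to the noisy STFT and the signal is resynthesized with the overlap-add inverse STFT; a real system would track the noise PSD over time rather than assume it fixed.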
The MOS system combines components from DNSMOS, NISQA, MOSSSL, and SIGMOS, using the librosa library to process audio waveforms.
Denoising Diffusion Probabilistic Models
Score-based Generative Models (Diffusion Models) for Speech Enhancement and Dereverberation
DiffWave is a fast, high-quality neural vocoder and waveform synthesizer.
State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.
chazo1994 / Amphion
Forked from open-mmlab/Amphion. Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation.
An Open-source Streaming High-fidelity Neural Audio Codec
HeCheng0625 / Amphion
Forked from open-mmlab/Amphion.
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation.
FunCodec is a research-oriented toolkit for audio quantization and downstream applications such as text-to-speech synthesis and music generation.
High-Resolution Image Synthesis with Latent Diffusion Models
A latent text-to-image diffusion model
Audio generation using diffusion models, in PyTorch.
A fully working PyTorch implementation of NaturalSpeech (Tan et al., 2022)
Implementation of NaturalSpeech 2, Zero-shot Speech and Singing Synthesizer, in PyTorch
iSTFTNet: Fast and Lightweight Mel-spectrogram Vocoder Incorporating Inverse Short-time Fourier Transform
Source code for "Taming Visually Guided Sound Generation" (Oral at the BMVC 2021)
On Variational Learning of Controllable Representations for Text without Supervision https://arxiv.org/abs/1905.11975
Audio Generation model working with GPT-2 and VQVAE compressed representation of MelSpectrograms
Automatic Speaker Recognition (ASR) system using Mel Frequency Cepstral Coefficients (MFCCs) and Vector Quantization (VQ)
Front-end speech processing aims at extracting proper features from short-term segments of a speech utterance, known as frames. It is a prerequisite step toward any pattern recognition problem em…
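The framing step this entry describes can be sketched in a few lines of NumPy (the frame length and hop below assume typical 25 ms / 10 ms windows at 16 kHz, not values taken from this repo):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping short-term frames.

    Returns an array of shape (n_frames, frame_len); any tail samples
    that do not fill a whole frame are dropped. Each frame is tapered
    with a Hamming window before further analysis (e.g. MFCC extraction).
    """
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(frame_len)
```

Each row of the result is one frame, ready for a per-frame feature extractor.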
Implementation of "MOSNet: Deep Learning based Objective Assessment for Voice Conversion"