Audio.md

CNN-based audio segmentation (music/speech/gender) toolkit https://github.com/ina-foss/inaSpeechSegmenter

wav2letter: Facebook's open-source state-of-the-art CNN speech recognition (5% word error rate, very fast training) https://github.com/facebookresearch/wav2letter
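
The word error rate (WER) quoted above is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch (`wer` is an illustrative helper, not part of wav2letter):

```python
def wer(ref, hyp):
    """Word error rate: word-level edit distance / reference word count."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / max(len(r), 1)
```

For example, `wer("the cat sat on the mat", "the cat sit on mat")` gives 2/6: one substitution and one deletion against six reference words.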

xdecoder: lightweight speech-recognition decoding framework https://github.com/robin1001/xdecoder

Espresso: a fast end-to-end neural speech recognition toolkit https://github.com/freewym/espresso

audtorch: PyTorch audio processing tools and datasets https://github.com/audeering/audtorch

AI source separation: Facebook AI's Demucs project helps machines listen to music the way people do https://github.com/facebookresearch/demucs

Youka: karaoke generator based on spleeter source separation https://github.com/youkaclub/youka-desktop

Pitch detection with convolutional networks https://0xfe.blogspot.com/2020/02/pitch-detection-with-convolutional.html?m=1
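
The classical non-neural baseline that CNN pitch detectors are measured against is picking the autocorrelation peak within a plausible lag range. A minimal NumPy sketch on a synthetic tone (`estimate_pitch` is an illustrative helper):

```python
import numpy as np

def estimate_pitch(x, sr, fmin=80.0, fmax=1000.0):
    """Estimate the fundamental frequency via the autocorrelation peak."""
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]  # lags 0..N-1
    lo, hi = int(sr / fmax), int(sr / fmin)            # lags covering [fmin, fmax]
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(2048) / sr
f0 = estimate_pitch(np.sin(2 * np.pi * 220 * t), sr)  # close to 220 Hz
```

Resolution is limited to integer lags, which is one reason learned detectors and interpolation schemes do better on real audio.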

Implementation of "FastSpeech: Fast, Robust and Controllable Text to Speech" https://github.com/Deepest-Project/FastSpeech

A practical guide to Kaldi training on the aidatatang_200zh corpus https://github.com/datatang-ailab/aidatatang_200zh/blob/master/README.zh.md

'LightSpeech - A Light, Fast and Robust Speech Synthesis.' https://github.com/xcmyz/lightspeech

DeepSpectrum: audio feature extraction toolkit based on pretrained image CNNs https://github.com/DeepSpectrum/DeepSpectrum

Landmark audio fingerprinting https://github.com/dpwe/audfprint
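
Landmark fingerprinting of the kind audfprint implements hashes pairs of spectral peaks into (f1, f2, Δt) triples keyed to the anchor time. A simplified sketch of just the pairing step (peak extraction omitted; `landmark_hashes` is a hypothetical name, not audfprint's API):

```python
def landmark_hashes(peaks, fan_out=3):
    """Pair each spectral peak (time, freq) with the next few peaks,
    producing ((f1, f2, dt), anchor_time) hashes, Shazam-style."""
    peaks = sorted(peaks)
    hashes = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            hashes.append(((f1, f2, t2 - t1), t1))
    return hashes
```

Because each hash encodes relative time and frequency, matching is robust to where in the recording the query starts.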

Neural-network speaker recognition/verification system based on Kaldi/TensorFlow https://github.com/mycrazycracy/tf-kaldi-speaker

'patter - speech-to-text framework in PyTorch with initial support for the DeepSpeech2 architecture' https://github.com/ryanleary/patter

Curated list of speaker diarization resources https://github.com/wq2012/awesome-diarization

Audio samples from ICML2019 "Almost Unsupervised Text to Speech and Automatic Speech Recognition" https://github.com/SpeechResearch/speechresearch.github.io

(PyTorch) Seq2Seq Transformer speech recognition for Mandarin https://github.com/ZhengkunTian/Speech-Tranformer-Pytorch

Deep neural network based speech enhancement toolkit https://github.com/jtkim-kaist/Speech-enhancement

Pretrained deep models for music audio tagging https://github.com/jordipons/musiCNN

End-to-End Automatic Speech Recognition on PyTorch https://github.com/gentaiscool/end2end-asr-pytorch

(PyTorch) source separation / speech signal extraction https://github.com/AppleHolic/source_separation

Code and models for evaluating a state-of-the-art lip reading network https://github.com/afourast/deep_lip_reading

Voice mimicry: clone any voice in 5 seconds, in real time https://github.com/CorentinJ/Real-Time-Voice-Cloning

Program to benchmark various speech recognition APIs https://github.com/Franck-Dernoncourt/ASR_benchmark

Transformer-based TTS speech synthesis model https://github.com/xcmyz/Transformer-TTS

DIY smart speaker (resource list) https://github.com/voice-engine/make-a-smart-speaker/blob/master/zh.md

Real-time voice cloning with deep learning https://towardsdatascience.com/you-can-now-speak-using-someone-elses-voice-with-deep-learning-8be24368fa2b

Isolating instruments from stereo music using convolutional neural networks https://towardsdatascience.com/audio-ai-isolating-instruments-from-stereo-music-using-convolutional-neural-networks-584ababf69de

Isolating vocals from stereo music using convolutional neural networks https://towardsdatascience.com/audio-ai-isolating-vocals-from-stereo-music-using-convolutional-neural-networks-210532383785

YodaOS: open-source voice interaction OS for next-generation devices https://github.com/yodaos-project/yodaos

Laughter detector https://github.com/ideo/LaughDetection

'ASRT_SpeechRecognition - A Deep-Learning-Based Chinese Speech Recognition System' by nl8590687 https://github.com/nl8590687/ASRT_SpeechRecognition

A Pytorch Implementation of "Neural Speech Synthesis with Transformer Network" https://github.com/soobinseo/Transformer-TTS

This is research-code for Synthesizing Obama: Learning Lip Sync from Audio. https://github.com/supasorn/synthesizing_obama_network_training

Voice Operated Character Animation https://voca.is.tue.mpg.de/en https://github.com/TimoBolkart/voca

Deezer's (TensorFlow) source separation library; extracts vocals, piano, drums, etc. from music straight from the command line https://github.com/deezer/spleeter

Open-source speech separation/enhancement library https://github.com/speechLabBcCuny/onssen

Feature extractor for DL speech processing. https://github.com/bepierre/SpeechVGG

Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data https://github.com/KunZhou9646/Nonparallel-emotional-VC

This is a PyTorch re-implementation of Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. https://github.com/foamliu/Speech-Transformer

Athena: open-source end-to-end speech recognition engine https://github.com/athena-team/athena

PREDICTING EXPRESSIVE SPEAKING STYLE FROM TEXT IN END-TO-END SPEECH SYNTHESIS https://github.com/Yangyangii/TPGST-Tacotron

PyTorch implementation of LF-MMI for End-to-end ASR https://github.com/YiwenShaoStephen/pychain

Audio samples from ICML2019 "Almost Unsupervised Text to Speech and Automatic Speech Recognition" https://github.com/RayeRen/unsuper_tts_asr

Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention. https://github.com/CSTR-Edinburgh/ophelia

Efficient neural speech synthesis https://github.com/MlWoo/LPCNet

Code for Vision-Infused Deep Audio Inpainting (ICCV 2019) https://github.com/Hangz-nju-cuhk/Vision-Infused-Audio-Inpainter-VIAI

deep learning based speech enhancement using keras or pytorch https://github.com/yongxuUSTC/sednn

Multi-voice singing voice synthesis https://github.com/MTG/WGANSing

"Singing" doodles: synthesizing sketched images into sound https://github.com/jeonghopark/SketchSynth-Simple

Chinese/English pronunciation lexicon for speech recognition https://github.com/speech-io/BigCiDian

Neural-network speaker verification system implemented with Kaldi/TensorFlow https://github.com/someonefighting/tf-kaldi-speaker-master

Facebook's open-source low-latency online speech recognition framework, wav2letter https://github.com/facebookresearch/wav2letter/wiki/Inference-Framework

GridSound: online digital audio editor https://github.com/GridSound/daw

Asteroid: PyTorch-based source separation toolkit https://github.com/mpariente/ASSteroid

MelGAN: very fast audio synthesis https://github.com/descriptinc/melgan-neurips
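
Vocoders like MelGAN take mel spectrograms as input; the mel scale itself is just a logarithmic compression of frequency. A minimal sketch using the HTK-style constants (an assumption; other conventions exist):

```python
import math

def hz_to_mel(f):
    """HTK mel scale: mel = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
```

By design, 1000 Hz maps to roughly 1000 mel, and the scale grows much more slowly than linearly above that, mirroring human pitch perception.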

Generating piano music with deep learning https://github.com/haryoa/note_music_generator

Big list of datasets for audio analysis / music information retrieval https://www.audiocontentanalysis.org/data-sets/

Deploying real-time text-to-speech applications on GPUs using TensorRT https://devblogs.nvidia.com/how-to-deploy-real-time-text-to-speech-applications-on-gpus-using-tensorrt/

Using WaveNet to reunite speech-impaired users with their original voices (few-shot adaptive natural speech synthesis) https://deepmind.com/blog/article/Using-WaveNet-technology-to-reunite-speech-impaired-users-with-their-original-voices

(C++) audio waveform image generation https://github.com/bbc/audiowaveform
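
Waveform renderers like audiowaveform work by reducing the audio to per-pixel (min, max) pairs. A minimal sketch of that reduction (`waveform_peaks` is a hypothetical helper, not audiowaveform's API):

```python
def waveform_peaks(samples, buckets):
    """Downsample audio to per-bucket (min, max) pairs for waveform rendering."""
    n = len(samples)
    peaks = []
    for b in range(buckets):
        # evenly partition the samples into `buckets` chunks
        chunk = samples[b * n // buckets:(b + 1) * n // buckets]
        peaks.append((min(chunk), max(chunk)))
    return peaks
```

Keeping both the minimum and maximum per bucket preserves the visual envelope of the signal even at extreme zoom-out levels.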

Detecting deepfake voice conversion with temporal convolutions https://github.com/dessa-public/fake-voice-detection

Athena: open-source (TensorFlow) end-to-end automatic speech recognition engine https://github.com/didi/athena

SV2TTS https://github.com/CorentinJ/Real-Time-Voice-Cloning

《How to Build Domain Specific Automatic Speech Recognition Models on GPUs》 https://devblogs.nvidia.com/how-to-build-domain-specific-automatic-speech-recognition-models-on-gpus/

Introduction to (audio) digital signal processing (notebooks) https://github.com/earthspecies/from_zero_to_DSP

'at16k - Trained models for automatic speech recognition (ASR). A library to quickly build applications that require speech to text conversion.' https://github.com/at16k/at16k

nnAudio: audio processing with 1-D convolutional networks https://github.com/KinWaiCheuk/nnAudio

CAT: CRF-based data-efficient end-to-end speech recognition toolkit https://github.com/thu-spmi/CAT

'Music Source Separation in the Waveform Domain - source separation in the waveform domain for music' https://github.com/facebookresearch/demucs

'Realtime_PyAudio_FFT - Realtime audio analysis in Python, using PyAudio and Numpy to extract and visualize FFT features from streaming audio.' https://github.com/tr1pzz/Realtime_PyAudio_FFT
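
Tools like Realtime_PyAudio_FFT boil down to taking an FFT of each incoming frame and visualizing the magnitudes. A minimal sketch of the core step, using NumPy only (the PyAudio streaming side is omitted):

```python
import numpy as np

def dominant_frequency(frame, sr):
    """Return the frequency of the largest FFT magnitude bin in a frame."""
    windowed = frame * np.hanning(len(frame))          # reduce spectral leakage
    mags = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    return freqs[np.argmax(mags)]

sr, n = 16000, 1024
t = np.arange(n) / sr
f = dominant_frequency(np.sin(2 * np.pi * 1000 * t), sr)  # near 1000 Hz
```

Frequency resolution here is sr/n ≈ 15.6 Hz per bin, which is why real-time analyzers trade frame length against update rate.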

Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis https://github.com/NVIDIA/flowtron

Collection of text-to-speech (TTS) papers https://github.com/erogol/TTS-papers

Highly efficient real-time text-to-speech system deployed on CPUs https://ai.facebook.com/blog/a-highly-efficient-real-time-text-to-speech-system-deployed-on-cpus/

Text-to-speech synthesis implemented in TensorFlow 2 https://github.com/as-ideas/TransformerTTS

Curated list of speech enhancement / speech separation / source separation resources https://github.com/Wenzhe-Liu/awesome-speech-enhancement

AudioMass: full-featured web-based audio/waveform editing tool https://github.com/pkalogiros/AudioMass

'CTC-based Automatic Speech Recognition - CTC end-to-end ASR for timit and 863 corpus.' https://github.com/Diamondfan/CTC_pytorch

TensorflowTTS: state-of-the-art real-time speech synthesis in TensorFlow 2 https://github.com/dathudeptrai/TensorflowTTS

audino: audio annotation tool for voice activity detection, diarization, speaker identification, automatic speech recognition, emotion recognition, and more https://github.com/midas-research/audino

Deep-learning voice activity detection https://github.com/filippogiruzzi/voice_activity_detection
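
Neural VAD models like the one above replace the classic energy-threshold baseline. For reference, that baseline fits in a few lines (frame length and threshold values here are arbitrary choices for the synthetic example):

```python
import numpy as np

def energy_vad(x, frame_len=400, threshold=0.01):
    """Flag each frame as speech (True) when its mean energy exceeds a threshold."""
    n_frames = len(x) // frame_len
    frames = x[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return energy > threshold

sr = 16000
silence = np.zeros(sr // 2)                                   # 0.5 s of silence
tone = 0.5 * np.sin(2 * np.pi * 300 * np.arange(sr // 2) / sr)  # 0.5 s "speech"
flags = energy_vad(np.concatenate([silence, tone]))
```

The baseline fails on noisy recordings where noise energy rivals speech energy, which is exactly the regime learned detectors target.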

Train speech recognition systems quickly with Kaldi https://github.com/JRMeyer/easy-kaldi

Separate vocals, score, and individual instruments from a song's mp3 and convert them to a symbolic representation https://github.com/deezer/spleeter

'aukit - speech toolbox with modules for noise reduction, audio format conversion, spectrogram generation, and more' https://github.com/KuangDD/aukit

Keras example: 《Speaker Recognition》 https://keras.io/examples/audio/speaker_recognition_using_cnn/

RNN-Transducer based online speech recognition system https://github.com/theblackcat102/Online-Speech-Recognition

'TacotronV2 + WaveRNN - Chinese speech synthesis with TacotronV2 + WaveRNN (TensorFlow + PyTorch)' https://github.com/lturing/tacotronv2_wavernn_chinese

miniaudio: single-file audio playback/capture library in C https://github.com/dr-soft/miniaudio https://github.com/irmen/pyminiaudio

TiramisuASR: speech recognition engine implemented in TensorFlow 2 https://github.com/usimarit/TiramisuASR

A PyTorch implementation of dual-path RNNs (DPRNNs) based speech separation described in "Dual-path RNN: efficient long sequence modeling for time-domain single-channel speech separation". https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation

Zero-Shot Multi-Speaker Text-To-Speech with State-of-the-art Neural Speaker Embeddings https://github.com/nii-yamagishilab/multi-speaker-tacotron

Code repo for ICME 2020 paper "Style-Conditioned Music Generation". VAE model that allows style-conditioned music generation. https://github.com/daQuincy/DeepMusicvStyle

streaming attention networks for end-to-end automatic speech recognition https://github.com/HaoranMiao/streaming-attention

[InterSpeech 2020] "AutoSpeech: Neural Architecture Search for Speaker Recognition" https://github.com/TAMU-VITA/AutoSpeech

Pytorch implementation of sparse_image_warp and an example of GoogleBrain's SpecAugment is given: A Simple Data Augmentation Method for Automatic Speech Recognition https://github.com/bobchennan/sparse_image_warp_pytorch
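
SpecAugment's core idea is zeroing out random frequency and time bands of the spectrogram during training. A minimal NumPy sketch with one mask per axis (mask widths are arbitrary; the paper applies multiple masks plus time warping):

```python
import numpy as np

def spec_augment(spec, freq_mask=8, time_mask=10, rng=None):
    """Zero out one random frequency band and one random time band."""
    rng = rng or np.random.default_rng(0)
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f0 = rng.integers(0, n_mels - freq_mask + 1)   # start of frequency mask
    t0 = rng.integers(0, n_frames - time_mask + 1)  # start of time mask
    spec[f0:f0 + freq_mask, :] = 0.0
    spec[:, t0:t0 + time_mask] = 0.0
    return spec

masked = spec_augment(np.ones((80, 100)))  # 80 mel bins x 100 frames
```

Because the masking operates on features rather than raw audio, it adds essentially no cost to the training pipeline.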

A Generative Flow for Text-to-Speech via Monotonic Alignment Search https://github.com/jaywalnut310/glow-tts

DeCoAR (self-supervised contextual representations for speech recognition) https://github.com/awslabs/speech-representations

A pytorch implementation of the EATS: End-to-End Adversarial Text-to-Speech https://github.com/yanggeng1995/EATS

Companion repository for the paper "A Comparison of Metric Learning Loss Functions for End-to-End Speaker Verification" https://github.com/juanmc2005/SpeakerEmbeddingLossComparison

Melody extraction using joint detection and classification network https://github.com/keums/melodyExtraction_JDC

Implementation of the AlignTTS https://github.com/Deepest-Project/AlignTTS

An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech" https://github.com/ming024/FastSpeech2

A naive implementation of Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech. https://github.com/AppleHolic/multiband_melgan

PyTorch Implementation of FastSpeech 2 : Fast and High-Quality End-to-End Text to Speech https://github.com/rishikksh20/FastSpeech2

GELP: GAN-Excited Linear Prediction https://github.com/ljuvela/GELP

CASR-DEMO (Chinese automatic speech recognition demo) - Flask-based web demo system for Chinese ASR, including speech recognition, speech synthesis, and speaker recognition https://github.com/lihanghang/CASR-DEMO

Unofficial PyTorch implementation of Multi-Band MelGAN paper https://github.com/rishikksh20/melgan

Chinese speech recognition https://github.com/chenmingxiang110/Chinese-automatic-speech-recognition

MASR: end-to-end deep neural network Mandarin speech recognition project https://github.com/nobody132/masr

Plover: open-source cross-platform stenography engine capable of 200+ words per minute http://www.openstenoproject.org/plover/

Source code for "Python Machine Learning Source Separation" https://github.com/masahitotogami/python_source_separation

Portable acoustic fingerprinting library in C https://github.com/JorenSix/Olaf

SpeedySpeech: teacher-student network for high-quality real-time speech synthesis https://github.com/janvainer/speedyspeech

Collection of pretrained audio/speech models https://github.com/balavenkatesh3322/audio-pretrained-model

Piano transcription: tool for transcribing piano recordings to MIDI files https://arxiv.org/abs/2010.01815 https://github.com/bytedance/piano_transcription

CorentinJ/Real-Time-Voice-Cloning https://github.com/KuangDD/zhrtvc

Industrial speech recognition paper collection (streaming ASR / non-autoregressive ASR / WFST-based ASR ...) https://github.com/xingchensong/speech-recognition-papers

pyttsx3: offline text-to-speech library for Python https://github.com/nateshmbhat/pyttsx3

TensorflowASR: state-of-the-art speech recognition in TensorFlow 2 https://github.com/Z-yq/TensorflowASR

Second-place solution to the Cornell Birdcall Identification competition https://github.com/vlomme/Birdcall-Identification-competition

micmon: Python library for building audio datasets from segmented raw audio streams and training sound detection models https://github.com/BlackLight/micmon

LibreASR: out-of-the-box streaming speech recognition system (built on PyTorch & fastai) https://github.com/iceychris/LibreASR

Voicenet: comprehensive Python library for speech and audio processing https://github.com/Robofied/Voicenet

musicpy: music programming language for writing music through music-theory logic with concise syntax https://github.com/Rainbow-Dreamer/musicpy

SOVA ASR: fast speech recognition API based on the Wav2Letter architecture https://github.com/sovaai/sova-asr

ZhTTS: open-source end-to-end real-time Chinese speech synthesis system on CPU https://github.com/Jackiexiao/zhtts

SeeWav: audio waveform visualization package https://github.com/adefossez/seewav

Must-read papers on neural speech separation https://github.com/JusperLee/Speech-Separation-Paper-Tutorial

PIKA: lightweight speech processing toolkit based on PyTorch and (Py)Kaldi https://github.com/tencent-ailab/pika

ESP-Skainet: smart voice assistant supporting wake-word and command recognition https://github.com/espressif/esp-skainet

Collection of TTS models for TensorFlow Lite (TFLite) https://github.com/tulasiram58827/TTS_TFLite

Elpis (Accelerated Transcription): speech recognition model creation tool, under development https://github.com/CoEDL/elpis

MusicNet: annotated classical music dataset (330+ recordings) with the precise timing of every note, the instrument playing it, and its position in the piece's metrical structure https://homes.cs.washington.edu/~thickstn/musicnet.html

Generating music with AI https://alxmamaev.medium.com/generating-music-with-ai-or-transformers-go-brrrr-3a3ac5a04126

Real Time Speech Enhancement in the Waveform Domain (Interspeech 2020) - a PyTorch implementation of the paper https://github.com/facebookresearch/denoiser

Implementation of MelNet in PyTorch to generate high-fidelity audio samples https://github.com/jgarciapueyo/MelNet-SpeechGeneration

PPSpeech: Phrase based Parallel End-to-End TTS System https://github.com/rishikksh20/PPSpeech

Implementation of Phase-aware speech enhancement with deep complex U-Net https://github.com/mhlevgen/DCUNetTorchSound

Tensorflow 2.0 implementation of the paper: A Fully Convolutional Neural Network for Speech Enhancement https://github.com/daitan-innovation/cnn-audio-denoiser

Voice conversion by CycleGAN (voice cloning / voice conversion): CycleGAN-VC2 https://github.com/jackaduma/CycleGAN-VC2

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis https://github.com/jik876/hifi-gan

Real-Time High-Fidelity Speech Synthesis without GPU https://github.com/BogiHsu/WG-WaveNet

Official PyTorch implementation of Speaker Conditional WaveRNN https://github.com/dipjyoti92/SC-WaveRNN

Pytorch implementation of "Efficienttts: an efficient and high-quality text-to-speech architecture" https://github.com/liusongxiang/efficient_tts

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis https://github.com/rishikksh20/HiFi-GAN

An unofficial implementation of the paper "One-shot Voice Conversion by Separating Speaker and Content Representations with Instance Normalization". https://github.com/cyhuang-tw/AdaIN-VC

TFGAN: Time and Frequency Domain Based Generative Adversarial Network for High-fidelity Speech Synthesis https://github.com/rishikksh20/TFGAN

Efficient neural networks for analog audio effect modeling https://github.com/csteinmetz1/micro-tcn

End-to-End Multi-Channel Transformer for Speech Recognition https://arxiv.org/abs/2102.03951

Hugging Face Transformers v4.3.0 released, adding Facebook's Wav2Vec2 automatic speech recognition model to the model hub https://huggingface.co/facebook/wav2vec2-base-960h

LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search https://arxiv.org/abs/2102.04040

End-to-end Audio-visual Speech Recognition with Conformers https://arxiv.org/abs/2102.06657

Memory-efficient Speech Recognition on Smart Devices https://arxiv.org/abs/2102.11531

Self-supervised speech recognition: a wrapper around the wav2vec 2.0 framework https://github.com/mailong25/self-supervised-speech-recognition

List of resources on automated audio captioning https://github.com/audio-captioning/audio-captioning-resources

PyTorch implementation of "Conformer: Convolution-augmented Transformer for Speech Recognition" (INTERSPEECH 2020)

https://github.com/sooftware/conformer

OpenASR: PyTorch-based end-to-end speech recognition system https://github.com/by2101/OpenASR

Real-time audio spectrogram generation (web-based) https://borismus.github.io/spectrogram/

WavEncoder: raw audio encoding library with a PyTorch backend https://github.com/shangeth/wavencoder

Picovoice: end-to-end platform for building voice products at scale github.com/Picovoice/picovoice

audlib: Python speech signal processing library with a deep learning focus https://github.com/raymondxyy/pyaudlib

Auto-Editor: command-line video/audio editor that automatically cuts out silent sections https://github.com/WyattBlue/auto-editor

The SpeechBrain Toolkit: all-in-one open-source PyTorch speech toolkit for easily building state-of-the-art speech systems, including speech recognition, speaker recognition, speech enhancement, and multi-microphone signal processing github.com/speechbrain/speechbrain

STT: open-source deep learning toolkit for training and deploying speech-to-text models github.com/coqui-ai/STT

GigaSpeech: large, modern dataset for speech recognition github.com/SpeechColab/GigaSpeech

Desed dataset: domestic environment sound event detection dataset and tools github.com/turpaultn/DESED

MASR Chinese speech recognition (PyTorch) - a Chinese speech recognition series; use it to quickly train your own Mandarin recognition model or test the pretrained models directly github.com/binzhouchn/masr

End-to-end speech processing toolkit github.com/espnet/espnet

Spleeter source separation demo github.com/deezer/spleeter

Multi-speaker emotional text-to-speech (TTS) based on Tacotron 2 & WaveGlow github.com/ide8/tacotron2

《AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss》(2019) github.com/cyhuang-tw/AutoVC

Tutorial (Colab): speech recognition from scratch github.com/speechbrain/speechbrain/

Vosk-Browser: speech recognition library that runs in the browser (WebAssembly-based) github.com/ccoreilly/vosk-browser ccoreilly.github.io/vosk-browser/

The SpeechBrain Toolkit: all-in-one open-source PyTorch speech toolkit for easily building state-of-the-art speech systems, including speech recognition, speaker recognition, speech enhancement, and multi-microphone signal processing speechbrain.github.io/

Speech Algorithms: collection of speech algorithms github.com/Ryuk17/SpeechAlgorithms

torchsynth: GPU-enabled, very fast modular audio synthesizer for audio machine learning research; synthesizes audio on GPU 16200x faster than real time (714 MHz) github.com/torchsynth/torchsynth

MevonAI: speech emotion recognition github.com/SuyashMore/MevonAI-Speech-Emotion-Recognition

TensorVox: desktop neural speech synthesis app written in C++ github.com/ZDisket/TensorVox

Music Demixing Challenge - Starter Kit: starter kit for the music source separation challenge github.com/AIcrowd/music-demixing-challenge-starter-kit

LEAF: lightweight embedded audio framework, a C library for audio synthesis and processing github.com/spiricom/LEAF

Word2Wave: text-to-audio generation framework based on WaveGAN and COALA github.com/ilaria-manco/word2wave

Collection of resources on deep-learning-based audio-visual speech enhancement and separation github.com/danmic/av-se

LAS_Mandarin_PyTorch: end-to-end Chinese speech recognition github.com/jackaduma/LAS_Mandarin_PyTorch

Chinese speech recognition implemented with PaddlePaddle github.com/yeyupiaoling/PaddlePaddle-DeepSpeech

DNN-based source separation implemented in PyTorch github.com/tky823/DNN-based_source_separation

Production-grade pretrained multilingual speech-to-text (STT) models with online (Colab) demos https://pytorch.org/hub/snakers4_silero-models_stt/

openspeech: open-source end-to-end speech recognition toolkit built on PyTorch-Lightning and Hydra github.com/sooftware/openspeech

Audio Augmentations: PyTorch audio augmentation library github.com/Spijkervet/torchaudio-augmentations

FRILL: on-device speech representations with TensorFlow Lite https://arxiv.org/abs/2011.04609 https://ai.googleblog.com/2021/06/frill-on-device-speech-representations.html

DeepPhonemizer: Transformer-based grapheme-to-phoneme library for accurate, efficient production text-to-speech systems github.com/as-ideas/DeepPhonemizer

《HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis》

github.com/rishikksh20/multiband-hifigan

《VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech》(2021) github.com/jaywalnut310/vits

Live automatic speech recognition based on wav2vec2 github.com/oliverguhr/wav2vec2-live

CoreAudioML: machine learning library for audio effects processing github.com/Alec-Wright/CoreAudioML

SoundPy: research-oriented Python toolkit for speech and sound github.com/a-n-rose/Python-Sound-Tool

kaldifeat: Kaldi-compatible feature extraction for PyTorch, with CUDA & autograd support github.com/csukuangfj/kaldifeat

ttskit - Text To Speech Toolkit: speech synthesis toolbox offering a choice of voices github.com/KuangDD/ttskit

Chinese Mandarin text to speech (MTTS) based on FastSpeech 2, implemented in PyTorch, using WaveGlow as the vocoder, with the biaobei and aishell3 datasets github.com/ranchlai/mandarin-tts

Common Voice Dataset: open, multilingual speech dataset github.com/common-voice/cv-dataset

A comprehensive survey of the flourishing speech synthesis field https://weibo.com/ttarticle/p/show?id=2309404668701587932163

Larynx: end-to-end text-to-speech system based on gruut & onnx github.com/rhasspy/larynx

Realtime-Voice-Clone-Chinese - AI voice mimicry: clone your voice and generate arbitrary speech github.com/babysor/Realtime-Voice-Clone-Chinese

Speech Emotion Recognition: speech emotion recognition with LSTM, CNN, SVM, and MLP, implemented in Keras github.com/Renovamen/Speech-Emotion-Recognition

ParallelTTS: fast speech synthesis model for English, Mandarin/Chinese, Japanese, Korean, Russian, and Tibetan (tested so far) github.com/atomicoo/ParallelTTS

Neural HMMs are all you need (for high-quality attention-free TTS) https://arxiv.org/abs/2108.13320

praudio: audio preprocessing framework for deep learning audio applications github.com/musikalkemist/praudio

PnG NAT: model for recreating natural voices for people with speech impairments https://ai.googleblog.com/2021/08/recreating-natural-voices-for-people.html

Personalized speech recognition models from large, diverse, disordered speech datasets https://ai.googleblog.com/2021/09/personalized-asr-models-from-large-and.html

SpeechBrain: full-featured speech toolkit providing speech recognition (with Mandarin support), speech enhancement, speech processing, multi-microphone signal processing, and modular customization, plus thorough tutorial documentation for getting started with speech recognition github.com/speechbrain/speechbrain/

Fourth-place solution to the Music Demixing Challenge 2021 source separation competition github.com/yoyololicon/music-demixing-challenge-ismir-2021-entry

Keras example: automatic speech recognition with CTC https://keras.io/examples/audio/ctc_asr/
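
The CTC decoding rule that example relies on (collapse repeated labels, then drop blanks) is easy to sketch on its own:

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Greedy CTC decode: collapse repeats, then remove blank tokens."""
    out, prev = [], None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(i)
        prev = i
    return out
```

Note that a blank between two identical labels separates them, so `[1, 0, 1]` decodes to two 1s while `[1, 1]` decodes to one; this is what lets CTC represent repeated characters.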

mdx-tutorial: open-source tutorial on source separation tools github.com/kuielab/mdx-tutorial

wenet-kws: production-oriented end-to-end keyword spotting (wake word) toolkit github.com/wenet-e2e/wenet-kws

Wav2Vec2 STT Python: speech recognition library based on Wav2Vec 2.0 github.com/daanzu/wav2vec2_stt_python

PPASR streaming and non-streaming speech recognition - end-to-end Chinese speech recognition framework built on PaddlePaddle 2 github.com/yeyupiaoling/PPASR

music2video: automatically generate a video from music with Wav2CLIP and VQGAN-CLIP github.com/joeljang/music2video

'compound-word-transformer-tensorflow - AI music composition implemented in TensorFlow'

MockingBird - AI voice mimicry: clone a voice within 5 seconds and generate arbitrary speech github.com/babysor/MockingBird

Wenet STT Python: Python speech recognition library based on WeNet github.com/daanzu/wenet_stt_python

Spchcat: speech recognition tool for Linux/Raspberry Pi that transcribes audio to text github.com/petewarden/spchcat

Open Audio Search: open-source audio search engine (based on speech recognition) github.com/openaudiosearch/openaudiosearch

Collection of speech recognition resources https://wiki.nikitavoloboev.xyz/nlp/speech-recognition

HuggingSound: toolkit for speech-related tasks based on Hugging Face tools github.com/jonatasgrosman/huggingsound

Muskit: open-source music processing toolkit focused on benchmarking end-to-end singing voice synthesis; uses PyTorch as its deep learning engine and follows ESPnet- and Kaldi-style data processing to provide complete setups for music processing experiments github.com/SJTMusicTeam/Muskits

Neural instrument cloning from very few samples https://erlj.notion.site/Neural-Instrument-Cloning-from-very-few-samples-2cf41d8b630842ee8c7eb55036a1bfd6

PaddleSpeech: open-source speech model library built on PaddlePaddle, covering key speech and audio tasks with a large number of influential, cutting-edge deep learning models github.com/PaddlePaddle/PaddleSpeech

WeNet: production-oriented speech recognition toolkit providing an end-to-end pipeline from model training to deployment github.com/wenet-e2e/wenet

IMS-Toucan: speech synthesis toolkit supporting the latest models github.com/DigitalPhonetics/IMS-Toucan

NeuralSpeech: Microsoft Research Asia project on neural-network-based speech processing, including automatic speech recognition (ASR), text-to-speech (TTS), and more github.com/microsoft/NeuralSpeech

State-of-the-art real-time speech synthesis in TensorFlow 2 github.com/TensorSpeech/TensorflowTTS

Awesome Keyword Spotting: paper list on keyword spotting (wake word detection) github.com/zycv/awesome-keyword-spotting

'ocotillo - A fast, accurate and super simple speech recognition model - Performant and accurate speech recognition built on Pytorch' github.com/neonbjb/ocotillo

libspecbleach: audio noise reduction library in C github.com/lucianodato/libspecbleach

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality - proposes a fully end-to-end text-to-waveform system, the first to achieve human-level quality on the LJSpeech dataset https://arxiv.org/abs/2205.04421

sherpa: Python speech recognition serving framework supporting both streaming and non-streaming recognition github.com/k2-fsa/sherpa

audio-preview: VS Code extension for previewing and playing wav audio files github.com/sukumo28/vscode-audio-preview

'WeNet - Production First and Production Ready End-to-End Speech Recognition Toolkit' - provides an end-to-end pipeline from speech recognition model training to deployment, by the WeNet Open Source Community GitHub: github.com/wenet-e2e/wenet paper: 《WeNet: Production First and Production Ready End-to-End Speech Recognition Toolkit》

'WeTTS - Production First and Production Ready End-to-End Text-to-Speech Toolkit' by the WeNet Open Source Community GitHub: github.com/wenet-e2e/wetts

Audio coding tutorial materials: 'Audio Coding Video Tutorials and Python Notebooks - Audio Coding Notebooks and Tutorials' by Guitars.AI GitHub: github.com/GuitarsAI/AudioCodingTutorials
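
A staple of audio coding tutorials is μ-law companding (as in ITU-T G.711), which compresses amplitudes logarithmically before quantization so that quiet samples keep more resolution. A minimal continuous-domain sketch (the quantization step itself is omitted):

```python
import math

def mulaw_encode(x, mu=255):
    """Compress a sample x in [-1, 1] with mu-law companding."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mulaw_decode(y, mu=255):
    """Expand a companded value y back to the linear domain."""
    return math.copysign((math.pow(1 + mu, abs(y)) - 1) / mu, y)
```

The same companding is also used to shrink the output space of early neural vocoders such as WaveNet (256-way μ-law instead of 65536-way linear PCM).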

'wenet_trt8 - deploying the open-source WeNet speech recognition toolkit with TensorRT 8; a reference for deploying speech recognition models on TRT8' by huismiling GitHub: github.com/huismiling/wenet_trt8

'FastASR - efficient C++ inference for the conformer model used by PaddleSpeech; runs smoothly even on ARM platforms such as the Raspberry Pi 4B' by chenkui164 GitHub: github.com/chenkui164/FastASR

'Open Text to Speech Server - Open Text to Speech Server' - open-source multilingual text-to-speech server, by Michael Hansen GitHub: github.com/synesthesiam/opentts

Open-source tutorial on GitHub: "A Primer on Speech Enhancement", covering speech enhancement techniques and model applications GitHub: github.com/WenzheLiu-Speech/The-guidebook-of-speech-enhancement

'StemRoller - Isolate vocals, drums, bass, and other instrumental stems from any song' - free source separation tool, by StemRoller GitHub: github.com/stemrollerapp/stemroller

'Awesome-Spoken-Language-Identification - An awesome spoken LID repository (work in progress)' by HexinHexin GitHub: github.com/Lhx94As/Awesome-Spoken-Language-Identification

'KAN-TTS - With KAN-TTS you can train your own TTS model from zero to hero' by Alibaba Research GitHub: github.com/AlibabaResearch/KAN-TTS

'Awesome Singing Voice Synthesis and Singing Voice Conversion - A paper and project list about the cutting edge Speech Synthesis, Text-to-Speech (TTS), Singing Voice Synthesis (SVS), Voice Conversion (VC), Singing Voice Conversion (SVC), and related interesting works.' by GYChen GitHub: github.com/guan-yuan/Awesome-Singing-Voice-Synthesis-and-Singing-Voice-Conversion

'VITS+BigVGAN+SpanPSP Chinese TTS - PyTorch-based VITS-BigVGAN Chinese TTS model with an added prosody prediction model' by Zz-ww GitHub: github.com/Zz-ww/VITS-BigVGAN-SpanPSP-Chinese

'Open Speech Corpora - A list of accessible speech corpora for ASR, TTS, and other Speech Technologies' by coqui GitHub: github.com/coqui-ai/open-speech-corpora

(Interspeech 2022 Tutorial) 'Neural Speech Synthesis' by Xu Tan, Hung-yi Lee GitHub: github.com/tts-tutorial/interspeech2022

'Streamlit Custom Component that enables recording audio from the client's mic in apps that are deployed to the web. (via browser Media-API, REACT-based)' by Stefan Rummer GitHub: github.com/stefanrmmr/streamlit_audio_recorder

'sherpa-ncnn - Real-time speech recognition using next-gen Kaldi with ncnn' by k2-fsa GitHub: github.com/k2-fsa/sherpa-ncnn

'MASR streaming and non-streaming speech recognition - PyTorch implementation of a streaming and non-streaming automatic speech recognition framework, supporting both online and offline recognition; currently supports the DeepSpeech2 model and a variety of data augmentation methods' by yeyupiaoling GitHub: github.com/yeyupiaoling/MASR

'streamlit-stt-app - Real time web based Speech-to-Text app with Streamlit' by Yuichiro Tachibana (Tsuchiya) GitHub: github.com/whitphx/streamlit-stt-app

【Whisper:OpenAI开源的通用语音识别模型】’Whisper - a general-purpose speech recognition model’ GitHub: github.com/openai/whisper

【用youtube-dl+OpenAI's Whisper为Youtube视频自动生成字幕】’Automatic YouTube subtitle generation - Using OpenAI's Whisper to automatically generate YouTube subtitles' by Miguel Piedrafita GitHub: github.com/m1guelpf/yt-whisper

基于 Tensorflow 实现的音轨分离工具。可以用于提取音乐中的人声、鼓、钢琴等乐器 https://github.com/deezer/spleeter

基于深度学习的中文语音识别系统 https://github.com/nl8590687/ASRT_SpeechRecognition

【OpenAI Whisper语音识别的简单web演示界面】’openai-whisper-webapp - Code for OpenAI Whisper Web App Demo' by amrrs GitHub: github.com/amrrs/openai-whisper-webapp

【Whispering:基于whisper的流语音转录(字幕生成)】’Whispering - Streaming transcriber with whisper' by shirayu GitHub: github.com/shirayu/whispering

【Whisper ASR Webservice:Whisper语音识别的Webservice】’Whisper ASR Webservice - OpenAI Whisper ASR Webservice API' by Ahmet Oner GitHub: github.com/ahmetoner/whisper-asr-webservice

【Automatic subtitles in your videos:用ffmpeg+OpenAI's Whisper为视频文件自动加字幕】’Automatic subtitles in your videos - Automatically generate and overlay subtitles for any video.' by Miguel Piedrafita GitHub: github.com/m1guelpf/auto-subtitle

【whisper.cpp:OpenAI's Whisper高质量语音识别模块C/C++移植版,无依赖低内存支持CPU跨平台】’whisper.cpp - Port of OpenAI's Whisper model in C/C++' by Georgi Gerganov GitHub: github.com/ggerganov/whisper.cpp

【Sound Synthesis Recipes:C++音频合成代码集】’Sound Synthesis Recipes - Code snippets of sound synthesis algorithms in C++' by Matthijs Hollemans GitHub: github.com/hollance/synth-recipes

[AS]《Hierarchical Diffusion Models for Singing Voice Neural Vocoder》N Takahashi, M Kumar, Singh, Y Mitsufuji [Sony Group Corporation] (2022) https://arxiv.org/abs/2210.07508

【ICASSP2022 TTS&VC Summary:总结了ICASSP2022中TTS和VC相关论文,主要是TTS】'ICASSP2022 TTS&VC Summary - ICASSP2022 TTS&VC Summary' by Liumeng Xue GitHub: github.com/lmxue/ICASSP2022_TTS_VC_Summary

【EnCodec: 高保真神经音频压缩编码器】’EnCodec: High Fidelity Neural Audio Compression - State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.' by Meta Research GitHub: github.com/facebookresearch/encodec

【OpenAI Whisper - CPU:将量化方法应用于 OpenAI Whisper ASR 模型以提高基于CPU部署的推理速度和吞吐量的实验】’OpenAI Whisper - CPU - Improving transcription performance of OpenAI Whisper for CPU based deployment' by MiscellaneousStuff GitHub: github.com/MiscellaneousStuff/openai-whisper-cpu

【FunASR: 基础端到端语音识别工具包】'FunASR: A Fundamental End-to-End Speech Recognition Toolkit’ by Alibaba Damo Academy GitHub: github.com/alibaba-damo-academy/FunASR

【mayavoz:PyTorch语音增强工具包】'mayavoz - Pytorch based speech enhancement toolkit.' by Shahul ES GitHub: github.com/shahules786/mayavoz

【libf0:用于音乐录制中基频估计的Python库】'libf0 - A Python Library for Fundamental Frequency Estimation in Music Recordings' by GroupMM GitHub: github.com/groupmm/libf0

【ASR Corpus Creator:用伪标注创建自动语音识别语料库】’ASR Corpus Creator - This app is intended to automatically create a corpus for ASR systems using pseudo-labeling.' by Yehor Smoliakov GitHub: github.com/egorsmkv/asr-corpus-creator

【WhisperX:强制时间对齐的时间戳精确版Whisper语音识别】’WhisperX - WhisperX: Timestamp-Accurate Automatic Speech Recognition.' by m-bain GitHub: github.com/m-bain/whisperX

【Speech-Editing-Toolkit:集成最新深度学习算法的语音编辑工具箱】’Speech-Editing-Toolkit - It's a repository for implementations of neural speech editing algorithms.' by Jiangzy GitHub: github.com/Zain-Jiang/Speech-Editing-Toolkit

【教程:基于视觉Transformer(ViT)的音频分类(Colab)】《Audio classification with Vision Transformers》 https://colab.research.google.com/drive/1mnArj9S7cij3Ua-dHXoasKWqyNA-GCrT?usp=sharing

【whisperer:基于Whisper的文本-音频数据集构建工具】’whisperer - Go from raw audio files to a text-audio dataset automatically with OpenAI's Whisper.' by Miguel Valente GitHub: github.com/miguelvalente/whisperer

【KAN-TTS:支持中英文的语音合成训练框架】’KAN-TTS - a speech-synthesis training framework' by Alibaba Damo Academy GitHub: github.com/alibaba-damo-academy/KAN-TTS

【Speechbox:语音处理工具包】’Speechbox offers a set of speech processing tools, such as punctuation restoration' by Hugging Face GitHub: github.com/huggingface/speechbox

【Larynx:快速的本地部署神经文本语音合成工具,目前支持英语、德语、丹麦语、挪威语、尼泊尔语、越南语等】’Larynx - A fast, local neural text to speech system' Rhasspy GitHub: github.com/rhasspy/larynx2

'Fish Diffusion - 基于 diff-svc 实现的 TTS / SVS / SVC 的训练框架,用于实现歌声音色转换’ Fish Audio GitHub: github.com/fishaudio/fish-diffusion

【Whisper:用 C++ 重写的 OpenAI's Whisper 语音识别程序的高性能 GPGPU 接口,64-bit Win版,比Pytorch版快一倍多】’Whisper - High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model'

Konstantin GitHub: github.com/Const-me/Whisper

'SoftVC VITS Singing Voice Conversion - 基于vits与softvc的歌声音色转换模型' innnky GitHub: github.com/innnky/so-vits-svc

【音频AI模型进展追踪】’Audio AI Timeline - A timeline of the latest AI models for audio generation, starting in 2023!' archinet GitHub: github.com/archinetai/audio-ai-timeline

【Real Time Whisper Transcription:基于 OpenAI Whisper 的实时语音转录(语音识别)】’Real Time Whisper Transcription - Real time transcription with OpenAI Whisper.' davabase GitHub: github.com/davabase/whisper_real_time

【WaaS - Whisper as a Service:基于 Whisper 的语音转录服务】’WaaS - Whisper as a Service - Whisper as a Service (GUI and API for OpenAI Whisper)' Schibsted GitHub: github.com/schibsted/WAAS

【基于 CTranslate2 的更快的 Whisper 语音转录】’Faster Whisper transcription with CTranslate2 - Faster Whisper transcription with CTranslate2' Guillaume Klein GitHub: github.com/guillaumekln/faster-whisper

【Speaker Diarization Using OpenAI Whisper: speaker diarization on top of OpenAI Whisper】'Speaker Diarization Using OpenAI Whisper - Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper' Mahmoud Ashraf GitHub: github.com/MahmoudAshraf97/whisper-diarization

【Audio Slicer: Python script that slices audio at silent segments】'Audio Slicer - Python script that slices audio with silence detection' Team OpenVPI GitHub: github.com/openvpi/audio-slicer
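The core of silence-based slicing can be sketched in a few lines of plain Python. This is a simplified energy (RMS) version for illustration only, not the openvpi/audio-slicer algorithm (which works on RMS in dB with configurable hop size and keeps silence padding around cuts):

```python
# Hypothetical sketch: frames whose RMS falls below a threshold are
# treated as silence; a run of enough silent frames ends a segment.

def rms(frame):
    return (sum(x * x for x in frame) / len(frame)) ** 0.5

def slice_on_silence(samples, frame_len=4, threshold=0.05, min_silent_frames=2):
    """Return the non-silent segments of `samples` as lists of samples."""
    segments, current, silent_run = [], [], 0
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        if rms(frame) < threshold:
            silent_run += 1
            if silent_run >= min_silent_frames and current:
                segments.append(current)  # long silence: close the segment
                current = []
        else:
            silent_run = 0
            current.extend(frame)
    if current:
        segments.append(current)
    return segments
```

Note that this sketch simply drops silent frames from the output; a production slicer keeps short pauses inside a segment and pads the cut points.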

【transcribe-anything: Whisper-based transcription service】'transcribe-anything - Input a local file or url and this service will transcribe it using Whisper AI' Zachary Vorhies GitHub: github.com/zackees/transcribe-anything

【audioFlux: library for audio/music analysis and feature extraction】'audioFlux - A library for audio and music analysis, feature extraction.' audioFlux GitHub: github.com/libAudioFlux/audioFlux

【Whisper OpenVINO: faster Whisper speech transcription on OpenVINO】'Whisper OpenVINO - openvino version of openai/whisper' Zilin Zhu GitHub: github.com/zhuzilin/whisper-openvino

【Curated list of resources on simultaneous translation (text-to-text and speech-to-text)】'Awesome Simultaneous Translation - Paper list of simultaneous translation, including text-to-text machine translation and speech-to-text translation.' ZhangShaolei1998 GitHub: github.com/Vily1998/Awesome-Simultaneous-Translation

【Decipher: automatically subtitle videos with Whisper】'Decipher - Effortlessly add AI-generated transcription subtitles to your videos' dsymbol GitHub: github.com/dsymbol/deciphe

【whisper-timestamped: multilingual automatic speech recognition (ASR) tool based on openai-whisper; converts audio files to text with a timestamp and confidence score for every word】'whisper-timestamped - Multilingual Automatic Speech Recognition with word-level timestamps and confidence' linto.ai GitHub: github.com/linto-ai/whisper-timestamped

【Subs AI: subtitle generation tool based on Whisper and its variants】'Subs AI - Subtitles generation tool (Web-UI + CLI + Python package) powered by OpenAI's Whisper and its variants' Abdeladim Sadiki GitHub: github.com/abdeladim-s/subsa

【SpeechGPT: free, open-source web app for voice conversations with ChatGPT; supports 100+ languages, with solid privacy protection plus speech recognition and speech synthesis】'SpeechGPT - a web application that enables you to converse with ChatGPT.' Xi GitHub: github.com/hahahumble/speechgpt

【Transcriber: real-time speech-to-text transcription app built with Flet and OpenAI Whisper】'Transcriber - Real time speech to text transcription app.' davabase GitHub: github.com/davabase/transcriber_app

【Whispering Tiger (Live Translate/Transcribe): free, open-source tool that listens to or watches any audio stream or game image on your machine and outputs the transcription or translation to a web browser via WebSockets or OSC】'Whispering Tiger (Live Translate/Transcribe) - Whispering Tiger - OpenAI's whisper with OSC and Websocket support. Allowing live transcription / translation in VRChat and Overlays in most Streaming Applications' Sharrnah GitHub: github.com/Sharrnah/whispering

A web UI for OpenAI's open-source Whisper speech-to-text model 🔗 gitlab.com/aadnk/whisper-webui

【whisper_streaming: real-time Whisper transcription, aimed at long-form speech-to-text transcription and translation】'whisper_streaming - Whisper realtime streaming for long speech-to-text transcription and translation' ÚFAL GitHub: github.com/ufal/whisper_streaming

【Kesha v3.0 very early (aka Jarvis update): voice assistant experiment built with Silero TTS + Vosk STT + Picovoice Porcupine + ChatGPT】'Kesha v3.0 very early (aka Jarvis update) - Voice Assistant made as an experiment using Silero TTS + Vosk STT + Picovoice Porcupine + ChatGPT.' Abraham Tugalov GitHub: github.com/Priler/jarvis

faster-whisper is a reimplementation of OpenAI's Whisper model on the CTranslate2 engine (github.com/OpenNMT/CTranslate2), a fast inference engine for Transformer models. It runs 4-8x faster than the official Whisper. 🔗 github.com/guillaumekln/faster-whisper

【Text-prompted audio generation with voice cloning on custom audio/text pairs; supports Chinese】'Bark...but with the ability to use voice cloning on custom audio/text pairs - Text-prompted Generative Audio Model - With the ability to clone voices' SERP AI GitHub: github.com/serp-ai/bark-with-voice-clone

【Audio Slicer: a minimal GUI application that slices audio with silence detection】'Audio Slicer - A simple GUI application that slices audio with silence detection' flutydeer GitHub: github.com/flutydeer/audio-slicer

So-vits-svc (also known as Sovits) is a free, open-source AI voice conversion tool built on a line of projects including VITS, soft-vc, and VISinger2. Many AI song covers are trained with Sovits. 🔗github.com/svc-develop-team/so-vits-svc

【libvits-ncnn: an ncnn implementation of the VITS library enabling cross-platform GPU-accelerated speech synthesis; deep-learning inference runs through the ncnn library and is supported on both CPU and GPU】'libvits-ncnn - libvits-ncnn is an ncnn implementation of the VITS library that enables cross-platform GPU-accelerated speech synthesis.' SgDylan GitHub: github.com/Sg4Dylan/libvits-ncnn

【SummerTTS: a standalone C++ Chinese text-to-speech project that runs entirely locally with no network connection. It has almost no external dependencies and compiles independently in a C++ environment: the neural-network operators are implemented with the Eigen library, so it needs no frameworks such as PyTorch, TensorFlow, or ncnn. The model is based on the VITS speech-synthesis algorithm and runs on Linux platforms including Ubuntu, Android, and Raspberry Pi. The project offers a one-step build: drop a downloaded model into the project's model directory, then compile and test synthesis from the command line. Models of several sizes are provided to match different compute budgets and audio-quality needs.】'SummerTTS - a standalone Chinese speech synthesis(TTS) project that has almost no dependency and could be easily used for Chinese TTS with just one key build out' huakunyang GitHub: github.com/huakunyang/SummerTTS

【Whisper API Streaming: a streaming interface for OpenAI's Whisper model API; currently only response streaming is supported】'Whisper API Streaming - Thin wrapper around OpenAI Whisper API with streaming support' George Korepanov GitHub: github.com/gkorepanov/whisper-stream

【whisper-ctranslate2: command-line client compatible with the original OpenAI client, built on CTranslate2 and faster-whisper; about 4x faster than openai/whisper while using less memory】'whisper-ctranslate2 - Whisper command line client compatible with original OpenAI client based on CTranslate2.' Softcatalà GitHub: github.com/Softcatala/whisper-ctranslate2

【Voice activity detection (VAD) papers and code】'Voice activity detection (VAD) paper and code - Voice activity detection (VAD) paper(From 198*~2019)and its classification. The arrangement of these papers was arranged when I was studying for a double master degree in UNOKI LAB of JAIST. Now share it with those in need to learn.' LI NAN GitHub: github.com/linan2/Voice-activity-detection-VAD-paper-and-code
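As a baseline for the classical methods surveyed in that list, a minimal short-term-energy VAD with a hangover can be sketched as follows (illustrative only; the listed papers cover far more robust statistical and neural detectors):

```python
# Hypothetical sketch: a frame is voiced when its short-term energy
# exceeds a threshold; a "hangover" keeps the decision active for a few
# extra frames so brief pauses inside speech are not cut out.

def energy_vad(samples, frame_len=4, threshold=0.01, hangover=1):
    """Return one True/False speech decision per full frame."""
    decisions, hang = [], 0
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > threshold:
            decisions.append(True)
            hang = hangover          # re-arm the hangover counter
        elif hang > 0:
            decisions.append(True)   # hangover: still counted as speech
            hang -= 1
        else:
            decisions.append(False)
    return decisions
```

With a hangover of 1 frame, a single quiet frame right after speech is still labeled as speech, while longer silences flip to non-speech.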

【whisper-onnx-cpu: ONNX implementation of Whisper that runs without PyTorch or TensorFlow】'whisper-onnx-cpu - ONNX implementation of Whisper. PyTorch free.' Katsuya Hyodo GitHub: github.com/PINTO0309/whisper-onnx-cpu

Introduces LibriTTS-R, a speech dataset whose sample quality was improved through speech restoration, giving TTS research a boost. https://arxiv.org/abs/2305.18802 [AS]《LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus》Y Koizumi, H Zen, S Karita, Y Ding, K Yatabe, N Morioka, M Bacchiani, Y Zhang, W Han, A Bapna [Google & Tokyo University of Agriculture] (2023)

Audiocraft, a Python library Meta open-sourced on GitHub today, generates music directly with AI. GitHub: github.com/facebookresearch/audiocraft. At its core is MusicGen, a music generation model: a single-stage autoregressive Transformer trained over a 32kHz EnCodec tokenizer with 4 codebooks sampled at 50Hz.
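The "4 codebooks" refer to residual vector quantization (RVQ), the scheme EnCodec uses: each codebook quantizes the residual error left by the previous one, so a frame ends up as one code index per codebook. A toy 1-D sketch with hand-made scalar codebooks (values are hypothetical, purely for illustration):

```python
# RVQ sketch: encode a scalar with a cascade of tiny codebooks, where
# each stage quantizes what the previous stage could not represent.

def nearest(codebook, x):
    """Index of the codebook entry closest to x."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - x))

def rvq_encode(x, codebooks):
    codes, residual = [], x
    for cb in codebooks:
        i = nearest(cb, residual)
        codes.append(i)
        residual -= cb[i]        # next stage sees only the leftover error
    return codes

def rvq_decode(codes, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, codes))
```

Each added codebook shrinks the reconstruction error, which is why EnCodec can trade bitrate for quality by varying the number of codebooks.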

'TTS Generation WebUI (Bark v2, MusicGen, Tortoise, Vocos)' Roberts Slisans GitHub: github.com/rsxdalv/tts-generation-webui

【Generate and train short audio samples with unconditional waveform diffusion on ordinary consumer hardware (<2GB VRAM GPU)】'A repository for generating and training short audio samples with unconditional waveform diffusion on accessible consumer hardware (<2GB VRAM GPU)' Christopher Landschoot GitHub: github.com/crlandsc/tiny-audio-diffusion

【PhoneLM: text-to-speech (TTS) with phonemes as inputs and audio codec codewords as outputs, loosely based on the MegaByte, VALL-E, and EnCodec models; uses G2P to encode text into phonemes and EnCodec to encode and decode audio】'PhoneLM - (R&D) Text to speech using phonemes as inputs and audio codec codes as outputs. Loosely based on MegaByte, VALL-E and Encodec.' MiscellaneousStuff GitHub: github.com/MiscellaneousStuff/PhoneLM

【free-music-demixer: free, client-side static website for music demixing (a.k.a. source separation), running the Open-Unmix AI model (UMX-L weights) in the browser via WebAssembly】'free-music-demixer - Open-Unmix (UMX-L) running client-side in the browser with WebAssembly' Sevag H GitHub: github.com/sevagh/free-music-demixer

Memo - an AI-powered tool for transcribing video and podcasts and producing subtitles. Memo runs on multiple platforms and uses Whisper to recognize speech into subtitles, which can then be lightly edited. Recognized subtitles can also be translated, with support for Google Translate and OpenAI (your own API key required). The interface is friendly and recognition quality is good; translation of ordinary sentences also works well, though it can stumble on complex sentences, and merging subtitles is slightly cumbersome. https://mxmefbp9p0g.feishu.cn/docx/ZI3ldweTXorTvMxYLbucT00Un5n

【RTVC: Real-Time Voice Conversion GUI: a GUI for real-time voice conversion (voice changing)】'RTVC: Real-Time Voice Conversion GUI' Fish Audio GitHub: github.com/fishaudio/realtime-vc-gui

【Wordcab Transcribe: FastAPI service for speech recognition (ASR) using faster-whisper, with multi-scale auto-tuning spectral clustering for diarization】'Wordcab Transcribe- ASR FastAPI server using faster-whisper and Multi-Scale Auto-Tuning Spectral Clustering for diarization.' Wordcab GitHub: github.com/Wordcab/wordcab-transcribe

【april-asr: speech-to-text (STT) library written in C】'april-asr - Speech-to-text library in C' abb128 GitHub: github.com/abb128/april-asr

'SummerAsr - a standalone Chinese speech recognition project written in C++ that can easily be built on its own with almost no external dependencies' huakunyang GitHub: github.com/huakunyang/SummerAsr

【Grad-SVC: singing voice conversion based on Grad-TTS from HUAWEI Noah's Ark Lab; the core algorithm is diffusion】'Grad-SVC based Grad-TTS from HUAWEI Noah's Ark Lab - Singing Voice Conversion based on Grad-TTS. The core algorithm is diffusion.' PlayVoice GitHub: github.com/PlayVoice/Grad-SVC

【SpeechMOS: predict subjective speech quality (MOS) scores with just 2 lines of code; supports multiple MOS prediction systems】'SpeechMOS - Easy-to-Use Speech MOS predictors' tarepan GitHub: github.com/tarepan/SpeechMOS

【Leaderboard of open automatic speech recognition (ASR) models】《Open ASR Leaderboard - a Hugging Face Space by hf-audio》 https://huggingface.co/spaces/hf-audio/

【Light Speed: open-source VITS-based text-to-speech model】'Light Speed - A modified VITS that utilizes phoneme duration's ground truth for better robustness' NTT123 GitHub: github.com/NTT123/light-speed

lalal.ai is a remarkably capable audio processing tool: it precisely separates and losslessly extracts stems from complex mixed tracks, and it held up well when I tried it. It targets two scenarios, stem extraction and sound removal: it can pull out vocals, drums, bass, guitar, strings, and so on, or strip background music, microphone rumble, and other unwanted noise. The demo video below shows the separation of accompaniment and vocals fairly clearly. Digging into how this works, I found a paper on MSS (Musical Source Separation): inria.hal.science/hal-01945345/document. It covers the two more traditional approaches, model-based and signal-processing-based, and notes that deep neural networks are increasingly applied to this problem; the biggest limitation remains the scarcity of training data. For example, asking a tool to isolate just the birdsong in a recording is likely still a struggle.
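A common idea in the MSS literature referenced above is time-frequency masking: a network estimates per-bin magnitudes for the target source, and a soft (ratio) mask derived from them is applied to the mixture. A minimal sketch over plain magnitude lists, with the network's estimates assumed given:

```python
# Ratio-mask sketch: the mask weight per bin is the target's share of
# the total estimated energy, so bins dominated by the target pass
# through and bins dominated by the residual are suppressed.

def ratio_mask(target_mag, residual_mag, eps=1e-8):
    """Soft mask in [0, 1] per time-frequency bin (eps avoids 0/0)."""
    return [t / (t + r + eps) for t, r in zip(target_mag, residual_mag)]

def apply_mask(mixture, mask):
    """Element-wise masking of the mixture spectrogram bins."""
    return [m * w for m, w in zip(mixture, mask)]
```

Real systems apply this over an STFT and resynthesize with the mixture phase; the masking step itself is exactly this element-wise product.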

INT4 low-precision builds of the Whisper speech recognition models, which run faster in compute-constrained environments: huggingface.co/Intel/whisper-tiny-onnx-int4 huggingface.co/Intel/whisper-base-onnx-int4 huggingface.co/Intel/whisper-small-onnx-int4 huggingface.co/Intel/whisper-medium-onnx-int4 huggingface.co/Intel/whisper-large-onnx-int4 huggingface.co/Intel/whisper-large-v2-onnx-int4

'Open-Lyrics - Transcribe (whisper) and translate (gpt) voice into LRC file. Transcribes audio with Whisper and translates it into subtitle files with GPT' zh-plus GitHub: github.com/zh-plus/openlrc

【Insanely Fast Whisper: extremely fast Whisper transcription script; transcribes 5 hours of audio in under 10 minutes with OpenAI's Whisper Large v2】'Insanely Fast Whisper' by Vaibhav Srivastav GitHub: github.com/Vaibhavs10/insanely-fast-whisper

Voice Changer is a real-time voice conversion client for Windows and Mac. It can change your voice into another person's or a virtual character's timbre in real time, and can plug in several voice conversion back ends, for example:

  • MMVC (github.com/isletennos/MMVC_Trainer)
  • so-vits-svc (github.com/svc-develop-team/so-vits-svc)
  • RVC (Retrieval-based-Voice-Conversion) (github.com/liujing04/Retrieval-based-Voice-Conversion-WebUI)
  • DDSP-SVC (github.com/yxlllc/DDSP-SVC)

A fairly detailed YouTube tutorial covers how to use it: www.youtube.com/watch?v=_JXbvSTGPoo Project: github.com/w-okada/voice-changer

【Distil-Whisper: distilled Whisper; 6x faster speech recognition with a 49% smaller model】'Distil-Whisper' by Hugging Face GitHub: github.com/huggingface/distil-whisper

【RealtimeSTT: real-time speech-to-text library implementing the mainstream STT algorithms; performant and easy to integrate, and well suited to building real-time voice interaction systems such as voice assistants and voice-driven forms】'RealtimeSTT - A robust, efficient, low-latency speech-to-text library with advanced voice activity detection, wake word activation and instant transcription. Designed for real-time applications like voice assistants.' Kolja Beigel GitHub: github.com/KoljaB/RealtimeSTT

A voice cloning project that creates an AI voice clone from just a few seconds of audio. XTTS v2 has just been released, with these key updates: ✅ better zero-shot cloning ✅ cloning from more data ✅ more natural intonation and expressiveness ✅ support for Hungarian and Korean. Project: github.com/coqui-ai/tts

【whisper-cpp-python: Python bindings for whisper.cpp】'whisper-cpp-python - whisper.cpp bindings for python' Carlos Cardoso Dias GitHub: github.com/carloscdias/whisper-cpp-python

Seamless, Meta's new real-time speech translation model, preserves the speaker's expression and style in the translated voice. A notable advance is that it judges whether the current context is sufficient to produce output; if the true meaning of the speech cannot yet be determined, it waits for more input before translating. Meta claims it surpasses Whisper and AudioPalm 2 in speech-to-text and speech translation. Seamless comprises a family of speech models:

  • SeamlessM4Tv2: the foundational multilingual model
  • SeamlessStreaming: real-time translation
  • SeamlessExpressive: preserves the original voice's expression and style during translation
  • Seamless: integrates all of the above GitHub: github.com/facebookresearch/seamless_communication

【Insanely Fast Whisper (CLI): extremely fast Whisper-based audio transcription command-line tool; transcribes 300 minutes of audio in 10 minutes with Whisper Large v2】'Insanely Fast Whisper (CLI) - The fastest Whisper optimization for automatic speech recognition as a command-line interface' ochen1 GitHub: github.com/ochen1/insanely-fast-whisper-cli

【abracadabra: song recognition tool written in Python that implements the audio search algorithm from the Shazam paper; it can identify a song playing through your computer's microphone, and can be used for applications such as aligning audio across multiple videos and deduplicating music libraries】'abracadabra: Sound recognition in Python' Cameron MacLeod GitHub: github.com/notexactlyawe/abracadabra
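The landmark scheme from the Shazam paper that abracadabra implements can be sketched as follows: spectrogram peaks are paired within a small target zone, each pair is hashed as (f1, f2, Δt), and recognition reduces to finding a consistent time offset between query and database hash hits. Peak picking from a real spectrogram is omitted here; peaks are assumed given as (time, frequency) tuples:

```python
# Hypothetical sketch of landmark hashing and offset voting.
from collections import Counter

def hash_peaks(peaks, fan_out=3):
    """Yield (hash, anchor_time) pairs from time-sorted (time, freq) peaks."""
    out = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            out.append(((f1, f2, t2 - t1), t1))  # hash = (f1, f2, dt)
    return out

def match_offset(db_hashes, query_hashes):
    """Most common (db_time - query_time) offset and its vote count."""
    index = {}
    for h, t in db_hashes:
        index.setdefault(h, []).append(t)
    offsets = Counter()
    for h, t in query_hashes:
        for t_db in index.get(h, []):
            offsets[t_db - t] += 1
    return offsets.most_common(1)[0] if offsets else None
```

A true match shows up as one dominant offset bin gathering most of the votes, while random hash collisions scatter thinly across many offsets.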