Stars
Codebase for 'Scaling Rich Style-Prompted Text-to-Speech Datasets'
Retrieval-Augmented Theorem Provers for Lean
Fast and memory-efficient exact attention
UniCodec: a unified audio codec with a single codebook to support multi-domain audio data, including speech, music, and sound
A neural full-band audio codec for general audio sampled at 48 kHz, at 7.5 kbps or 4.5 kbps.
SlamKit is an open-source toolkit for efficient training of SpeechLMs. It was used for "Slamming: Training a Speech Language Model on One GPU in a Day"
✨✨Freeze-Omni: A Smart and Low-Latency Speech-to-Speech Dialogue Model with a Frozen LLM
Baichuan-Audio: A Unified Framework for End-to-End Speech Interaction
Muon optimizer: >30% sample-efficiency gain with <3% wall-clock overhead
VoiceBench: Benchmarking LLM-Based Voice Assistants
The official repository of SpeechCraft dataset, a large-scale expressive bilingual speech dataset with natural language descriptions.
OSUM: Open Speech Understanding Model, open-sourced by ASLP@NPU.
Implementation of the sparse attention pattern proposed by the Deepseek team in their "Native Sparse Attention" paper
SSR-Speech: Towards Stable, Safe and Robust Zero-shot Speech Editing and Synthesis
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS …
🚀🚀 Train a 26M-parameter GPT ("large model") completely from scratch in just 2 hours! 🌏
A high-throughput and memory-efficient inference and serving engine for LLMs
Witness the "aha moment" of a VLM for less than $3.
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR benchmarks, while also offering outstanding singing lyrics rec…
Ultra-low-bitrate Speech Codec for Speech Language Modeling Applications
Starter code for working with the YouTube-8M dataset.
LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis