Feature: Add SenseVoice as STT engine — 10x faster than Whisper, 50+ languages

Hi! Voicebox is an incredible project. I noticed the roadmap mentions adding more STT engines (Parakeet v3, Qwen3-ASR).

I'd like to suggest [SenseVoice](https://github.com/FunAudioLLM/SenseVoice) as another engine option. It's a non-autoregressive ASR model that's particularly well-suited for dictation/voice input use cases:

**Why SenseVoice fits Voicebox well:**
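- **Much faster than Whisper Large**: the SenseVoice README reports about 70 ms to process 10 s of audio with SenseVoice-Small, roughly 15x faster than Whisper-Large, so dictation feels close to instant
- **Non-autoregressive**: single forward pass, no beam search, so latency is very predictable
- **50+ languages** for the SenseVoice family per its README. As far as I can tell, the open SenseVoiceSmall checkpoint covers Chinese, Cantonese, English, Japanese, and Korean, so French, German, Spanish, etc. should be verified before relying on them
- **Built-in emotion detection**: could enrich Captures with emotional context
- **234M parameters** (SenseVoice-Small): smaller than Whisper Turbo, runs well on consumer hardware
- **ONNX compatible** via [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx), which runs on Apple Silicon through ONNX Runtime (this is not MLX)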
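The speed claim is easy to check locally before committing to an integration. Here is a minimal sketch of a timing harness that reports median latency and real-time factor for any engine. The `transcribe` callable and the `measure_latency` name are hypothetical placeholders, not part of Voicebox or FunASR.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(transcribe: Callable[[bytes], str], audio: bytes,
                    audio_seconds: float, runs: int = 5) -> Dict[str, float]:
    """Time `transcribe` over several runs; report median latency and RTF."""
    if runs < 1 or audio_seconds <= 0:
        raise ValueError("runs must be >= 1 and audio_seconds must be > 0")
    transcribe(audio)  # warm-up run so model loading and caching are not counted
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(audio)
        timings.append(time.perf_counter() - start)
    median_s = statistics.median(timings)
    return {
        "median_ms": median_s * 1000.0,
        # Real-time factor: processing time divided by audio duration
        "rtf": median_s / audio_seconds,
    }


if __name__ == "__main__":
    # Stand-in engine for demonstration; swap in a real Whisper or SenseVoice call.
    def fake_engine(_audio: bytes) -> str:
        time.sleep(0.01)
        return "hello"

    print(measure_latency(fake_engine, b"\x00" * 320000, audio_seconds=10.0))
```

Running the same clip through both engines on the same machine should give a fairer comparison than published numbers, since hardware and batch settings differ.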

**Integration via FunASR (Python/PyTorch):**
```python
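from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess

# audio_chunk is supplied by the caller, e.g. a path to a WAV file
model = AutoModel(model="iic/SenseVoiceSmall")
result = model.generate(input=audio_chunk, language="auto", use_itn=True)
# Raw output carries tags such as <|en|><|NEUTRAL|>; strip them for display
text = rich_transcription_postprocess(result[0]["text"])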
```
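For the emotion idea, here is a rough sketch of how the raw `result[0]["text"]` string could be split into plain text plus an emotion label. It assumes the raw output prefixes the transcript with inline tags like `<|en|><|HAPPY|><|Speech|><|withitn|>`; the emotion label set and the `ParsedTranscript` and `parse_sensevoice_output` names are my own assumptions and should be checked against real model output.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed tag format: <|en|><|HAPPY|><|Speech|><|withitn|>transcribed text
_TAG = re.compile(r"<\|([^|]*)\|>")

# Assumed emotion labels; adjust to whatever the model actually emits.
_EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL", "DISGUSTED", "SURPRISED"}


@dataclass
class ParsedTranscript:
    text: str
    emotion: Optional[str] = None
    tags: List[str] = field(default_factory=list)


def parse_sensevoice_output(raw: str) -> ParsedTranscript:
    """Separate inline <|...|> tags from the transcript text."""
    tags = _TAG.findall(raw)
    # First tag that matches a known emotion label, if any
    emotion = next((t for t in tags if t.upper() in _EMOTIONS), None)
    text = _TAG.sub("", raw).strip()
    return ParsedTranscript(text=text, emotion=emotion, tags=tags)
```

A Capture could then store `parsed.text` as the transcript and `parsed.emotion` as optional metadata, leaving it empty when no emotion tag is present.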

**Or on Apple Silicon via Sherpa-ONNX (C++/Swift):**
SenseVoice models are available in ONNX format and run natively on macOS/iOS via Sherpa-ONNX. Sherpa-ONNX uses ONNX Runtime rather than MLX, so it would sit alongside Voicebox's MLX architecture as a separate backend instead of plugging into MLX directly.

**OpenAI-compatible server mode:**
```bash
pip install funasr
funasr-server --device cuda  # use --device cpu on a Mac
# Serves the same /v1/audio/transcriptions route as the OpenAI API
```
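If the server route is used, Voicebox would only need a small HTTP client. Here is a stdlib-only sketch that builds the multipart request the OpenAI transcription endpoint expects (`file` and `model` form fields). The base URL, port, default filename, and the model name the server accepts are all assumptions on my part, and the function names are hypothetical.

```python
import json
import urllib.request
import uuid


def build_transcription_request(audio: bytes, filename: str = "capture.wav",
                                model: str = "iic/SenseVoiceSmall",
                                base_url: str = "http://localhost:8000",
                                ) -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    parts = [
        # "model" form field
        (f"--{boundary}\r\n"
         'Content-Disposition: form-data; name="model"\r\n\r\n'
         f"{model}\r\n").encode(),
        # "file" form field carrying the raw audio bytes
        (f"--{boundary}\r\n"
         f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
         "Content-Type: application/octet-stream\r\n\r\n").encode() + audio + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ]
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(audio: bytes, **kwargs) -> str:
    """Send the request and return the "text" field of the JSON response."""
    req = build_transcription_request(audio, **kwargs)
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["text"]
```

Keeping request construction separate from the network call makes the client testable without a running server.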

For the push-to-talk dictation workflow, the speed difference between Whisper and SenseVoice is very noticeable — text appears almost instantly after releasing the hotkey. Happy to help with integration!

- SenseVoice: https://github.com/FunAudioLLM/SenseVoice (8K+ stars)
- FunASR toolkit: https://github.com/modelscope/FunASR (16K+ stars)
- Sherpa-ONNX (cross-platform): https://github.com/k2-fsa/sherpa-onnx (5K+ stars)
