Hi! Voicebox is an incredible project. I noticed the roadmap mentions adding more STT engines (Parakeet v3, Qwen3-ASR).
I'd like to suggest SenseVoice as another engine option. It's a non-autoregressive ASR model that is particularly well suited to dictation and voice-input use cases.
Why SenseVoice fits Voicebox well:
- Roughly 10x faster than Whisper Large: tens of milliseconds for 10 s of audio on GPU (the SenseVoice README cites about 70 ms), so dictation feels instant
- Non-autoregressive: Single forward pass, no beam search — extremely predictable latency
- Multilingual: the SenseVoice family is described as covering 50+ languages, while the open SenseVoiceSmall checkpoint focuses on Mandarin, Cantonese, English, Japanese, and Korean
- Built-in emotion detection: Could enrich Captures with emotional context
- 234M parameters — smaller than Whisper Turbo, runs great on consumer hardware
- ONNX export available, so it can run on Apple Silicon through Sherpa-ONNX (which uses ONNX Runtime rather than MLX)
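To make the speed and latency claims above easy to check on Voicebox's own hardware, here is a minimal timing sketch. `measure_latency` and its parameters are hypothetical names of mine, not part of Voicebox or FunASR; it accepts any callable that takes audio bytes and returns text, so the same harness could wrap Whisper and SenseVoice for a side-by-side comparison.

```python
import statistics
import time
from typing import Callable, Dict, Sequence


def measure_latency(transcribe: Callable[[bytes], str],
                    clips: Sequence[bytes],
                    warmup: int = 1) -> Dict[str, float]:
    """Time `transcribe` on each clip and summarize latency in milliseconds."""
    if not clips:
        raise ValueError("need at least one clip")

    # Warm-up runs absorb one-time costs (model load, kernel compilation).
    for clip in clips[:warmup]:
        transcribe(clip)

    samples = []
    for clip in clips:
        start = time.perf_counter()
        transcribe(clip)
        samples.append((time.perf_counter() - start) * 1000.0)

    return {
        "runs": len(samples),
        "median_ms": statistics.median(samples),
        "max_ms": max(samples),
        # The spread is the interesting number for "predictable latency".
        "stdev_ms": statistics.pstdev(samples),
    }
```

Feeding it clips of a fixed length (say 10 s each) keeps the numbers comparable across engines.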
Integration via FunASR (Python/PyTorch):
from funasr import AutoModel
model = AutoModel(model="iic/SenseVoiceSmall")
result = model.generate(input=audio_chunk)
text = result[0]["text"]
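One detail worth planning for: in my experience the raw `text` comes back prefixed with tags such as `<|en|><|HAPPY|><|Speech|>`, which is where the emotion information lives. FunASR ships a `rich_transcription_postprocess` helper for display-oriented cleanup; the sketch below is a dependency-free alternative that separates the tags from the transcript so a Capture could store both. The tag pattern and the emotion label set are assumptions based on typical output, and `split_tags` is a hypothetical helper name.

```python
import re
from typing import Dict, List, Optional, Union

# Tags look like <|en|>, <|HAPPY|>, <|Speech|>. This pattern is an assumption
# based on typical SenseVoice output, not a documented contract.
_TAG = re.compile(r"<\|([^|]*)\|>")

# Assumed emotion labels; adjust to whatever the model actually emits.
_EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL",
             "FEARFUL", "DISGUSTED", "SURPRISED"}


def split_tags(raw: str) -> Dict[str, Union[str, Optional[str], List[str]]]:
    """Separate <|...|> tags from the transcript text."""
    tags = _TAG.findall(raw)
    # First tag that matches a known emotion label, or None if there is none.
    emotion = next((t for t in tags if t in _EMOTIONS), None)
    text = _TAG.sub("", raw).strip()
    return {"text": text, "emotion": emotion, "tags": tags}
```

With this, `split_tags(result[0]["text"])` would give the dictation pipeline clean text while keeping the emotion label available as metadata.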
Or for Apple Silicon via Sherpa-ONNX (C++/Swift):
SenseVoice models are available in ONNX format and run natively on macOS/iOS via Sherpa-ONNX. Sherpa-ONNX runs on ONNX Runtime rather than MLX, so it would sit alongside Voicebox's MLX-based engines as a separate backend.
OpenAI-compatible server mode:
pip install funasr
funasr-server --device cuda # or --device cpu for Mac
# Same /v1/audio/transcriptions endpoint
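If the server does expose the endpoint above, a client needs nothing beyond the standard library. This is a rough sketch: the base URL, the `model` value, and the `{"text": ...}` response shape follow the OpenAI convention and are assumptions on my part for this server, and `build_multipart` and `transcribe` are hypothetical helper names.

```python
import json
import urllib.request
import uuid
from typing import Dict, Optional, Tuple


def build_multipart(fields: Dict[str, str], filename: str, audio: bytes,
                    boundary: Optional[str] = None) -> Tuple[bytes, str]:
    """Encode form fields plus one audio file as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\n'
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f'{value}\r\n'.encode()
        )
    # The audio goes in the "file" field, as in the OpenAI transcription API.
    parts.append(
        f'--{boundary}\r\n'
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f'Content-Type: application/octet-stream\r\n\r\n'.encode()
    )
    parts.append(audio)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(base_url: str, audio: bytes,
               model: str = "SenseVoiceSmall",  # assumed model identifier
               timeout: float = 30.0) -> str:
    """POST audio to /v1/audio/transcriptions and return the text field."""
    body, content_type = build_multipart({"model": model}, "clip.wav", audio)
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["text"]
```

Because the request shape matches what an OpenAI-style Whisper endpoint expects, the same client could be pointed at either backend just by changing `base_url`.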
For the push-to-talk dictation workflow, the speed difference between Whisper and SenseVoice is very noticeable — text appears almost instantly after releasing the hotkey. Happy to help with integration!