Feature: Add SenseVoice as STT engine — 10x faster than Whisper, 50+ languages #720

@LauraGPT

Hi! Voicebox is an incredible project. I noticed the roadmap mentions adding more STT engines (Parakeet v3, Qwen3-ASR).

I'd like to suggest SenseVoice as another engine option. It's a non-autoregressive ASR model that's particularly well-suited for dictation/voice input use cases:

Why SenseVoice fits Voicebox well:

  • ~10x faster than Whisper Large — 50ms for 10s audio on GPU, making dictation feel instant
  • Non-autoregressive: Single forward pass, no beam search — extremely predictable latency
  • 50+ languages including Chinese, English, Japanese, Korean, French, German, Spanish, etc.
  • Built-in emotion detection: Could enrich Captures with emotional context
  • 234M parameters — smaller than Whisper Turbo, runs great on consumer hardware
  • ONNX export available: runs through Sherpa-ONNX, including on Apple Silicon
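To check the latency claims on your own hardware, a small engine-agnostic timing harness could look like the sketch below. `measure_latency` and its return keys are names I made up for illustration; `transcribe` is any callable that takes a clip and returns text.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency(transcribe: Callable[[object], str],
                    clips: Sequence[object],
                    warmup: int = 1) -> dict:
    """Time transcribe() on each clip and summarize latency in milliseconds."""
    if not clips:
        raise ValueError("need at least one clip")
    # Warm-up calls absorb one-time costs (model load, GPU init) so they
    # do not skew the timed runs.
    for clip in clips[:warmup]:
        transcribe(clip)

    samples_ms = []
    for clip in clips:
        start = time.perf_counter()
        transcribe(clip)
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    return {
        "runs": len(samples_ms),
        "median_ms": statistics.median(samples_ms),
        "max_ms": samples_ms[-1],
        # A small spread is what "predictable latency" should look like.
        "spread_ms": samples_ms[-1] - samples_ms[0],
    }
```

Running the same clips through the Whisper backend and a SenseVoice backend would give a like-for-like comparison of median latency and spread.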

Integration via FunASR (Python/PyTorch):

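from funasr import AutoModel

# Downloads the model on first use; pass device="cuda:0" to run on GPU.
model = AutoModel(model="iic/SenseVoiceSmall")
# audio_chunk: the audio to transcribe, e.g. a path to a WAV file
result = model.generate(input=audio_chunk)
text = result[0]["text"]  # raw output, including language/emotion tags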
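For the emotion-context idea, the raw `text` would need to be split into tags and transcript. Here is a minimal sketch, assuming the raw output prefixes the transcript with `<|...|>` tags such as `<|en|><|HAPPY|><|Speech|>`. The `Capture` dataclass, `parse_capture`, and the exact emotion label set are my assumptions, not Voicebox or FunASR API.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

TAG_RE = re.compile(r"<\|([^|]*)\|>")
# Assumed emotion labels; adjust to whatever the model actually emits.
EMOTIONS = {"HAPPY", "SAD", "ANGRY", "NEUTRAL", "FEARFUL", "DISGUSTED", "SURPRISED"}


@dataclass
class Capture:
    text: str               # transcript with all tags removed
    emotion: Optional[str]  # first recognized emotion tag, if any
    tags: List[str]         # every tag found, in order


def parse_capture(raw: str) -> Capture:
    """Split a tagged transcript into plain text, emotion, and raw tags."""
    tags = TAG_RE.findall(raw)
    emotion = next((t for t in tags if t in EMOTIONS), None)
    text = TAG_RE.sub("", raw).strip()
    return Capture(text=text, emotion=emotion, tags=tags)
```

With the snippet above, `parse_capture(result[0]["text"])` would give a Capture whose `emotion` field could be stored next to the transcript. If only clean text is needed, I believe FunASR's `rich_transcription_postprocess` helper covers that case.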

Or for Apple Silicon via Sherpa-ONNX (C++/Swift):
SenseVoice models are available in ONNX format and run natively on macOS/iOS via Sherpa-ONNX, so they could run alongside Voicebox's MLX-based architecture.

OpenAI-compatible server mode:

pip install funasr
funasr-server --device cuda  # or --device cpu for Mac
# Exposes the same /v1/audio/transcriptions endpoint as the OpenAI API
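On the client side, talking to that endpoint could be as small as the stdlib-only sketch below. The address `http://localhost:8000` is a placeholder I picked, the form fields `model` and `file` and the `text` response key follow the OpenAI transcription API, and I have not checked which model names the server accepts.

```python
import json
import urllib.request
import uuid

# Hypothetical address; use whatever host/port the server actually binds to.
BASE_URL = "http://localhost:8000"


def build_transcription_request(wav_bytes: bytes,
                                model: str = "iic/SenseVoiceSmall",
                                base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="model"\r\n\r\n'
        f"{model}\r\n"
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="file"; filename="clip.wav"\r\n'
        "Content-Type: audio/wav\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=head + wav_bytes + tail,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(wav_bytes: bytes) -> str:
    """Send one clip to the server and return the transcript text."""
    request = build_transcription_request(wav_bytes)
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)["text"]
```

Because the request shape matches the OpenAI API, an existing Whisper-over-HTTP client should only need its base URL changed.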

For the push-to-talk dictation workflow, the speed difference between Whisper and SenseVoice is very noticeable — text appears almost instantly after releasing the hotkey. Happy to help with integration!
