Feature request: Add FunASR/SenseVoice as audio model backend #4973

@LauraGPT

Summary

Xinference supports various LLM/embedding/image/audio models. Would you consider adding FunASR models (SenseVoice, Paraformer, Fun-ASR-Nano) as audio/speech model backends?

Why FunASR?

FunASR is a widely used open-source ASR toolkit for Chinese and multilingual speech recognition. Its main models:

  • SenseVoice (234M params): Non-autoregressive, ~25x faster than Whisper-large, 50+ languages, emotion + audio event detection
  • Paraformer (220M params): ~170x realtime on GPU for Chinese, built-in VAD + punctuation
  • Fun-ASR-Nano (800M params): LLM-based ASR (SenseVoice encoder + Qwen3-0.6B decoder), 31 languages
  • cam++: Speaker diarization model (7.2M params)
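
To put the speed figures in concrete terms, here is a small sketch that converts a realtime multiple into wall-clock processing time. It assumes "Nx realtime" means N seconds of audio transcribed per second of compute; the 170x figure is the one quoted above, not a measurement, and the function names are illustrative only.

```python
def processing_seconds(audio_seconds: float, realtime_speedup: float) -> float:
    """Wall-clock seconds needed to transcribe audio at a given realtime multiple."""
    if realtime_speedup <= 0:
        raise ValueError("realtime_speedup must be positive")
    return audio_seconds / realtime_speedup


def realtime_factor(realtime_speedup: float) -> float:
    """RTF is the inverse of the realtime multiple (lower is faster)."""
    return 1.0 / realtime_speedup


one_hour = 3600.0
# Using the ~170x figure quoted for Paraformer on GPU (an unverified claim here).
print(f"{processing_seconds(one_hour, 170):.1f} s")  # 21.2 s
print(f"RTF = {realtime_factor(170):.4f}")
```

Under that assumption, one hour of audio would take roughly 21 seconds to transcribe.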

Integration

FunASR already provides an OpenAI-compatible API server:

pip install funasr vllm
funasr-server --device cuda
# http://localhost:8000/v1/audio/transcriptions

This could serve as a reference for integrating into Xinference's model serving framework.
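
For reference, a minimal client sketch against that endpoint, using only the Python standard library. It assumes the server follows the OpenAI transcription request shape (a multipart form with `file` and `model` fields, and a JSON response containing `text`). The model name `SenseVoiceSmall` and the helper names are hypothetical placeholders, not confirmed FunASR identifiers.

```python
import json
import urllib.request
import uuid
from pathlib import Path

ENDPOINT = "http://localhost:8000/v1/audio/transcriptions"  # URL quoted above


def build_multipart(fields: dict, filename: str, file_bytes: bytes, boundary: str) -> bytes:
    """Encode text fields plus one file part named "file" as multipart/form-data."""
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts)


def transcribe(path: str, model: str = "SenseVoiceSmall", endpoint: str = ENDPOINT) -> str:
    """POST an audio file and return the transcript text.

    The default model name is a placeholder; use whatever the server expects.
    """
    boundary = uuid.uuid4().hex
    audio = Path(path)
    body = build_multipart({"model": model}, audio.name, audio.read_bytes(), boundary)
    request = urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        # Assumes an OpenAI-style JSON body: {"text": "..."}
        return json.loads(response.read())["text"]
```

With the server running, `print(transcribe("sample.wav"))` would print the transcript. Because the request shape matches what existing OpenAI-style audio clients send, a Xinference backend exposing the same route could presumably be exercised with the same client.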

Model ecosystem

| Model | Task | Params | Speed |
|---|---|---|---|
| SenseVoice-Small | ASR + emotion | 234M | ~25x vs Whisper |
| Paraformer-large | Chinese ASR | 220M | ~170x realtime |
| Fun-ASR-Nano | Multilingual ASR | 800M | LLM-based |
| FSMN-VAD | Voice Activity Detection | 0.4M | |
| CT-Punc | Punctuation | | |
| cam++ | Speaker Diarization | 7.2M | |

All models are available on ModelScope and Hugging Face (FunAudioLLM org).
