Feature request: Add FunASR/SenseVoice as audio model backend

## Summary

Xinference already serves LLM, embedding, image, and audio models. Would you consider adding [FunASR](https://github.com/modelscope/FunASR) models (SenseVoice, Paraformer, Fun-ASR-Nano) as audio/speech model backends?

## Why FunASR?

FunASR is one of the most widely used open-source ASR toolkits for Chinese and multilingual speech recognition. Its main models:

- **SenseVoice** (234M params): Non-autoregressive, ~25x faster than Whisper-large, 50+ languages, emotion + audio event detection
- **Paraformer** (220M params): ~170x realtime on GPU for Chinese, built-in VAD + punctuation
- **Fun-ASR-Nano** (800M params): LLM-based ASR (SenseVoice encoder + Qwen3-0.6B decoder), 31 languages
- **cam++**: Speaker diarization model (7.2M params)

## Integration

FunASR already provides an OpenAI-compatible API server:

```bash
pip install funasr vllm
funasr-server --device cuda
# http://localhost:8000/v1/audio/transcriptions
```

This could serve as a reference for integrating into Xinference's model serving framework.
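For illustration, here is a minimal sketch of how a client could talk to that endpoint using only the Python standard library. The URL path comes from the snippet above. The `file` and `model` form fields and the `text` key in the JSON response follow the OpenAI transcription API convention and are assumed to apply here. The model name `SenseVoiceSmall` and the helper names are hypothetical.

```python
import io
import json
import urllib.request
import uuid


def build_transcription_request(
    audio_bytes: bytes,
    filename: str = "sample.wav",
    model: str = "SenseVoiceSmall",  # assumed name; check what the server exposes
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build a multipart/form-data POST for /v1/audio/transcriptions."""
    boundary = uuid.uuid4().hex
    body = io.BytesIO()

    # Text form field carrying the model name (OpenAI-style convention).
    body.write(f"--{boundary}\r\n".encode())
    body.write(b'Content-Disposition: form-data; name="model"\r\n\r\n')
    body.write(model.encode() + b"\r\n")

    # File form field carrying the raw audio bytes.
    body.write(f"--{boundary}\r\n".encode())
    body.write(
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode()
    )
    body.write(b"Content-Type: application/octet-stream\r\n\r\n")
    body.write(audio_bytes + b"\r\n")

    # Closing boundary terminates the multipart body.
    body.write(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        url=f"{base_url}/v1/audio/transcriptions",
        data=body.getvalue(),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


def transcribe(path: str, **kwargs) -> str:
    """Send an audio file to a running server and return the transcript.

    Assumes the response is JSON with a "text" field, as in the OpenAI API.
    """
    with open(path, "rb") as f:
        req = build_transcription_request(f.read(), filename=path, **kwargs)
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["text"]
```

With the server from the snippet above running, `transcribe("sample.wav")` would return the recognized text. Because the request shape matches the OpenAI audio API, an Xinference integration could presumably expose these models through its existing audio transcription route without a new client-facing interface.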

## Model ecosystem

| Model | Task | Params | Speed / notes |
|-------|------|--------|---------------|
| SenseVoice-Small | ASR + emotion | 234M | ~25x faster than Whisper-large |
| Paraformer-large | Chinese ASR | 220M | ~170x realtime |
| Fun-ASR-Nano | Multilingual ASR | 800M | LLM-based |
| FSMN-VAD | Voice Activity Detection | 0.4M | — |
| CT-Punc | Punctuation | — | — |
| cam++ | Speaker Diarization | 7.2M | — |

All models are available on ModelScope and Hugging Face (FunAudioLLM org).

## References

- FunASR: https://github.com/modelscope/FunASR (16K+ stars)
- SenseVoice: https://github.com/FunAudioLLM/SenseVoice (8.3K+ stars)
- Install: `pip install funasr`
