InteractiveOmni-4B 🤗 | InteractiveOmni-8B 🤗 | 📑 Paper
- 2025.10.15: 👋 We release the inference code and model weights of InteractiveOmni-4B and InteractiveOmni-8B.
- 2025.10.15: 👋 We release the technical report of InteractiveOmni.
InteractiveOmni is a unified omni-modal model that simultaneously accepts image, audio, text, and video inputs and directly generates coherent text and speech streams, achieving truly integrated interaction.
Schematic diagram of multi-turn audio-visual interaction.
- Strong Performance Across Modalities: Exhibits omni-modal understanding and speech generation capabilities, outperforming similarly sized vision-language, audio-language, and omni-modal models.
- State-of-the-Art Performance: Achieves state-of-the-art results on various open-source benchmarks for image, audio, and video understanding, as well as speech conversation.
- Excellent Interactive Performance: Delivers a more intelligent audio-visual experience through multi-turn and long-term memory capabilities.
- Multi-turn Interactive Benchmarks: Proposes multi-modal, multi-turn benchmarks to evaluate the multi-turn memory and speech interaction capabilities of leading MLLMs.
- On-device Model: The 4B model achieves 97% of the 8B model's performance with just 50% of its size.
git clone https://github.com/SenseTime-FVG/InteractiveOmni.git
cd InteractiveOmni
pip install -r requirements.txt
We provide example code below to run InteractiveOmni using 🤗 Transformers.
Please use transformers>=4.51.0 and FlashAttention2 to ensure the model works as expected.
import torch
from transformers import AutoTokenizer, AutoModel
path = "sensefvg/InteractiveOmni-8B"
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
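# Optional (not in the original README): explicitly request FlashAttention2 via the
# standard `attn_implementation` kwarg of `from_pretrained`; whether the model's
# remote code honors it is an assumption.
# attn_implementation="flash_attention_2",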
trust_remote_code=True).eval().cuda()
import torch
from transformers import AutoModel, AutoTokenizer
import torchaudio
path = "sensefvg/InteractiveOmni-8B"
model = AutoModel.from_pretrained(
path,
torch_dtype=torch.bfloat16,
trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=True)
# set the max number of tiles in `max_num`
max_num = 12
# set the number of sampled video frames in `frame` (used for video inputs)
frame = 8
generation_config = dict(max_new_tokens=1024, do_sample=True)
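# do_sample=True enables stochastic decoding; set do_sample=False for deterministic (greedy) text output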
# pure-text conversation
messages = [
{
'role': "user",
'content': 'Hello, who are you?',
}
]
response = model.chat(tokenizer, generation_config, messages)
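# multi-turn text conversation -- a hedged sketch, not from the original README:
# it assumes model.chat accepts the accumulated role/content history, with the
# previous reply appended as an 'assistant' turn.
messages = [
    {
        'role': "user",
        'content': 'Hello, who are you?',
    },
    {
        'role': "assistant",
        'content': response,
    },
    {
        'role': "user",
        'content': 'Please summarize your previous answer in one sentence.',
    }
]
response = model.chat(tokenizer, generation_config, messages)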
# audio conversation
messages = [
{
'role': "user",
'content': [
{
"type": "audio",
"audio": "assets/hello_en.wav"
}
]
}
]
response = model.chat(tokenizer, generation_config, messages)
## Generate both audio and text output
messages = [
{
'role': "user",
'content': [
{
"type": "audio",
"audio": "assets/hello_zh.wav"
}
]
}
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result.wav", wav_response.cpu(), 24000, format="wav")
# image-text conversation
messages = [
{
'role': "user",
'content': [
{
"type": "image",
"image": 'assets/cat_cup.jpeg'
},
{
"type": "text",
"text": "Please describe the image shortly."
}
]
}
]
response = model.chat(tokenizer, generation_config, messages, max_num)
# image-audio conversation
messages = [
{
'role': "user",
'content': [
{
"type": "image",
"image": 'assets/cat_cup.jpeg'
},
{
"type": "audio",
"audio": "assets/describe_img_en.wav"
}
]
}
]
response = model.chat(tokenizer, generation_config, messages, max_num)
## image-audio conversation, generate both audio and text output
messages = [
{
'role': "user",
'content': [
{
"type": "image",
"image": 'assets/cat_cup.jpeg'
},
{
"type": "audio",
"audio": "assets/describe_img_en.wav"
}
]
}
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result.wav", wav_response.cpu(), 24000, format="wav")
# video conversation
messages = [
{
'role': "user",
'content': [
{
"type": "video",
"video": 'video_path'
},
{
"type": "text",
"text": "Describe this video in detail."
}
]
}
]
response = model.chat(tokenizer, generation_config, messages, max_num, frame)
- If audio output is needed, the system prompt must be set as follows; otherwise, the audio output may not work as expected.
You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech.
messages = [
{
"role": "system",
"content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
},
{
'role': "user",
'content': [
{
"type": "audio",
"audio": "assets/hello_zh.wav",
}
]
}
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result_none_speaker.wav", wav_response.cpu(), 24000, format="wav")- Use default speaker to generate output audio.
messages = [
{
"role": "system",
"content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
},
{
'role': "user",
'content': [
{
"type": "audio",
"audio": "assets/hello_zh.wav",
}
]
}
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True, speaker_embedding=model.default_speaker_embedding)
torchaudio.save("result_default_speaker.wav", wav_response.cpu(), 24000, format="wav")- Use custom speaker to generate output audio, similar to sound cloning.
messages = [
{
"role": "system",
"content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
},
{
'role': "user",
'content': [
{
"type": "audio",
"audio": "assets/hello_zh.wav",
}
]
}
]
speaker_embedding = model.extract_speaker_embedding("assets/hello_zh.wav")
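# Optional (not in the original README): cache the extracted speaker embedding so the
# reference clip only needs to be processed once; assumes it is an ordinary torch tensor.
torch.save(speaker_embedding, "speaker_embedding.pt")
# speaker_embedding = torch.load("speaker_embedding.pt")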
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True, speaker_embedding=speaker_embedding)
torchaudio.save("result_custom_speaker.wav", wav_response.cpu(), 24000, format="wav")InteractiveOmni achieves state-of-the-art performance across a wide range of multi-modal understanding and speech generation benchmarks.
Image Understanding
| Model | MMBench | MMStar | MMMU | MathVista | HallusionBench | AI2D | OCRBench | Avg |
|---|---|---|---|---|---|---|---|---|
| Vision-Language Model | | | | | | | | |
| InternVL3-8B | 82.1 | 68.7 | 62.2 | 70.5 | 49.0 | 85.1 | 88.4 | 72.3 |
| InternVL3.5-8B | 79.5 | 69.3 | 73.4 | 78.4 | 54.5 | 84.0 | 84.0 | 74.7 |
| Qwen2.5-VL-7B | 82.2 | 64.1 | 58.0 | 68.1 | 51.9 | 84.3 | 88.8 | 71.1 |
| Omni Model | | | | | | | | |
| GPT-4o-mini | 76.0 | 54.8 | 60.0 | 52.5 | 46.1 | 77.8 | 78.5 | 63.7 |
| VITA-1.5 | 76.8 | 60.2 | 52.6 | 66.2 | 44.6 | 79.2 | 74.1 | 64.8 |
| Ming-Lite-Omni | 80.8 | 64.7 | 56.3 | 71.6 | 55.0 | 83.1 | 88.4 | 71.4 |
| Qwen2.5-Omni-7B | 81.3 | 64.0 | 59.2 | 67.9 | 47.4 | 83.2 | 83.4 | 69.5 |
| InteractiveOmni-4B | 78.9 | 62.6 | 61.1 | 61.7 | 52.2 | 83.8 | 80.0 | 68.6 |
| InteractiveOmni-8B | 81.4 | 66.8 | 66.9 | 68.0 | 61.3 | 84.3 | 83.7 | 73.2 |
Video Understanding
| Model | Video-MME (wo sub) | Video-MME (w sub) | MLVU (M-Avg) | LongVideoBench (val total) | Avg |
|---|---|---|---|---|---|
| Vision-Language Model | | | | | |
| InternVL3-8B | 66.3 | 68.9 | 71.4 | 58.8 | 66.4 |
| InternVL3.5-8B | 66.0 | 68.6 | 70.2 | 62.1 | 66.7 |
| Qwen2.5-VL-7B | 65.1 | 71.6 | 70.2 | 56.0 | 64.5 |
| Omni Model | | | | | |
| GPT-4o-mini | 64.8 | - | - | - | - |
| Qwen2.5-Omni-7B | 64.3 | 72.4 | - | - | - |
| InteractiveOmni-4B | 63.3 | 69.3 | 68.0 | 57.0 | 64.4 |
| InteractiveOmni-8B | 66.0 | 71.8 | 71.6 | 59.1 | 67.1 |
Audio Understanding
| Dataset | Qwen2-Audio | Step-Audio-Chat | Kimi-Audio | Qwen2.5-Omni-7B | InteractiveOmni-4B | InteractiveOmni-8B |
|---|---|---|---|---|---|---|
| ASR (WER) | | | | | | |
| Wenetspeech test-net | 10.60 | 8.75 | 5.37 | 5.90 | 5.40 | 5.04 |
| Wenetspeech test-meeting | 10.68 | 9.52 | 6.28 | 7.70 | 6.95 | 5.55 |
| LibriSpeech test-clean | 1.60 | 3.19 | 1.28 | 1.80 | 1.73 | 1.64 |
| LibriSpeech test-other | 3.60 | 10.67 | 2.42 | 3.40 | 3.69 | 3.41 |
| Aishell-2 IOS | 4.48 | 3.57 | 2.56 | 2.56 | 2.85 | 2.18 |
| ChildMandarin | 14.62 | - | - | 19.34 | 17.21 | 14.03 |
| Audio Understanding | | | | | | |
| MMAU | 56.60 | - | 65.20 | 65.60 | 72.00 | 67.39 |
| MELD | 55.30 | 33.54 | 59.13 | 57.00 | 57.16 | 57.55 |
| ClothoAQA dev | 72.63 | 44.98 | 73.18 | 73.12 | 71.91 | 72.98 |
| ClothoAQA test | 71.73 | 45.84 | 71.24 | 72.86 | 71.28 | 74.49 |
Omni-modal Understanding
| Model | Speech | Sound Event | Music | Avg |
|---|---|---|---|---|
| OmniBench | | | | |
| MiniCPM-o-2.6 | - | - | - | 40.50 |
| Baichuan-Omni-1.5 | - | - | - | 42.90 |
| Qwen2.5-Omni-7B | 55.25 | 60.00 | 52.83 | 56.13 |
| InteractiveOmni-4B | 60.70 | 61.51 | 42.45 | 59.19 |
| InteractiveOmni-8B | 60.18 | 62.64 | 55.66 | 60.33 |
Speech-to-text
OpenAudioBench
| Model | Reasoning QA | Llama Questions | Web Questions | TriviaQA | AlpacaEval | Avg |
|---|---|---|---|---|---|---|
| Qwen2-Audio | 42.77 | 69.67 | 45.20 | 40.30 | 57.19 | 51.03 |
| GLM-4-Voice | 47.43 | 76.00 | 55.40 | 51.80 | 57.89 | 57.70 |
| VITA-1.5 | 41.00 | 74.20 | 57.30 | 46.80 | 68.20 | 57.50 |
| Step-Audio-chat | 60.00 | 72.33 | 73.00 | 56.80 | 56.53 | 63.73 |
| Baichuan-Audio | 41.90 | 78.40 | 64.50 | 61.70 | 77.40 | 64.78 |
| Kimi-Audio | 58.02 | 79.33 | 70.20 | 62.10 | 75.73 | 69.08 |
| MiniCPM-o-2.6 | 38.60 | 77.80 | 68.60 | 61.90 | 51.80 | 59.74 |
| Baichuan-Omni-1.5 | 50.00 | 78.50 | 59.10 | 57.20 | 77.90 | 64.54 |
| Qwen2.5-Omni-7B | 63.76 | 75.33 | 62.80 | 57.06 | 72.76 | 66.34 |
| InteractiveOmni-4B | 69.11 | 79.33 | 65.80 | 56.40 | 74.87 | 69.10 |
| InteractiveOmni-8B | 71.68 | 80.67 | 70.30 | 66.50 | 74.57 | 72.74 |

VoiceBench
| Model | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU |
|---|---|---|---|---|---|
| Qwen2-Audio | 3.69 | 3.40 | 3.01 | 35.35 | 35.43 |
| GLM-4-Voice | 4.06 | 3.48 | 3.18 | 43.31 | 40.11 |
| VITA-1.5 | 4.21 | 3.66 | 3.48 | 38.88 | 52.15 |
| Step-Audio-chat | 3.99 | 2.99 | 2.93 | 46.84 | 28.72 |
| Baichuan-Audio | 4.41 | 4.08 | 3.92 | 45.84 | 53.19 |
| Kimi-Audio | 4.46 | 3.97 | 4.20 | 63.12 | 62.17 |
| MiniCPM-o-2.6 | 4.42 | 4.15 | 3.94 | 50.72 | 54.78 |
| Baichuan-Omni-1.5 | 4.50 | 4.05 | 4.06 | 43.40 | 57.25 |
| Qwen2.5-Omni-7B | 4.50 | 3.84 | 3.89 | 56.40 | 61.32 |
| InteractiveOmni-4B | 4.27 | 4.20 | 3.94 | 41.41 | 63.24 |
| InteractiveOmni-8B | 4.61 | 4.34 | 4.21 | 44.67 | 65.26 |

VoiceBench (cont.)
| Model | OpenBookQA | IFEval | BBH | AdvBench | Avg |
|---|---|---|---|---|---|
| Qwen2-Audio | 49.01 | 54.70 | 22.57 | 98.85 | 55.32 |
| GLM-4-Voice | 52.97 | 52.80 | 24.91 | 88.08 | 57.40 |
| VITA-1.5 | 71.65 | 55.30 | 38.14 | 97.69 | 64.53 |
| Step-Audio-chat | 31.87 | 50.60 | 29.19 | 65.77 | 50.13 |
| Baichuan-Audio | 71.65 | 54.80 | 50.31 | 99.42 | 69.27 |
| Kimi-Audio | 83.52 | 69.70 | 61.10 | 100.0 | 76.91 |
| MiniCPM-o-2.6 | 78.02 | 60.40 | 49.25 | 97.69 | 71.23 |
| Baichuan-Omni-1.5 | 74.51 | 62.70 | 54.54 | 97.31 | 71.32 |
| Qwen2.5-Omni-7B | 80.90 | 66.70 | 53.50 | 99.20 | 73.60 |
| InteractiveOmni-4B | 82.64 | 55.90 | 60.90 | 99.62 | 73.10 |
| InteractiveOmni-8B | 86.37 | 73.30 | 57.99 | 99.42 | 76.69 |
Speech Generation
| Model | test-zh | test-en | test-zh-hard |
|---|---|---|---|
| TTS Model | | | |
| MaskGCT | 2.27 | 2.62 | 10.27 |
| SeedTTS | 1.12 | 2.25 | 7.59 |
| CosyVoice 2 | 1.45 | 2.57 | 6.83 |
| MLLM | | | |
| MinMo | 2.48 | 2.90 | - |
| Ming-Lite-Omni | 1.69 | 4.31 | - |
| Qwen2.5-Omni-7B | 1.70 | 2.72 | 7.97 |
| InteractiveOmni-4B | 1.37 | 3.73 | 8.02 |
| InteractiveOmni-8B | 1.56 | 2.33 | 7.92 |
We would like to thank the following projects and individuals for their contributions to the development of InteractiveOmni:
If you find our paper and code useful in your research, please cite our technical report.
@misc{tong2025interactiveomniunifiedomnimodalmodel,
title={InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue},
author={Wenwen Tong and Hewei Guo and Dongchuan Ran and Jiangnan Chen and Jiefan Lu and Kaibin Wang and Keqiang Li and Xiaoxu Zhu and Jiakui Li and Kehan Li and Xueheng Li and Lumin Li and Chenxu Guo and Jiasheng Zhou and Jiandong Chen and Xianye Wu and Jiahao Wang and Silei Wu and Lei Chen and Hanming Deng and Yuxuan Song and Dinghao Zhou and Guiping Zhong and Ken Zheng and Shiyin Kang and Lewei Lu},
year={2025},
eprint={2510.13747},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2510.13747},
}

