We present OpenS2S, a fully open-source, transparent and end-to-end large speech language model designed to enable empathetic speech interactions.
As shown in the figure, OpenS2S consists of the following main components:
- Audio Encoder: The audio encoder is responsible for transforming the raw audio signal into a compact and meaningful representation.
- Instruction-Following LLM: The audio embeddings and text embeddings are concatenated to form interleaved input sequences for the large language model. We select Qwen3-8B-Instruct as the LLM, leveraging its robust text processing capabilities.
- Streaming Speech Decoder: The speech response is first converted into discrete tokens using a supervised semantic speech tokenizer. Then, an autoregressive text-to-speech language model generates speech tokens conditioned on the hidden states of the LLM, enabling real-time generation.
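For a concrete picture of how these pieces connect, the minimal PyTorch sketch below mimics the data flow. It is not OpenS2S code: the dimensions, the generic Transformer used as a stand-in for the LLM backbone, and the single linear head used in place of the autoregressive TTS LM are simplifications for illustration only.

```python
import torch
import torch.nn as nn

# Toy dimensions chosen for illustration; the real model sizes differ.
AUDIO_DIM, HIDDEN_DIM, TEXT_VOCAB, SPEECH_VOCAB = 1280, 512, 32000, 16384

class ToySpeechLM(nn.Module):
    """Sketch of the OpenS2S data flow: project audio features into the LLM
    embedding space, combine them with text embeddings, run the backbone, and
    predict discrete speech tokens from the resulting hidden states."""
    def __init__(self):
        super().__init__()
        self.audio_adapter = nn.Linear(AUDIO_DIM, HIDDEN_DIM)   # audio encoder output -> LLM space
        self.text_embed = nn.Embedding(TEXT_VOCAB, HIDDEN_DIM)  # LLM token embeddings
        self.backbone = nn.TransformerEncoder(                  # stand-in for the LLM backbone
            nn.TransformerEncoderLayer(HIDDEN_DIM, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.speech_head = nn.Linear(HIDDEN_DIM, SPEECH_VOCAB)  # stand-in for the AR TTS LM

    def forward(self, audio_feats, text_ids):
        audio_emb = self.audio_adapter(audio_feats)      # (B, T_audio, H)
        text_emb = self.text_embed(text_ids)             # (B, T_text, H)
        inputs = torch.cat([text_emb, audio_emb], dim=1) # interleaved input sequence
        hidden = self.backbone(inputs)                   # LLM hidden states
        return self.speech_head(hidden)                  # logits over discrete speech tokens

model = ToySpeechLM()
logits = model(torch.randn(1, 50, AUDIO_DIM), torch.randint(0, TEXT_VOCAB, (1, 12)))
print(logits.shape)  # torch.Size([1, 62, 16384])
```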
More examples can be found on the project page.
- Install the dependencies

```bash
pip install -r requirements.txt
```
- Prepare the pretrained OpenS2S checkpoint
Download the pretrained OpenS2S model from CASIA-LM/OpenS2S.
- Prepare the Token2Wav Decoder
Download the decoder model from THUDM/glm-4-voice-decoder.
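Assuming both checkpoints are hosted on the Hugging Face Hub under the repo IDs above, one way to fetch them is with `huggingface_hub` (the local directories below are arbitrary examples):

```python
from huggingface_hub import snapshot_download

# Download the OpenS2S checkpoint and the Token2Wav decoder.
# The local_dir values are just examples; point them wherever you like.
opens2s_path = snapshot_download("CASIA-LM/OpenS2S", local_dir="./OpenS2S")
decoder_path = snapshot_download("THUDM/glm-4-voice-decoder", local_dir="./glm-4-voice-decoder")
print(opens2s_path, decoder_path)
```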
- Start the controller
```bash
python controller.py
```
- Start the model server
```bash
python model_worker.py --model-path your_opens2s_path --flow-path your_decoder_path
```
- Launch the web service locally
```bash
python web_demo.py --port 8888
```

This code requires input data to be in JSON Lines (jsonl) format. Each line of the file must be a valid JSON object containing exactly one key: `messages`.
Here is an example of a valid line in the jsonl file:
```json
{
  "messages": [
    {
      "role": "user",
      "content": [
        {"text": "continue the following sentence", "audio": "", "speech_units": "", "spk_emb": ""},
        {"text": "", "audio": "/path/to/audio", "speech_units": "", "spk_emb": ""}
      ]
    },
    {
      "role": "assistant",
      "content": [
        {"text": "hello", "audio": "", "speech_units": "<|audio_0|><|audio_1|>", "spk_emb": ""}
      ]
    }
  ]
}
```

If you want to construct continuation-writing data from ASR data, please refer to text_generation.py. If you want to convert audio waveforms into speech units, please refer to GLM-4-Voice.
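As a quick sanity check on your own data, a small script along the following lines can verify that each line matches the layout described above. This is only a sketch: the file name `train.jsonl` is a placeholder, and the required keys are taken from the description above.

```python
import json

# Validate that every line of a jsonl file follows the expected layout.
path = "train.jsonl"  # example file name
with open(path, "r", encoding="utf-8") as f:
    for line_no, line in enumerate(f, 1):
        record = json.loads(line)
        assert set(record.keys()) == {"messages"}, f"line {line_no}: expected exactly one key 'messages'"
        for message in record["messages"]:
            assert message["role"] in ("user", "assistant"), f"line {line_no}: unexpected role"
            for item in message["content"]:
                assert set(item.keys()) == {"text", "audio", "speech_units", "spk_emb"}, \
                    f"line {line_no}: unexpected content fields"
print("all lines look valid")
```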
- Obtain the Audio Encoder, LLM backbone, and autoregressive TTS LM.
- Preprocess the training data offline
```bash
# Paths to the pretrained components
export llm_path=/path/to/llm_backbone
export tts_path=/path/to/ar_tts
export audio_path=/path/to/audio_encoder

# Tokenize and pack the raw jsonl data offline
python src/instruction_dataset.py offline \
    --dataroot /path/to/raw_data_dir \
    --manifest_files "*.jsonl" \
    --llm_path ${llm_path} \
    --tts_path ${tts_path} \
    --save_dir /path/to/processed_data_dir \
    --num_proc 64
```
- Train the model (connect the different modules)
```bash
# Processed data and checkpoint output locations
export data_dir=/path/to/processed_data_dir
export SAVE_ROOT=/path/to/checkpoints

bash scripts/train_from_scratch.sh
```
- Obtain the pretrained checkpoints
- Preprocess the data offline
```bash
# Path to the pretrained OpenS2S (omnispeech) checkpoint
export omnispeech_path=/path/to/omnispeech

python src/instruction_dataset.py offline \
    --dataroot /path/to/raw_data_dir \
    --manifest_files "*.jsonl" \
    --llm_path ${omnispeech_path} \
    --tts_path ${omnispeech_path}/tts/ \
    --save_dir /path/to/processed_data_dir \
    --num_proc 64
```
- Fine-tune the pretrained model
```bash
bash scripts/train_continue.sh
```

We would like to thank the following projects and individuals for their contributions to the development of OpenS2S:
Thank you to all the open-source projects for their contributions to this project!
- This project is licensed under the Apache License 2.0.
If you find our project useful, please star our repo and cite our paper as follows:
```bibtex
@article{wang2025opens2s,
  title={OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model},
  author={Wang, Chen and Peng, Tianyu and Yang, Wen and Bai, Yinan and Wang, Guangfu and Lin, Jun and Jia, Lanpeng and Wu, Lingxiang and Wang, Jinqiao and Zong, Chengqing and Zhang, Jiajun},
  journal={arXiv preprint arXiv:2507.05177},
  year={2025}
}
```
