
OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model

We present OpenS2S, a fully open-source, transparent and end-to-end large speech language model designed to enable empathetic speech interactions.

Model Architecture

[Figure: OpenS2S model architecture]

As shown in the figure, OpenS2S consists of the following main components:

  • Audio Encoder: The Audio Encoder transforms the raw audio signal into a compact, semantically meaningful representation.

  • Instruction-Following LLM: The audio embeddings and text embeddings are concatenated to form interleaved input sequences for the large language model. We select Qwen3-8B-Instruct as the LLM, leveraging its robust text processing capabilities.

  • Streaming Speech Decoder: The speech response is first converted into discrete tokens using a supervised semantic speech tokenizer. Then, an autoregressive text-to-speech language model is used to generate speech tokens conditioned on the hidden states of the LLM, enabling real-time generation.
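To make the data flow concrete, below is a toy, self-contained sketch of how the three components compose at inference time. Every class here is a hypothetical stub standing in for the real modules; it illustrates the composition only, not the actual OpenS2S API.

# A toy sketch of the OpenS2S data flow. All classes are hypothetical
# stubs, not the real modules; only the composition is illustrative.

class AudioEncoder:
    def encode(self, waveform: list[float]) -> list[float]:
        # Real module: raw waveform -> continuous audio embeddings.
        return waveform[:8]  # stub

class InstructionLLM:
    def forward(self, interleaved: list[float]) -> list[float]:
        # Real module (Qwen3-8B-Instruct): interleaved text/audio
        # embeddings -> hidden states for the text reply and TTS condition.
        return interleaved  # stub

class StreamingSpeechDecoder:
    def generate(self, hidden_states: list[float]) -> list[int]:
        # Real module: autoregressive TTS LM emits discrete speech tokens
        # conditioned on LLM hidden states, enabling streaming synthesis.
        return [0, 1]  # stub token ids, cf. <|audio_0|><|audio_1|>

def respond(waveform: list[float], text_embeddings: list[float]) -> list[int]:
    audio_embeddings = AudioEncoder().encode(waveform)
    hidden = InstructionLLM().forward(text_embeddings + audio_embeddings)
    return StreamingSpeechDecoder().generate(hidden)

print(respond([0.0] * 160, [0.5] * 4))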

Example

More examples can be found on the project page.

Usage

Setup

pip install -r requirements.txt

Prepare

  1. Prepare the pretrained OpenS2S checkpoint

  Download the pretrained OpenS2S model from CASIA-LM/OpenS2S.

  2. Prepare the Token2Wav Decoder

  Download the decoder model from THUDM/glm-4-voice-decoder.
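Both checkpoints are hosted on the Hugging Face Hub, so one convenient (optional) way to fetch them is with huggingface_hub; the local_dir paths below are arbitrary examples.

# Optional: fetch both checkpoints from the Hugging Face Hub.
# The local_dir values are arbitrary example paths.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="CASIA-LM/OpenS2S", local_dir="./OpenS2S")
snapshot_download(repo_id="THUDM/glm-4-voice-decoder", local_dir="./glm-4-voice-decoder")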

Inference

  1. Start the controller
python controller.py
  2. Start the model server
python model_worker.py --model-path your_opens2s_path --flow-path your_decoder_path
  3. Launch the web service locally
python web_demo.py --port 8888
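Once all three processes are up, the demo should be reachable in a browser at http://localhost:8888. A minimal programmatic reachability check (a sketch, assuming the default port used above):

# Minimal reachability check for the locally launched web demo.
import urllib.request

with urllib.request.urlopen("http://localhost:8888", timeout=5) as resp:
    print("web demo HTTP status:", resp.status)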

Training

Data Preparation

This code requires input data to be in JSON Lines (jsonl) format. Each line of the file must be a valid JSON object containing exactly one key: messages.

Here is an example of a valid record (pretty-printed across multiple lines for readability; in the actual file, each record must occupy a single line):

{
    "messages": [
        {
            "role": "user", 
            "content": [
                {"text": "continue the following sentence", "audio": "", "speech_units": "", "spk_emb": ""},
                {"text": "", "audio": "/path/to/audio", "speech_units": "", "spk_emb": ""}
            ]
        },
        {
            "role": "assistant", 
            "content": [
                {"text": "hello", "audio": "", "speech_units": "<|audio_0|><|audio_1|>", "spk_emb": ""},
            ]
        }
    ]
}

If you want to construct continuation-writing data from ASR data, please refer to text_generation.py. If you want to convert audio waveforms into speech units, please refer to GLM-4-Voice.
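For reference, the snippet below serializes one record in the schema above to a jsonl file, one JSON object per line; the audio path and speech units are placeholder values.

# Append one training record (one JSON object per line) to a jsonl file.
# The audio path and speech units below are placeholder values.
import json

record = {
    "messages": [
        {"role": "user", "content": [
            {"text": "continue the following sentence", "audio": "", "speech_units": "", "spk_emb": ""},
            {"text": "", "audio": "/path/to/audio", "speech_units": "", "spk_emb": ""},
        ]},
        {"role": "assistant", "content": [
            {"text": "hello", "audio": "", "speech_units": "<|audio_0|><|audio_1|>", "spk_emb": ""},
        ]},
    ]
}

with open("train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")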

Train from scratch

  1. Obtain the Audio Encoder, LLM backbone, and autoregressive TTS LM.

  2. Process the training data offline

export llm_path=/path/to/llm_backbone
export tts_path=/path/to/ar_tts
export audio_path=/path/to/audio_encoder
python src/instruction_dataset.py offline \
    --dataroot /path/to/raw_data_dir \
    --manifest_files "*.jsonl" \
    --llm_path ${llm_path} \
    --tts_path ${tts_path} \
    --save_dir /path/to/processed_data_dir \
    --num_proc 64
  3. Train the model (connecting the different modules)
export data_dir=/path/to/processed_data_dir
export SAVE_ROOT=/path/to/checkpoints

bash scripts/train_from_scratch.sh
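Between steps 2 and 3 you can sanity-check the processed data. The sketch below assumes the offline step writes a Hugging Face datasets dataset to --save_dir (an assumption based on the script's interface); adapt it if the on-disk format differs.

# Assumption: src/instruction_dataset.py saves a Hugging Face `datasets`
# dataset to --save_dir. Adapt if the on-disk format differs.
from datasets import load_from_disk

ds = load_from_disk("/path/to/processed_data_dir")
print(ds)  # features and row counts (or splits, if a DatasetDict)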

Fine-tuning

  1. Obtain pretrained checkpoints

  2. Process the data offline

export omnispeech_path=/path/to/omnispeech

python src/instruction_dataset.py offline \
    --dataroot /path/to/raw_data_dir \
    --manifest_files "*.jsonl" \
    --llm_path ${omnispeech_path} \
    --tts_path ${omnispeech_path}/tts/ \
    --save_dir /path/to/processed_data_dir \
    --num_proc 64
  3. Fine-tune the pretrained model
bash scripts/train_continue.sh

Acknowledgements

We would like to thank all the open-source projects and individuals whose contributions made the development of OpenS2S possible!

License

Citation

If you find our project useful, please star our repo and cite our paper as follows:

@article{wang2025opens2s,
  title={OpenS2S: Advancing Fully Open-Source End-to-End Empathetic Large Speech Language Model},
  author={Wang, Chen and Peng, Tianyu and Yang, Wen and Bai, Yinan and Wang, Guangfu and Lin, Jun and Jia, Lanpeng and Wu, Lingxiang and Wang, Jinqiao and Zong, Chengqing and Zhang, Jiajun},
  journal={arXiv preprint arXiv:2507.05177},
  year={2025}
}
