This repository is the official implementation of SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models. We introduce SoundMind, a novel rule-based reinforcement learning framework that empowers large-scale audio-language models with advanced logical reasoning capabilities across both audio and textual modalities. To enable such training, we build the Audio Logical Reasoning (ALR) dataset, a dual-modality benchmark comprising 6,446 high-quality samples annotated with chain-of-thought reasoning in both audio and text forms.
To download our dataset, please visit this link: Dataset Link
Run the following command:
wget -c "https://www.dropbox.com/scl/fi/irtbrnmk5e0ecvv8fyrum/audio_dataset.zip?rlkey=p1ebkt9h1bkyjsq3fo2bp667v&st=gxr542e2&dl=1" -O audio_dataset.zip
Alternatively, you can download it from Hugging Face.
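Once you have the archive, extract it. The target directory below is an arbitrary example, not a path the code expects:

```bash
# Extract the downloaded dataset archive
# (the -d target directory is an arbitrary example, not a required path)
unzip audio_dataset.zip -d audio_dataset
```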
The dataset contains train, test, and validation splits with corresponding text descriptions and metadata stored as JSON files. All annotation files are located in the `dataset-annotation-json` folder of this repository.
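As a quick sanity check, you can peek at one annotation file. This is a minimal sketch: the file name `train.json` and a top-level JSON list are assumptions, so adjust to the actual files in the folder.

```bash
# Peek at one annotation split (file name and top-level structure are
# assumptions; check dataset-annotation-json for the actual layout)
python - <<'EOF'
import json

with open("dataset-annotation-json/train.json") as f:
    samples = json.load(f)
print(f"loaded {len(samples)} entries")
first = samples[0] if isinstance(samples, list) else next(iter(samples.values()))
print(first)
EOF
```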
Training requires 8× NVIDIA H800 80GB or 8× NVIDIA H100 80GB GPUs.
Our codebase is based on verl. If you are already familiar with verl, you should be able to quickly get started with this repository.
- Python: Version >= 3.9
- CUDA: Version >= 12.1
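You can verify both requirements from the shell; `nvcc` reports the installed CUDA toolkit version:

```bash
# Check the local toolchain against the requirements above
python --version   # expect >= 3.9
nvcc --version     # expect CUDA >= 12.1
```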
The training and inference engines rely on CUDA/cuDNN and related dependencies for full hardware acceleration. Note that some of these dependencies are easily overridden when installing other packages.
We need to install the following prerequisites:
- CUDA: Version >= 12.4
- cuDNN: Version >= 9.8.0
# change directory to anywhere you like; installing inside the verl source code directory is not recommended
wget https://developer.download.nvidia.com/compute/cudnn/9.8.0/local_installers/cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
dpkg -i cudnn-local-repo-ubuntu2204-9.8.0_1.0-1_amd64.deb
cp /var/cudnn-local-repo-ubuntu2204-9.8.0/cudnn-*-keyring.gpg /usr/share/keyrings/
apt-get update
apt-get -y install cudnn-cuda-12
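To confirm the installation succeeded, list the installed cuDNN packages (this assumes the same Debian/Ubuntu setup as the commands above):

```bash
# Confirm cuDNN is installed (Debian/Ubuntu, matching the .deb install above)
dpkg -l | grep -i cudnn
```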
Create and activate a new conda environment:
conda create -n alr python=3.10
conda activate alr
Install verl:
bash scripts/install_vllm_sglang_mcore.sh
pip install --no-deps -e .
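A quick import check confirms the editable install is picked up from the source tree (this assumes the package installs under the `verl` module name, as in upstream verl):

```bash
# Sanity-check the editable verl install
python -c "import verl; print(verl.__file__)"
```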
Please make sure these packages are not overridden when you install anything else; a quick verification command follows the list below. The packages worth checking are:
- torch and torch series
- vLLM
- SGLang
- pyarrow
- tensordict
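The check below greps the installed versions of these packages so you can spot an unexpected downgrade or replacement:

```bash
# Verify that the critical packages were not silently replaced
pip list | grep -i -E "torch|vllm|sglang|pyarrow|tensordict"
```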
For Qwen2.5-Omni, we need to update some additional library versions.
pip install transformers==4.52.3
pip install accelerate
pip install "qwen-omni-utils[decord]" -U
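A one-liner confirms the pinned version took effect:

```bash
# Confirm the pinned transformers version is active
python -c "import transformers; print(transformers.__version__)"   # expect 4.52.3
```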
Our project and code rely on the Audio Logical Reasoning (ALR) dataset. Preprocess it with one of the following options:
- Option 1: Both audio and text inputs are used
cd ./examples/data_preprocess
python alr.py
- Option 2: Only text inputs are used
cd ./examples/data_preprocess
python alr_text.py
- Option 3: Only audio inputs are used
cd ./examples/data_preprocess
python alr_audio.py
To download our model checkpoint, please visit this link: Checkpoint Link
Run the following command:
wget -c "https://www.dropbox.com/scl/fi/f24wyecnycfu6g6ip10ac/qwen2_5_omni_logic.zip?rlkey=xlixctyr8cbfpv85arhka0b8c&st=wd5rlh9b&dl=1" -O qwen2_5_omni_logic.zip
If you don't want to use the pre-trained model we provide, you can use the official version instead. Change the model path in download_qwen25omni.py and main_grpo.sh accordingly.
Run the following command:
python download_qwen25omni.py
bash main_grpo.sh
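If you need to pin the run to specific GPUs, the standard `CUDA_VISIBLE_DEVICES` variable works with the launch script; the example below assumes the full 8-GPU setup listed in the hardware requirements:

```bash
# Pin training to all 8 GPUs (CUDA_VISIBLE_DEVICES is a standard CUDA variable)
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 bash main_grpo.sh
```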
If you find this project helpful, please feel free to leave a star ⭐️ and cite our paper:
@article{soundmind,
  title={SoundMind: RL-Incentivized Logic Reasoning for Audio-Language Models},
  author={Diao, Xingjian and Zhang, Chunhui and Kong, Keyi and Wu, Weiyi and Ma, Chiyu and Ouyang, Zhongyu and Qing, Peijun and Vosoughi, Soroush and Gui, Jiang},
  journal={arXiv preprint arXiv:2506.12935},
  year={2025}
}