Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang
Uni-MoE-v2 is the latest iteration of our MoE-based unified multimodal model, designed to handle a range of modalities including audio, speech, image, text, and video. This version adds multi-GPU training and inference, which substantially speeds up optimization and allows the model to be scaled up.
The model architecture of Uni-MoE is shown below. Training proceeds in three stages: 1) use pairs from different modalities and languages to build connectors that map these inputs into a unified language space, establishing a foundation for multimodal understanding; 2) develop modality-specific experts using cross-modal data to ensure deep understanding, preparing for a cohesive multi-expert model; 3) incorporate the multiple trained experts into the LLM and refine the unified multimodal model with the LoRA technique on mixed multimodal data.
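As a rough, non-authoritative illustration of the third stage, the snippet below shows how LoRA adapters can be attached to a causal LLM with the `peft` library; the base model path, target modules, and hyperparameters are placeholders rather than the values used for Uni-MoE.

```python
# Minimal sketch of LoRA-style refinement (stage 3); the base model path and all
# hyperparameters are illustrative placeholders, not Uni-MoE's actual settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/llm-base")  # placeholder path
lora_cfg = LoraConfig(
    r=16,                                  # LoRA rank (placeholder)
    lora_alpha=32,                         # scaling factor (placeholder)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (placeholder)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)     # only the LoRA adapter weights are trainable
model.print_trainable_parameters()
```

Because only the low-rank adapter weights are updated, this refinement step stays lightweight compared with full fine-tuning of the unified model.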
In the V2 edition of our model, we have integrated the DeepSpeed MoE architecture to efficiently distribute the experts' weights across multiple GPUs during both training and inference. This design ensures balanced load allocation and stronger parallel processing. Furthermore, we have introduced a novel LoRA-integrated MLP that optimizes the distribution mechanism, reducing computational complexity while keeping the expert-distribution functionality of DeepSpeed MoE intact.
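For intuition, here is a minimal sketch of wrapping an expert MLP in DeepSpeed's MoE layer with expert parallelism; the hidden size, routing top-k, and expert definition are placeholders and do not reproduce Uni-MoE's exact LoRA-integrated experts.

```python
# Minimal sketch of a DeepSpeed MoE layer with expert parallelism; sizes and the
# expert definition are placeholders, not Uni-MoE's actual configuration.
# With ep_size > 1 this is meant to be built inside a job launched by the
# deepspeed runner, so the expert-parallel process groups exist.
import torch.nn as nn
from deepspeed.moe.layer import MoE

hidden_size = 4096  # placeholder

expert_mlp = nn.Sequential(              # stand-in for a (LoRA-integrated) expert MLP
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.GELU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

moe_layer = MoE(
    hidden_size=hidden_size,
    expert=expert_mlp,
    num_experts=8,   # 8-expert configuration
    ep_size=2,       # shard the experts across 2 GPUs
    k=2,             # top-k routing (placeholder)
)
```

With `ep_size=2`, DeepSpeed shards the eight experts across two GPUs, which is the balanced load allocation described above.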
The following installation instructions are for Linux. We recommend the following requirements.
- Python == 3.9.16
- CUDA Version >= 11.7
- Clone this repository and navigate to the Uni_MoE_v2 folder

```bash
git clone https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.git
cd UMOE-Scaling-Unified-Multimodal-LLMs/Uni_MoE_v2
```

- Install packages

```bash
conda create -n unimoe_v2 python==3.9.16
conda activate unimoe_v2
pip install -r env.txt
conda install mpi4py
pip install flash-attn==2.5.6
pip install moviepy
```
- Replace all the absolute pathnames '/path/to/' or '/data/' with your specific path to the Uni-MoE folder (this includes all the eval_x.py / inference_x.py / train_mem_x.py / data.py / demo.py files and the config.json files shipped with the model weights); see the sketch below for one way to do this.
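As a convenience (this helper is not part of the official repository), a small script along these lines can rewrite the '/path/to/' placeholders in one pass; the file extensions and the root path are assumptions to adjust for your setup, and '/data/' placeholders should be handled the same way.

```python
# Hedged helper sketch: replace '/path/to/' placeholders with your local root.
# The extensions and MY_ROOT value are assumptions; review the changes before relying on them.
from pathlib import Path

MY_ROOT = "/your/absolute/path/to/Uni_MoE_v2/"   # replace with your own path

for f in Path(".").rglob("*"):
    if f.is_file() and f.suffix in {".py", ".json", ".sh", ".slurm"}:
        text = f.read_text(errors="ignore")
        if "/path/to/" in text:
            f.write_text(text.replace("/path/to/", MY_ROOT))
```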
To use the new version of the model, all weights should be downloaded.
After downloading them, organize the weights in the 'Uni_MoE/checkpoint' folder as follows:

```
└── checkpoint
    ├── Uni_MoE_v2_Experts
    ├── Uni-MoE-speech-base
    ├── Uni_MoE_v2_e2
    ├── clip-vit-large-patch14-336
    ├── whisper-small
    └── BEATs_iter3_plus_AS2M.pt
```
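The CLIP and Whisper encoders listed in the table below are publicly available on the Hugging Face Hub and can be fetched as sketched here (a convenience sketch, not part of the official scripts); the Uni-MoE checkpoints and the BEATs file should be downloaded from the links provided by the project, so the commented-out repo ID is only a placeholder.

```python
# Convenience sketch for fetching the public encoder weights into the checkpoint
# folder; the Uni-MoE and BEATs weights must be obtained from the project's links.
from huggingface_hub import snapshot_download

snapshot_download("openai/clip-vit-large-patch14-336", local_dir="checkpoint/clip-vit-large-patch14-336")
snapshot_download("openai/whisper-small", local_dir="checkpoint/whisper-small")
# snapshot_download("<uni-moe-repo-id>", local_dir="checkpoint/Uni_MoE_v2_e2")  # placeholder repo ID
```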
| Model | Checkpoint |
|---|---|
| vision encoder | CLIP ViT-L/14 336px |
| speech encoder | whisper small |
| audio encoder | BEATs_iter3+ (AS2M) |
| Uni-MoE 8-expert base | Uni-MoE-speech-base |
| Uni-MoE 8-expert experts | Uni_MoE_v2_Experts |
| Uni-MoE 8-expert fine-tuned model | Uni_MoE_v2_e2 |
- Uni_MoE_v2_e2 is trained on the Uni-MoE-Speech-v2 dataset, which adds LLaVA-665K for better image-text instruction tuning compared with MoE-Task2.
| DataSet | Type |
|---|---|
| LLaVA-Instruct-665K | image (COCO train2017, etc.) |
| LLaVA-Instruct-150K | image (train2014) |
| Video-Instruct-Dataset | video (from YouTube) |
| RACE | speech (TTS) |
| LibriSpeech | speech (long) |
We use TTS techniques to convert long texts into speech in order to construct long-speech understanding data, as sketched below.
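As an illustration of this kind of data construction (not the authors' exact pipeline), a long passage can be converted to audio with an off-the-shelf TTS library such as gTTS; the file names here are placeholders.

```python
# Illustrative sketch only: synthesize a long reading passage into speech and pair it
# with the original QA labels; the real pipeline may use a different TTS system.
from gtts import gTTS

long_passage = open("race_passage.txt", encoding="utf-8").read()  # placeholder file
gTTS(text=long_passage, lang="en").save("race_passage.mp3")       # audio for the QA pairs
```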
| DataSet | Input Type |
|---|---|
| AOKVQA | Text-Image |
| OKVQA | Text-Image |
| VQAv2 | Text-Image |
| MMBench | Text-Image |
| POPE | Text-Image |
| TextVQA | Text-Image |
| MM-Vet | Text-Image |
| SEEDBench(Image) | Text-Image |
| MMBench-Audio | Text-Image-Speech(Long) |
| English-High-School-Listening | Text-Speech(Long) |
| RACE | Text-Speech(Long) |
| MSVD | Text-Video-Audio |
| Activitynet-QA | Text-Video-Audio |
We built a real-world speech understanding dataset, English-High-School-Listening, to test practical long-speech comprehension. It comprises 150 questions about long audio segments (average length 109 seconds) and 50 questions about short audio segments (average length 14 seconds).
Inference:
- Make sure that all the weights are downloaded and the running environment is set up correctly.
- Run the inference script `inference_speech.sh` using `bash inference_speech.sh`, or run the following commands directly.
- NOTE: the 8-expert model shares the same Uni-MoE-speech-base; remember to replace the content of `config.json` with `8config.json` before inference (see the sketch below).
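For example, assuming `8config.json` is provided next to `config.json` inside the Uni-MoE-speech-base checkpoint folder (an assumption; check where your download places it), the swap is a simple file copy:

```python
# Assumption: config.json and 8config.json both sit in the Uni-MoE-speech-base folder.
# Keep a backup of the original config before overwriting it.
import shutil

base = "path/to/Uni-MoE-speech-base"  # placeholder path
shutil.copyfile(f"{base}/config.json", f"{base}/config.json.bak")
shutil.copyfile(f"{base}/8config.json", f"{base}/config.json")
```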
```bash
cd /path/to/Uni_MoE_v2
conda activate unimoe_v2
export MASTER_PORT=10079
export GPUS_PER_NODE=2
deepspeed --num_gpus=2 --num_nodes=1 \
    --master_addr "localhost" --master_port $MASTER_PORT \
    Uni_MoE_speech/inference_new.py \
    --deepspeed ./scripts/zero2.json \
    --model_base path/to/Uni-MoE-speech-base \
    --model_path output/Uni_MoE_v2_e2 \
    --data_path /path/to/eval.json \
    --enable_deepspeed_moe True \
    --data_type vqa \
    --eval_ep_size 2 \
    --mlp_dir path/to/Uni_MoE_v2_Experts \
    --version v1 \
    --vision_tower path/to/clip-vit-large-patch14-336 \
    --audio_tower path/to/whisper-small \
    --output_dir Uni_MoE_speech_output
```
Training:
- Make sure that all the weights are downloaded and the environment is set up correctly, especially for the base model.
- Make sure that all the data are downloaded and pre-processed using `data_add_tokens_release.py`.
- Run the training script `train_deepspeed_8moe_release1.slurm` using `bash train_deepspeed_8moe_release1.slurm` or `sbatch train_deepspeed_8moe_release1.slurm`, and remember to modify the training set to your own preference.
Evaluation:
- Prepare the evaluation set in the same format as `samples.json`.
- Run the evaluation script `eval_speech.sh` using `bash eval_speech.sh`, or run the following commands directly.
- NOTE: the 8-expert model shares the same Uni-MoE-speech-base; remember to replace the content of `config.json` with `8config.json` before evaluation.
```bash
cd path/to/Uni_MoE_v2
conda activate unimoe_v2
export MASTER_PORT=10079
deepspeed --num_gpus=2 --num_nodes=1 \
    --master_addr "localhost" --master_port $MASTER_PORT \
    Uni_MoE_speech/eval.py \
    --deepspeed ./scripts/zero2.json \
    --model_base checkpoints/Uni-MoE-speech-base \
    --model_path output/Uni_MoE_v2_e2 \
    --data_path path/to/eval.json \
    --enable_deepspeed_moe True \
    --data_type vqa \
    --eval_ep_size 2 \
    --mlp_dir path/to/Uni_MoE_v2_Experts \
    --version v1 \
    --vision_tower checkpoints/clip-vit-large-patch14-336 \
    --audio_tower checkpoints/whisper-small \
    --output_dir Uni_MoE_speech_eval_out.json
```
We recommend using 2× 80GB GPUs to run all experiments.
If you find Uni-MoE useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{li2024uni,
  title={Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts},
  author={Li, Yunxin and Jiang, Shenyuan and Hu, Baotian and Wang, Longyue and Zhong, Wanqi and Luo, Wenhan and Ma, Lin and Zhang, Min},
  journal={arXiv preprint arXiv:2405.11273},
  year={2024}
}
```