Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang
Uni-MoE-v2 is the latest iteration of our MoE-based unified multimodal model, designed to handle a range of modalities including audio, speech, image, text, and video. This version adds multi-GPU training and inference, which substantially speeds up optimization and allows the model to be scaled up.
The model architecture of Uni-MoE is shown below. Training proceeds in three stages: 1) use pairs from different modalities and languages to build connectors that map these inputs into a unified language space, establishing a foundation for multimodal understanding; 2) develop modality-specific experts using cross-modal data to ensure deep understanding, preparing for a cohesive multi-expert model; 3) incorporate the multiple trained experts into the LLM and refine the unified multimodal model with the LoRA technique on mixed multimodal data.
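As a rough, non-authoritative illustration of the third stage, the snippet below shows how LoRA adapters can be attached to a causal LLM with the `peft` library; the base model path, target modules, and hyperparameters are placeholders rather than the values used for Uni-MoE.

```python
# Minimal sketch of LoRA-style refinement (stage 3); the base model path and all
# hyperparameters are illustrative placeholders, not Uni-MoE's actual settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("path/to/llm-base")  # placeholder path
lora_cfg = LoraConfig(
    r=16,                                  # LoRA rank (placeholder)
    lora_alpha=32,                         # scaling factor (placeholder)
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (placeholder)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)     # only the LoRA adapter weights are trainable
model.print_trainable_parameters()
```

Because only the low-rank adapter weights are updated, this refinement step stays lightweight compared with full fine-tuning of the unified model.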
In the V2 edition of our model, we have integrated the DeepSpeed MoE architecture to efficiently distribute the experts' weights across multiple GPUs during both training and inference. This design ensures balanced load allocation and stronger parallel processing. Furthermore, we have introduced a novel LoRA-integrated MLP that optimizes the distribution mechanism, reducing computational complexity while keeping the expert-distribution functionality of DeepSpeed MoE intact.
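For intuition, here is a minimal sketch of wrapping an expert MLP in DeepSpeed's MoE layer with expert parallelism; the hidden size, routing top-k, and expert definition are placeholders and do not reproduce Uni-MoE's exact LoRA-integrated experts.

```python
# Minimal sketch of a DeepSpeed MoE layer with expert parallelism; sizes and the
# expert definition are placeholders, not Uni-MoE's actual configuration.
# With ep_size > 1 this is meant to be built inside a job launched by the
# deepspeed runner, so the expert-parallel process groups exist.
import torch.nn as nn
from deepspeed.moe.layer import MoE

hidden_size = 4096  # placeholder

expert_mlp = nn.Sequential(              # stand-in for a (LoRA-integrated) expert MLP
    nn.Linear(hidden_size, 4 * hidden_size),
    nn.GELU(),
    nn.Linear(4 * hidden_size, hidden_size),
)

moe_layer = MoE(
    hidden_size=hidden_size,
    expert=expert_mlp,
    num_experts=8,   # 8-expert configuration
    ep_size=2,       # shard the experts across 2 GPUs
    k=2,             # top-k routing (placeholder)
)
```

With `ep_size=2`, DeepSpeed shards the eight experts across two GPUs, which is the balanced load allocation described above.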
The following installation instructions are for Linux. We recommend the following requirements.
- Python == 3.9.16
- CUDA Version >= 11.7
- Clone this repository and navigate to the Uni_MoE_v2 folder

```bash
git clone https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.git
cd UMOE-Scaling-Unified-Multimodal-LLMs/Uni_MoE_v2
```

- Install packages

```bash
conda create -n unimoe_v2 python==3.9.16
conda activate unimoe_v2
pip install -r env.txt
conda install mpi4py
pip install flash-attn==2.5.6
pip install moviepy
```
- Replace all the absolute pathnames '/path/to/' or '/data/' with your specific path to the Uni-MoE folder (this includes all the eval_x.py / inference_x.py / train_mem_x.py / data.py / demo.py files and the config.json files shipped with the model weights); see the sketch below for one way to do this.
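As a convenience (this helper is not part of the official repository), a small script along these lines can rewrite the '/path/to/' placeholders in one pass; the file extensions and the root path are assumptions to adjust for your setup, and '/data/' placeholders should be handled the same way.

```python
# Hedged helper sketch: replace '/path/to/' placeholders with your local root.
# The extensions and MY_ROOT value are assumptions; review the changes before relying on them.
from pathlib import Path

MY_ROOT = "/your/absolute/path/to/Uni_MoE_v2/"   # replace with your own path

for f in Path(".").rglob("*"):
    if f.is_file() and f.suffix in {".py", ".json", ".sh", ".slurm"}:
        text = f.read_text(errors="ignore")
        if "/path/to/" in text:
            f.write_text(text.replace("/path/to/", MY_ROOT))
```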
To use the new version of the model, all weights should be downloaded.
After downloading them, organize the weights in the 'Uni_MoE/checkpoint' folder as follows:

```
└── checkpoint
    ├── Uni_MoE_v2_Experts
    ├── Uni-MoE-speech-base
    ├── Uni_MoE_v2_e2
    ├── clip-vit-large-patch14-336
    ├── whisper-small
    └── BEATs_iter3_plus_AS2M.pt
```
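The CLIP and Whisper encoders listed in the table below are publicly available on the Hugging Face Hub and can be fetched as sketched here (a convenience sketch, not part of the official scripts); the Uni-MoE checkpoints and the BEATs file should be downloaded from the links provided by the project, so the commented-out repo ID is only a placeholder.

```python
# Convenience sketch for fetching the public encoder weights into the checkpoint
# folder; the Uni-MoE and BEATs weights must be obtained from the project's links.
from huggingface_hub import snapshot_download

snapshot_download("openai/clip-vit-large-patch14-336", local_dir="checkpoint/clip-vit-large-patch14-336")
snapshot_download("openai/whisper-small", local_dir="checkpoint/whisper-small")
# snapshot_download("<uni-moe-repo-id>", local_dir="checkpoint/Uni_MoE_v2_e2")  # placeholder repo ID
```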
| Model | Checkpoint |
|---|---|
| vision encoder | CLIP ViT-L/14 336px |
| speech encoder | whisper small |
| audio encoder | BEATs_iter3+ (AS2M) |
| Uni-MoE 8-expert base | Uni-MoE-speech-base |
| Uni-MoE 8-expert experts | Uni_MoE_v2_Experts |
| Uni-MoE 8-expert fine-tuned model | Uni_MoE_v2_e2 |
- Uni_MoE_v2_e2 is trained on the Uni-MoE-Speech-v2 dataset, which adds LLaVA-665K for better image-text instruction tuning compared with MoE-Task2.
| DataSet | Type |
|---|---|
| LLaVA-Instruct-665K | image (COCO train2017, etc.) |
| LLaVA-Instruct-150K | image (train2014) |
| Video-Instruct-Dataset | video (from YouTube) |
| RACE | speech (TTS) |
| LibriSpeech | speech (long) |
We use TTS techniques to convert long texts into speech in order to construct long-speech understanding data, as sketched below.
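As an illustration of this kind of data construction (not the authors' exact pipeline), a long passage can be converted to audio with an off-the-shelf TTS library such as gTTS; the file names here are placeholders.

```python
# Illustrative sketch only: synthesize a long reading passage into speech and pair it
# with the original QA labels; the real pipeline may use a different TTS system.
from gtts import gTTS

long_passage = open("race_passage.txt", encoding="utf-8").read()  # placeholder file
gTTS(text=long_passage, lang="en").save("race_passage.mp3")       # audio for the QA pairs
```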
| DataSet | Input Type |
|---|---|
| AOKVQA | Text-Image |
| OKVQA | Text-Image |
| VQAv2 | Text-Image |
| MMBench | Text-Image |
| POPE | Text-Image |
| TextVQA | Text-Image |
| MM-Vet | Text-Image |
| SEEDBench(Image) | Text-Image |
| MMBench-Audio | Text-Image-Speech(Long) |
| English-High-School-Listening | Text-Speech(Long) |
| RACE | Text-Speech(Long) |
| MSVD | Text-Video-Audio |
| Activitynet-QA | Text-Video-Audio |
We built a real-world speech understanding dataset, English-High-School-Listening, to test practical long-speech comprehension. It comprises 150 questions about long audio segments (average length 109 seconds) and 50 questions about short audio segments (average length 14 seconds).
Inference:
- Make sure that all the weights are downloaded and the running environment is set up correctly.
- Run the inference script `inference_speech.sh` using `bash inference_speech.sh`, or run the following commands directly.
- NOTE: the 8-expert model shares the same Uni-MoE-speech-base; remember to replace the content of `config.json` with `8config.json` before inference (see the sketch below).
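For example, assuming `8config.json` is provided next to `config.json` inside the Uni-MoE-speech-base checkpoint folder (an assumption; check where your download places it), the swap is a simple file copy:

```python
# Assumption: config.json and 8config.json both sit in the Uni-MoE-speech-base folder.
# Keep a backup of the original config before overwriting it.
import shutil

base = "path/to/Uni-MoE-speech-base"  # placeholder path
shutil.copyfile(f"{base}/config.json", f"{base}/config.json.bak")
shutil.copyfile(f"{base}/8config.json", f"{base}/config.json")
```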
```bash
cd /path/to/Uni_MoE_v2
conda activate unimoe_v2
export MASTER_PORT=10079
export GPUS_PER_NODE=2
deepspeed --num_gpus=2 --num_nodes=1 \
    --master_addr "localhost" --master_port $MASTER_PORT \
    Uni_MoE_speech/inference_new.py \
    --deepspeed ./scripts/zero2.json \
    --model_base path/to/Uni-MoE-speech-base \
    --model_path output/Uni_MoE_v2_e2 \
    --data_path /path/to/eval.json \
    --enable_deepspeed_moe True \
    --data_type vqa \
    --eval_ep_size 2 \
    --mlp_dir path/to/Uni_MoE_v2_Experts \
    --version v1 \
    --vision_tower path/to/clip-vit-large-patch14-336 \
    --audio_tower path/to/whisper-small \
    --output_dir Uni_MoE_speech_output
```
Training:
- Make sure that all the weights are downloaded and the environment is set up correctly, especially for the base model.
- Make sure that all the data are downloaded and pre-processed using `data_add_tokens_release.py`.
- Run the training script `train_deepspeed_8moe_release1.slurm` using `bash train_deepspeed_8moe_release1.slurm` or `sbatch train_deepspeed_8moe_release1.slurm`, and remember to modify the training set to your own preference.
Evaluation:
- Prepare the evaluation set in the same format as `samples.json`.
- Run the evaluation script `eval_speech.sh` using `bash eval_speech.sh`, or run the following commands directly.
- NOTE: the 8-expert model shares the same Uni-MoE-speech-base; remember to replace the content of `config.json` with `8config.json` before evaluation.
```bash
cd path/to/Uni_MoE_v2
conda activate unimoe_v2
export MASTER_PORT=10079
deepspeed --num_gpus=2 --num_nodes=1 \
    --master_addr "localhost" --master_port $MASTER_PORT \
    Uni_MoE_speech/eval.py \
    --deepspeed ./scripts/zero2.json \
    --model_base checkpoints/Uni-MoE-speech-base \
    --model_path output/Uni_MoE_v2_e2 \
    --data_path path/to/eval.json \
    --enable_deepspeed_moe True \
    --data_type vqa \
    --eval_ep_size 2 \
    --mlp_dir path/to/Uni_MoE_v2_Experts \
    --version v1 \
    --vision_tower checkpoints/clip-vit-large-patch14-336 \
    --audio_tower checkpoints/whisper-small \
    --output_dir Uni_MoE_speech_eval_out.json
```
We recommend using 2× 80GB GPUs to run all experiments.
If you find Uni-MoE useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{li2024uni,
  title={Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts},
  author={Li, Yunxin and Jiang, Shenyuan and Hu, Baotian and Wang, Longyue and Zhong, Wanqi and Luo, Wenhan and Ma, Lin and Zhang, Min},
  journal={arXiv preprint arXiv:2405.11273},
  year={2024}
}
```