
Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts

Yunxin Li, Shenyuan Jiang, Baotian Hu, Longyue Wang, Wanqi Zhong, Wenhan Luo, Lin Ma, Min Zhang

Uni-MoE-v2 is the latest iteration of our MoE-based unified multimodal model, designed to handle a spectrum of modalities including audio, speech, image, text, and video. This version adds enhanced support for multi-GPU training and inference, significantly accelerating optimization and allowing the model to scale further.

🌟 Structure

The model architecture of Uni-MoE is shown below. Training proceeds in three stages: 1) Use pairs from different modalities and languages to build connectors that map these inputs into a unified language space, establishing a foundation for multimodal understanding; 2) Develop modality-specific experts with cross-modal data to ensure deep understanding, preparing for a cohesive multi-expert model; 3) Incorporate the trained experts into the LLM and refine the unified multimodal model with LoRA on mixed multimodal data.
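
As an illustration of stage 1, the sketch below shows the kind of lightweight connector that projects frozen-encoder features into the LLM's embedding space. The class name, layer sizes, and two-layer MLP shape are illustrative assumptions, not the exact Uni-MoE modules:

```python
# Illustrative stage-1 connector (not the actual Uni-MoE code): maps features
# from a frozen modality encoder (e.g. CLIP or Whisper) into the LLM token space.
import torch.nn as nn

class Connector(nn.Module):
    def __init__(self, enc_dim=1024, llm_dim=4096):
        super().__init__()
        # A small MLP trained on modality-text pairs while encoder and LLM stay frozen.
        self.proj = nn.Sequential(
            nn.Linear(enc_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, enc_feats):        # (batch, seq_len, enc_dim)
        return self.proj(enc_feats)      # (batch, seq_len, llm_dim)
```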

In the V2 edition of our model, we have integrated the DeepSpeed MoE architecture to distribute the experts' weights across multiple GPUs during both training and inference. This design ensures balanced load allocation and better parallelism. Furthermore, we introduce a LoRA-integrated MLP that reduces computational complexity while keeping DeepSpeed MoE's expert-distribution mechanism intact.
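
A minimal sketch of the idea, assuming DeepSpeed's MoE layer and an illustrative LoRA-style expert (LoRAExpertMLP and all dimensions are placeholders, not the released implementation); it is meant to run under the deepspeed launcher so that expert parallelism is initialized:

```python
# Sketch only: a LoRA-augmented expert MLP wrapped in DeepSpeed's MoE layer,
# which shards the experts across GPUs (expert parallelism). Assumes this runs
# under a deepspeed launch so torch.distributed / expert groups are initialized.
import torch.nn as nn
from deepspeed.moe.layer import MoE

class LoRAExpertMLP(nn.Module):
    """A frozen dense MLP plus a low-rank (LoRA) update, used as one expert."""
    def __init__(self, hidden_size=4096, inter_size=11008, rank=16):
        super().__init__()
        self.up = nn.Linear(hidden_size, inter_size, bias=False)
        self.down = nn.Linear(inter_size, hidden_size, bias=False)
        self.act = nn.SiLU()
        # Low-rank adapters: only these small matrices need expert-specific training.
        self.lora_a = nn.Linear(hidden_size, rank, bias=False)
        self.lora_b = nn.Linear(rank, inter_size, bias=False)

    def forward(self, x):
        h = self.act(self.up(x) + self.lora_b(self.lora_a(x)))
        return self.down(h)

# num_experts=8 matches the 8-expert checkpoints; ep_size=2 splits them over
# two GPUs, mirroring the --eval_ep_size 2 flag used in the commands below.
moe_layer = MoE(hidden_size=4096, expert=LoRAExpertMLP(),
                num_experts=8, ep_size=2, k=2)
# out, aux_loss, _ = moe_layer(x)   # x: (batch, seq_len, hidden_size)
```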

⚡️ Install

The following instructions are for installation on Linux. We recommend the environment below.

  • Python == 3.9.16
  • CUDA Version >= 11.7
  1. Clone this repository and navigate to the Uni_MoE_v2 folder
git clone https://github.com/HITsz-TMG/UMOE-Scaling-Unified-Multimodal-LLMs.git
cd UMOE-Scaling-Unified-Multimodal-LLMs/Uni_MoE_v2
  2. Install the required packages
conda create -n unimoe_v2 python==3.9.16
conda activate unimoe_v2
pip install -r env.txt
conda install mpi4py
pip install flash-attn==2.5.6
pip install moviepy
  3. Replace all absolute path placeholders '/path/to/' or '/data/' with your own path to the Uni-MoE directory (a helper sketch follows below)

(This covers all eval_x.py, inference_x.py, train_mem_x.py, data.py, and demo.py files, as well as the config.json files that ship with the model weights.)
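
If you would rather not edit every file by hand, a small helper along these lines can rewrite the '/path/to/' placeholder in bulk. This is a sketch only: REPO_ROOT and the glob patterns are assumptions you should adapt, and '/data/' placeholders are best reviewed manually since that prefix can also appear in real paths:

```python
# Sketch: rewrite the '/path/to/' placeholder in the relevant source/config files.
# Back up the repository (or rely on git) before running.
from pathlib import Path

REPO_ROOT = "/your/absolute/path/to/Uni_MoE_v2"   # adjust to your checkout
PATTERNS = ["**/eval*.py", "**/inference*.py", "**/train_mem*.py",
            "**/data.py", "**/demo.py", "**/config.json"]

for pattern in PATTERNS:
    for f in Path(".").glob(pattern):
        text = f.read_text()
        if "/path/to/" in text:
            f.write_text(text.replace("/path/to/", REPO_ROOT.rstrip("/") + "/"))
            print(f"updated {f}")
```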

⚡️ Uni-MoE Weights

To use the new version of our model, all of the weights below should be downloaded.

After downloading them, organize the weights in the 'Uni_MoE/checkpoint' folder as follows:

└── checkpoint
    ├── Uni_MoE_v2_Experts
    ├── Uni-MoE-speech-base
    ├── Uni_MoE_v2_e2
    ├── clip-vit-large-patch14-336
    ├── whisper-small
    └── BEATs_iter3_plus_AS2M.pt
| Model | Checkpoint |
| --- | --- |
| Vision encoder | CLIP ViT-L/14 336px |
| Speech encoder | whisper-small |
| Audio encoder | BEATs_iter3+ (AS2M) |
| Uni-MoE 8-expert base | Uni-MoE-speech-base |
| Uni-MoE 8-expert experts | Uni_MoE_v2_Experts |
| Uni-MoE 8-expert fine-tuned model | Uni_MoE_v2_e2 |
  • Uni_MoE_v2_e2 is trained on the Uni-MoE Speech v2 dataset, which adds LLaVA-665K for better image-text instruction tuning compared with MoE-Task2.
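
A quick way to confirm the layout before running anything (a minimal sketch; the expected entries simply mirror the tree above):

```python
# Sanity-check that all expected weights are present under Uni_MoE/checkpoint.
from pathlib import Path

CKPT = Path("Uni_MoE/checkpoint")  # adjust if your checkpoint root differs
EXPECTED = [
    "Uni_MoE_v2_Experts",
    "Uni-MoE-speech-base",
    "Uni_MoE_v2_e2",
    "clip-vit-large-patch14-336",
    "whisper-small",
    "BEATs_iter3_plus_AS2M.pt",
]

missing = [name for name in EXPECTED if not (CKPT / name).exists()]
print("all weights found" if not missing else f"missing: {missing}")
```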

🗝️ Dataset

Training Data

| DataSet | Type |
| --- | --- |
| LLaVA-Instruct-665K | image (coco-train2017, etc.) |
| LLaVA-Instruct-150K | image (train2014) |
| Video-Instruct-Dataset | video (from YouTube) |
| RACE | speech (TTS) |
| LibriSpeech | speech (long) |

We use TTS techniques to convert long text passages into speech, constructing the long-speech understanding data.
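
For illustration only, the snippet below shows this kind of text-to-speech conversion with an off-the-shelf library (gTTS is a stand-in here; the TTS system actually used to build the dataset is not specified in this README):

```python
# Illustration: turn a long reading passage into a speech clip with a generic
# TTS library. This is not the pipeline used to build the released data.
from gtts import gTTS  # pip install gTTS

passage = "A long reading-comprehension passage, e.g. taken from RACE ..."
gTTS(text=passage, lang="en").save("race_passage_0001.mp3")
```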

Evaluation Data

| DataSet | Input Type |
| --- | --- |
| AOKVQA | Text-Image |
| OKVQA | Text-Image |
| VQAv2 | Text-Image |
| MMBench | Text-Image |
| POPE | Text-Image |
| TextVQA | Text-Image |
| MM-Vet | Text-Image |
| SEEDBench(Image) | Text-Image |
| MMBench-Audio | Text-Image-Speech(Long) |
| English-High-School-Listening | Text-Speech(Long) |
| RACE | Text-Speech(Long) |
| MSVD | Text-Video-Audio |
| Activitynet-QA | Text-Video-Audio |

College Entrance English Examination Listening Part

We build a real speech understanding dataset, English-High-School-Listening, to test practical long-speech recognition capabilities. It comprises 150 questions about long audio segments with an average length of 109 seconds, and 50 questions about short audio segments with an average length of 14 seconds.

🌈 How to infer and deploy your demo

  1. Make sure that all the weights are downloaded and the running environment is set up correctly.
  2. Run the inference script inference_speech.sh with bash inference_speech.sh, or run the commands below directly.
  3. NOTE: the 8-expert models share the same Uni-MoE-speech-base; remember to replace the content of config.json with 8config.json before inference (a sketch of this swap follows the command below).
cd /path/to/Uni_MoE_v2
conda activate unimoe_v2
export MASTER_PORT=10079
export GPUS_PER_NODE=2

deepspeed --num_gpus=2 --num_nodes=1 \
    --master_addr "localhost" --master_port $MASTER_PORT \
    Uni_MoE_speech/inference_new.py \
    --deepspeed ./scripts/zero2.json \
    --model_base path/to/Uni-MoE-speech-base \
    --model_path output/Uni_MoE_v2_e2 \
    --data_path /path/to/eval.json \
    --enable_deepspeed_moe True \
    --data_type vqa \
    --eval_ep_size 2 \
    --mlp_dir path/to/Uni_MoE_v2_Experts \
    --version v1 \
    --vision_tower path/to/clip-vit-large-patch14-336 \
    --audio_tower path/to/whisper-small \
    --output_dir Uni_MoE_speech_output
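
For the config swap mentioned in step 3, something along these lines works (a sketch: it assumes 8config.json sits next to the base weights' config.json; adjust the paths to wherever the files live in your checkout):

```python
# Sketch: back up the base model's config.json and replace it with the
# 8-expert variant before running inference or evaluation.
import shutil
from pathlib import Path

base = Path("checkpoint/Uni-MoE-speech-base")                     # adjust to your path
shutil.copyfile(base / "config.json", base / "config.json.bak")   # keep a backup
shutil.copyfile(base / "8config.json", base / "config.json")
```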

🌈 How to train and evaluate on datasets

Training:

  1. Make sure that all the weights are downloaded and the environment is set correctly, especially for the base model.
  2. Make sure that all the data are downloaded and pre-processed utilizing data_add_tokens_release.py.
  3. Run the training script train_deepspeed_8moe_release1.slurm with bash train_deepspeed_8moe_release1.slurm or sbatch train_deepspeed_8moe_release1.slurm; remember to modify the training set to your own preference.

Evaluation:

  1. Prepare the evaluation set following the format of samples.json.
  2. Run the evaluation script eval_speech.sh with bash eval_speech.sh, or run the commands below directly.
  3. NOTE: the 8-expert models share the same Uni-MoE-speech-base; remember to replace the content of config.json with 8config.json before evaluation.
cd path/to/Uni_MoE_v2
conda activate unimoe_v2
export MASTER_PORT=10079

deepspeed --num_gpus=2 --num_nodes=1 \
    --master_addr "localhost" --master_port $MASTER_PORT \
    Uni_MoE_speech/eval.py \
    --deepspeed ./scripts/zero2.json \
    --model_base checkpoints/Uni-MoE-speech-base \
    --model_path output/Uni_MoE_v2_e2 \
    --data_path path/to/eval.json \
    --enable_deepspeed_moe True \
    --data_type vqa \
    --eval_ep_size 2 \
    --mlp_dir path/to/Uni_MoE_v2_Experts \
    --version v1 \
    --vision_tower checkpoints/clip-vit-large-patch14-336 \
    --audio_tower checkpoints/whisper-small \
    --output_dir Uni_MoE_speech_eval_out.json

We recommend using two 80GB GPUs to run all the experiments.

Citation

If you find Uni-MoE useful for your research and applications, please cite using this BibTeX:

@article{li2024uni,
  title={Uni-MoE: Scaling Unified Multimodal LLMs with Mixture of Experts},
  author={Li, Yunxin and Jiang, Shenyuan and Hu, Baotian and Wang, Longyue and Zhong, Wanqi and Luo, Wenhan and Ma, Lin and Zhang, Min},
  journal={arXiv preprint arXiv:2405.11273},
  year={2024}
}