wusize/OpenUni

OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, Chen Change Loy

Introduction

OpenUni is an open-source implementation of MetaQuery for unifying multimodal understanding and generation; the repository is still under construction. With a minimalist architecture, we demonstrate that OpenUni can 1) generate high-quality, instruction-aligned images and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE, with only 1.1B and 3.1B activated parameters. We currently provide three model variants: OpenUni-B-512, OpenUni-L-512, and OpenUni-L-1024. Checkpoints from both pre-training and fine-tuning are provided for each.

🔥 Model Zoo

| Model Name | Image Size | MLLM Model | Diffusion Model | Pre-trained | Fine-tuned |
|---|---|---|---|---|---|
| OpenUni-B-512 | 512×512 | InternVL3-1B | SANA-0.6B-512px | Link | Link |
| OpenUni-L-512 | 512×512 | InternVL3-2B | SANA-1.6B-512px | Link | Link |
| OpenUni-L-1024 | 1024×1024 | InternVL3-2B | SANA1.5-1.6B-1024px | Link | Link |

Environment

mmengine
xtuner
transformers
torch
flash_attn
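
The list above can be checked before running anything heavier. The following is a minimal sketch, not part of the repository: the helper name `missing_packages` is hypothetical, and it only tests whether each package can be located by the import system, not whether the installed versions are compatible.

```python
import importlib.util

# Import names taken from the Environment list above.
REQUIRED = ("mmengine", "xtuner", "transformers", "torch", "flash_attn")


def missing_packages(names=REQUIRED):
    """Return the subset of `names` that cannot be found on this interpreter.

    Hypothetical helper for illustration; it does not import the packages,
    so it will not catch broken installs or version mismatches.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages were found.")
```

Any package reported as missing can then be installed with `pip` before moving on to the steps below.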

Text-to-Image

Please download our released model weights from 🤗wusize/openuni. We recommend using the following command to download the checkpoints:

# pip install -U "huggingface_hub[cli]"
huggingface-cli download wusize/openuni --local-dir checkpoints --repo-type model

OpenUni/
├── checkpoints
    ├── openuni_b_internvl3_1b_sana_0_6b_512_hf_blip3o60k.pth
    ├── openuni_b_internvl3_1b_sana_0_6b_512_hf_text2image23m.pth
    ├── openuni_l_internvl3_2b_sana_1_6b_1024_hf_blip3o60k.pth
    ├── openuni_l_internvl3_2b_sana_1_6b_1024_hf_text2image23m.pth
    ├── openuni_l_internvl3_2b_sana_1_6b_512_hf_blip3o60k.pth
    ├── openuni_l_internvl3_2b_sana_1_6b_512_hf_text2image23m.pth
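
As a rough sketch of how the layout above might be used from Python, the snippet below maps each model variant and training stage to its checkpoint file and reports which files are still missing. The helper names are hypothetical, and the stage mapping is an assumption inferred from the file names: `text2image23m` is treated as the pre-trained checkpoint and `blip3o60k` as the fine-tuned one. The optional download function uses `huggingface_hub.snapshot_download` as the Python counterpart of the CLI command.

```python
from pathlib import Path

# File names copied from the directory tree above.
# ASSUMPTION: "text2image23m" = pre-trained, "blip3o60k" = fine-tuned.
CHECKPOINTS = {
    ("OpenUni-B-512", "pretrain"): "openuni_b_internvl3_1b_sana_0_6b_512_hf_text2image23m.pth",
    ("OpenUni-B-512", "finetune"): "openuni_b_internvl3_1b_sana_0_6b_512_hf_blip3o60k.pth",
    ("OpenUni-L-512", "pretrain"): "openuni_l_internvl3_2b_sana_1_6b_512_hf_text2image23m.pth",
    ("OpenUni-L-512", "finetune"): "openuni_l_internvl3_2b_sana_1_6b_512_hf_blip3o60k.pth",
    ("OpenUni-L-1024", "pretrain"): "openuni_l_internvl3_2b_sana_1_6b_1024_hf_text2image23m.pth",
    ("OpenUni-L-1024", "finetune"): "openuni_l_internvl3_2b_sana_1_6b_1024_hf_blip3o60k.pth",
}


def checkpoint_path(model, stage, root="checkpoints"):
    """Return the expected local path for a model variant and stage."""
    return Path(root) / CHECKPOINTS[(model, stage)]


def missing_checkpoints(root="checkpoints"):
    """List the checkpoint files that are not present under `root`."""
    return sorted(
        name for name in CHECKPOINTS.values() if not (Path(root) / name).is_file()
    )


def download_all(root="checkpoints"):
    """Download the whole model repo (requires network and huggingface_hub)."""
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id="wusize/openuni", repo_type="model", local_dir=root)


if __name__ == "__main__":
    print(checkpoint_path("OpenUni-L-1024", "finetune"))
    print("Missing:", missing_checkpoints())
```

Calling `missing_checkpoints()` after the download is a quick way to confirm that all six files arrived before following the inference or evaluation docs.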

Inference

Please refer to docs/INFERENCE.md.

Evaluation

Please refer to docs/EVALUATION.md.

Train

Please refer to docs/DATASETS.md and docs/datasets to prepare the datasets. Once the datasets are ready, follow the instructions in docs/TRAIN.md to launch the training scripts.

📚 Citation

If you find OpenUni useful for your research or applications, please cite our paper using the following BibTeX:

@article{wu2025openuni,
      title={OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation}, 
      author={Size Wu and Zhonghua Wu and Zerui Gong and Qingyi Tao and Sheng Jin and Qinyue Li and Wei Li and Chen Change Loy},
      year={2025},
      eprint={2505.23661},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.23661}, 
}

📜 License

This project is licensed under the NTU S-Lab License 1.0.

🙏 Acknowledgement

The project builds upon the following pioneering works:

  • SANA: We use SANA as our diffusion module, considering its efficiency and strong performance.
  • InternVL3: We use the latest InternVL3 as our base multimodal LLM.
  • MetaQuery: OpenUni is inspired by MetaQuery and is an open-source implementation of this work.
  • BLIP3-o: We thank the BLIP3-o team for releasing their valuable high-quality tuning dataset.
