OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation

Size Wu, Zhonghua Wu, Zerui Gong, Qingyi Tao, Sheng Jin, Qinyue Li, Wei Li, Chen Change Loy

Introduction

OpenUni is an open-source implementation of MetaQuery for unifying multimodal understanding and generation. The repository is still under construction. With a minimalist architecture, we demonstrate that OpenUni can 1) generate high-quality, instruction-aligned images and 2) achieve exceptional performance on standard benchmarks such as GenEval, DPG-Bench, and WISE with only 1.1B and 3.1B activated parameters. We currently provide three model variants: OpenUni-B-512, OpenUni-L-512, and OpenUni-L-1024. Checkpoints from both pre-training and fine-tuning are available for each variant.

🔥 Model Zoo

Model Name	Image Size	MLLM	Diffusion Model	Pre-trained	Fine-tuned
OpenUni-B-512	512×512	InternVL3-1B	SANA-0.6B-512px	Link	Link
OpenUni-L-512	512×512	InternVL3-2B	SANA-1.6B-512px	Link	Link
OpenUni-L-1024	1024×1024	InternVL3-2B	SANA1.5-1.6B-1024px	Link	Link

Environment

mmengine
xtuner
transformers
torch
flash_attn
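
Before running anything, it can help to confirm that the packages listed above are visible to the current Python interpreter. The snippet below is a minimal sketch; it assumes each package is importable under the name shown in the list, and the helper name `missing_packages` is hypothetical and not part of OpenUni. It checks presence only, since the list above does not pin versions.

```python
from importlib.util import find_spec

# Import names taken from the Environment list above (no version pins are given there).
REQUIRED_PACKAGES = ["mmengine", "xtuner", "transformers", "torch", "flash_attn"]


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the entries of `names` that cannot be located in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All listed packages are importable.")
```

`find_spec` only locates a package without importing it, so the check stays fast and does not initialize `torch` or CUDA.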

Text-to-Image

Download the released model weights from 🤗wusize/openuni. We recommend the following command, which saves the checkpoints to a local checkpoints directory:

# pip install -U "huggingface_hub[cli]"
huggingface-cli download wusize/openuni --local-dir checkpoints --repo-type model
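
The same download can also be started from Python. The sketch below mirrors the command above using `snapshot_download` from `huggingface_hub`; the wrapper name `download_openuni_checkpoints` is a hypothetical helper for illustration and is not part of the OpenUni codebase.

```python
REPO_ID = "wusize/openuni"  # model repository named in the command above


def download_openuni_checkpoints(local_dir="checkpoints"):
    """Download every file in the model repo into `local_dir` and return the folder path."""
    # Imported lazily so this module can be loaded even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, repo_type="model", local_dir=local_dir)
```

Calling `download_openuni_checkpoints()` from the repository root should produce the same layout as the CLI command.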

OpenUni/
├── checkpoints
│   ├── openuni_b_internvl3_1b_sana_0_6b_512_hf_blip3o60k.pth
│   ├── openuni_b_internvl3_1b_sana_0_6b_512_hf_text2image23m.pth
│   ├── openuni_l_internvl3_2b_sana_1_6b_1024_hf_blip3o60k.pth
│   ├── openuni_l_internvl3_2b_sana_1_6b_1024_hf_text2image23m.pth
│   ├── openuni_l_internvl3_2b_sana_1_6b_512_hf_blip3o60k.pth
│   └── openuni_l_internvl3_2b_sana_1_6b_512_hf_text2image23m.pth
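
To confirm the download finished, one option is to compare the local directory against the file names in the tree above. This is a minimal sketch, assuming the checkpoints sit directly inside `checkpoints/`; the helper name `missing_checkpoints` is hypothetical and not part of OpenUni.

```python
from pathlib import Path

# File names copied from the directory tree above.
EXPECTED_CHECKPOINTS = [
    "openuni_b_internvl3_1b_sana_0_6b_512_hf_blip3o60k.pth",
    "openuni_b_internvl3_1b_sana_0_6b_512_hf_text2image23m.pth",
    "openuni_l_internvl3_2b_sana_1_6b_1024_hf_blip3o60k.pth",
    "openuni_l_internvl3_2b_sana_1_6b_1024_hf_text2image23m.pth",
    "openuni_l_internvl3_2b_sana_1_6b_512_hf_blip3o60k.pth",
    "openuni_l_internvl3_2b_sana_1_6b_512_hf_text2image23m.pth",
]


def missing_checkpoints(checkpoint_dir="checkpoints"):
    """Return the expected checkpoint files that are not present in `checkpoint_dir`."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_CHECKPOINTS if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoint files:")
        for name in missing:
            print("  " + name)
    else:
        print("All expected checkpoints are present.")
```

The check looks at file names only; it does not validate file sizes or contents.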

Inference

Please refer to docs/INFERENCE.md.

Evaluation

Please refer to docs/EVALUATION.md.

Train

To prepare the datasets, refer to docs/DATASETS.md and docs/datasets. Once the datasets are ready, follow the instructions in docs/TRAIN.md to launch the training scripts.

📚 Citation

If you find OpenUni useful for your research or applications, please cite our paper using the following BibTeX:

@article{wu2025openuni,
      title={OpenUni: A Simple Baseline for Unified Multimodal Understanding and Generation}, 
      author={Size Wu and Zhonghua Wu and Zerui Gong and Qingyi Tao and Sheng Jin and Qinyue Li and Wei Li and Chen Change Loy},
      year={2025},
      eprint={2505.23661},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.23661}, 
}

📜 License

This project is licensed under the NTU S-Lab License 1.0.

🙏 Acknowledgement

The project builds upon the following pioneering works:

SANA: We use SANA as our diffusion module for its efficiency and strong performance.
InternVL3: We use the latest InternVL3 as our base multimodal LLM.
MetaQuery: OpenUni is inspired by MetaQuery and is an open-source implementation of that work.
BLIP3-o: We thank the BLIP3-o team for releasing their valuable, high-quality tuning dataset.
