
PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation


Rongyao Fang1*, Chengqi Duan2*, Kun Wang3, Hao Li1,4, Hao Tian3, Xingyu Zeng3, Rui Zhao3, Jifeng Dai4,5, Hongsheng Li1 ✉️, Xihui Liu2 ✉️

1CUHK MMLab, 2HKU MMLab, 3SenseTime, 4Shanghai AI Laboratory, 5Tsinghua University

*Equal contribution, ✉️Corresponding authors

🔥 We will release the remaining code and models soon!


🆕Update

  • 2025.01.16: Released the decoder checkpoints and inference code 🔥.
  • 2024.10.18: PUMA preprint is released on ArXiv 🔥.
  • 2024.10.17: PUMA homepage is now available 🔥.

⌛ TODO

  • Update links to project page 🔗
  • Release visual encoder and decoder checkpoints 💻
  • Release MLLM backbone checkpoint 💾

Environment Setup

conda create -n puma python=3.8
conda activate puma
pip install -r requirements.txt

Checkpoint Download

# First, replace <token> in download_ckpt.py with your Hugging Face token
python download_ckpt.py

For manual downloads, please download the checkpoints from here and place them under ./ckpts.
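
If you prefer to script the download yourself, below is a minimal sketch of what such a script might look like, assuming the checkpoints are hosted on the Hugging Face Hub. The repo id is an illustrative placeholder, not the repository's actual value:

# Hypothetical download sketch using huggingface_hub; repo_id is a placeholder.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<puma-checkpoints>",  # placeholder; substitute the actual repo id
    local_dir="./ckpts",                 # matches the expected ./ckpts layout
    token="<token>",                     # your Hugging Face access token
)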

Multi-granular Visual Decoding

python infer_detokenizer.py --num_tokens <chosen number from [1, 4, 16, 64, 256]>
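
For example, to decode at the 64-token granularity (smaller token counts favor diverse, semantics-guided generation; larger counts favor faithful reconstruction, as described in the decoding section below):

python infer_detokenizer.py --num_tokens 64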

Abstract

PUMA introduces a unified multimodal large language model framework designed to integrate multi-granular visual generation and understanding. Our model excels in a variety of visual tasks, including diverse text-to-image generation, precise image editing, conditional image generation, and visual understanding. It strikes a balance between generation diversity and controllability, making it a versatile tool for visual tasks.

Read the full paper here.

Framework

  • PUMA leverages multi-granular visual representations as unified inputs and outputs for MLLM, allowing it to handle a variety of visual tasks, including text-to-image generation, image editing, inpainting, colorization, conditional generation, and image understanding.
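
To make the "unified inputs and outputs" idea concrete, here is a minimal, hypothetical sketch of how visual tokens could be interleaved with text embeddings in a single autoregressive sequence. The special-token names, projection, and dimensions are assumptions for illustration, not PUMA's actual interface:

import torch

# Hypothetical layout: text tokens and visual tokens share one sequence,
# with special embeddings marking the image span. All names are illustrative.
text_emb = torch.randn(1, 12, 4096)   # 12 text tokens, hidden size 4096
img_feat = torch.randn(1, 64, 1024)   # one granularity: 64 visual tokens
proj = torch.nn.Linear(1024, 4096)    # map visual features into the LLM space

boi = torch.randn(1, 1, 4096)         # <begin_of_image> embedding (assumed)
eoi = torch.randn(1, 1, 4096)         # <end_of_image> embedding (assumed)
sequence = torch.cat([text_emb, boi, proj(img_feat), eoi], dim=1)
print(sequence.shape)                 # (1, 78, 4096)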

Multi-granular Semantic Visual Decoding

  • PUMA's visual decoding process spans five granular image representations (f0 to f4) and corresponding decoders (D0 to D4), which are trained using SDXL. This allows PUMA to achieve precise image reconstruction and semantic-guided generation, supporting both control and diversity in image generation tasks.
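
As a rough illustration of the five granularities, the sketch below progressively average-pools a base 16x16 feature grid (256 tokens) down to a single token, yielding token counts of 256, 64, 16, 4, and 1. This is an assumed mechanism for illustration only, not the paper's exact encoder:

import torch
import torch.nn.functional as F

def multi_granular_features(feat_grid: torch.Tensor):
    """Pool a base feature grid into five granularities.

    feat_grid: (B, C, 16, 16) image features, i.e. 256 spatial tokens.
    Returns [f4, f3, f2, f1, f0] with 256, 64, 16, 4, 1 tokens each,
    flattened to (B, N, C).
    """
    feats = []
    for size in (16, 8, 4, 2, 1):                        # 256 ... 1 tokens
        pooled = F.adaptive_avg_pool2d(feat_grid, size)  # (B, C, s, s)
        feats.append(pooled.flatten(2).transpose(1, 2))  # (B, s*s, C)
    return feats

# Example: a batch of 2 images with 1024-dim features on a 16x16 grid
tokens = multi_granular_features(torch.randn(2, 1024, 16, 16))
print([t.shape for t in tokens])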

Diverse Text-to-image Generation

Image Editing

Image Conditional Generation

Citation

If you find PUMA useful in your research, please consider citing us:

@article{fang2024puma,
  title   = {PUMA: Empowering Unified MLLM with Multi-Granular Visual Generation},
  author  = {Fang, Rongyao and Duan, Chengqi and Wang, Kun and Li, Hao and Tian, Hao and Zeng, Xingyu and Zhao, Rui and Dai, Jifeng and Li, Hongsheng and Liu, Xihui},
  journal = {arXiv preprint},
  year    = {2024}
}

License

This project is released under the Apache 2.0 license.

Contact

If you have any questions, please feel free to contact rongyaofang@gmail.com.
Rongyao Fang anticipates graduating in 2025 and is open to both academic and industrial research positions. If you are interested, please feel free to reach out.
