The recently developed discrete diffusion models perform extraordinarily well on the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions on a variety of generation tasks.
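To make the "unified transition matrix" idea concrete, here is a minimal sketch of one forward-corruption step over a joint vocabulary. Everything in this snippet (the function name, the schedule values, and the exact block structure) is an illustrative assumption, not the UniD3 implementation: the text and image codebooks are concatenated into one discrete state space with a shared [MASK] absorbing state, and a single matrix governs how tokens of both modalities are corrupted.

```python
# Illustrative sketch only: a joint transition matrix for one discrete-diffusion
# step over a concatenated text + image token vocabulary with a shared [MASK]
# absorbing state. The names and schedule values are assumptions for exposition.
import torch

def unified_transition_matrix(n_text, n_image, alpha_t, gamma_t):
    """Q_t[i, j] = probability that token state i moves to state j at step t.

    alpha_t : probability of keeping the current token
    gamma_t : probability of jumping to the shared [MASK] state
    The leftover mass (1 - alpha_t - gamma_t) is spread uniformly over the
    codebook of the *same* modality, so text ids never turn into image ids.
    """
    n = n_text + n_image + 1                  # +1 for the shared [MASK] token
    mask_id = n - 1
    beta_t = 1.0 - alpha_t - gamma_t

    Q = torch.zeros(n, n)
    # Text block: stay, re-sample uniformly within the text vocabulary, or mask.
    Q[:n_text, :n_text] = beta_t / n_text
    Q[:n_text, :n_text] += alpha_t * torch.eye(n_text)
    Q[:n_text, mask_id] = gamma_t
    # Image block: identical structure over the image codebook.
    img = slice(n_text, n_text + n_image)
    Q[img, img] = beta_t / n_image
    Q[img, img] += alpha_t * torch.eye(n_image)
    Q[img, mask_id] = gamma_t
    # [MASK] is absorbing: once masked, always masked.
    Q[mask_id, mask_id] = 1.0
    return Q

# Toy codebook sizes so the matrix is easy to inspect; every row sums to one.
Q = unified_transition_matrix(n_text=8, n_image=16, alpha_t=0.9, gamma_t=0.05)
assert torch.allclose(Q.sum(dim=-1), torch.ones(Q.shape[0]))
```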
The code is compatible with Python 3.8 and PyTorch 1.9.
You can create an Anaconda environment called unid3 with the required dependencies by running:
git clone https://github.com/mhh0318/UniD3.git
cd UniD3
conda create -n unid3 python=3.8
conda activate unid3
pip install -r requirements.txt
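After installing, you can optionally verify that the interpreter and PyTorch match the supported versions. This check is just a convenience and is not part of the repository:

```python
# Convenience check that the environment matches the supported versions.
import sys
import torch

print("python:", sys.version.split()[0])    # expected 3.8.x
print("torch :", torch.__version__)         # expected 1.9.x
print("cuda  :", torch.cuda.is_available())
```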
Download the pretrained models from here, and save them to pretrained_models/.
Download the released Gumbel VQGAN model trained on OpenImages and put it under ./misc/taming_dvae/.
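A small sanity check that the checkpoints are where the sampling script expects them. The directory names follow the two steps above; nothing else about this snippet is taken from the repository:

```python
# Check that the downloaded checkpoints are in the expected directories.
from pathlib import Path

expected_dirs = [
    Path("pretrained_models"),   # UniD3 checkpoints
    Path("misc/taming_dvae"),    # Gumbel VQGAN weights
]
for d in expected_dirs:
    ok = d.is_dir() and any(d.iterdir())
    print(f"{d}: {'ok' if ok else 'missing or empty'}")
```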
For simultaneous vision-language generation, please run:
python ./UniDiff/dist_eval_sample.py --model CKPT_PATH --condition unconditional --log pair_samples
If the environment is set up correctly, this command should run without errors and generate some results in the folder ./pair_samples.
- Our codebase for the diffusion models builds heavily on https://github.com/lucidrains/denoising-diffusion-pytorch, VQ-Diffusion, and Multinomial Diffusion. Thanks for open-sourcing!
- The implementation of the transformer encoder is from x-transformers by lucidrains.
@article{hu2022unified,
title = {Unified Discrete Diffusion for Simultaneous Vision-Language Generation},
author = {Hu, Minghui and Zheng, Chuanxia and Zheng, Heliang and Cham, Tat-Jen and Wang, Chaoyue and Yang, Zuopeng and Tao, Dacheng and Suganthan, Ponnuthurai N},
journal = {arXiv},
year = {2022},
}