UniD3: Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Abstract

Recently developed discrete diffusion models perform extraordinarily well in the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can perform both "modality translation" and "multi-modality generation" tasks using a single model, covering text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions in various generation tasks.

Setup

Installation Requirements

The code is compatible with Python 3.8 and PyTorch 1.9.

You can create a conda environment called unid3 with the required dependencies by running:

git clone https://github.com/mhh0318/UniD3.git
cd UniD3
conda create -n unid3 python=3.8
conda activate unid3
pip install -r requirements.txt
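
Once the packages are installed, it can help to confirm that the active interpreter matches the versions stated above. The following is a minimal sketch and not part of the repository: the helper name `check_environment` is hypothetical, and it assumes only that PyTorch, if present, is importable as `torch`.

```python
import importlib.util
import sys


def check_environment():
    """Report the Python version and whether PyTorch is importable.

    Hypothetical helper, not part of the UniD3 repository.
    """
    report = {
        "python": "{}.{}".format(sys.version_info.major, sys.version_info.minor),
        "torch": None,  # stays None if PyTorch is not installed
    }
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the check also works without PyTorch

        report["torch"] = torch.__version__
    return report


if __name__ == "__main__":
    info = check_environment()
    print("Python:", info["python"], "(README states 3.8)")
    print("PyTorch:", info["torch"] or "not installed", "(README states 1.9)")
```

Run it inside the activated environment; a mismatch does not necessarily mean the code will fail, only that the setup differs from the one the README was written against.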

Download Pretrained Weights

Download the pretrained models from here, and save them to pretrained_models/.

Download the released GumbelVQGAN model (a VQ-GAN trained on OpenImages) and put it under ./misc/taming_dvae/.
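
A quick way to check that both downloads landed in the right place is to test whether the two directories exist and are non-empty. This is a minimal sketch under the assumption that it is run from the repository root; the helper `missing_weight_dirs` is hypothetical, and it does not know the actual checkpoint file names, so it only looks for the presence of files.

```python
from pathlib import Path

# Directories named in the download instructions above.
EXPECTED_DIRS = ("pretrained_models", "misc/taming_dvae")


def missing_weight_dirs(repo_root="."):
    """Return the expected directories that are absent or empty.

    Hypothetical helper, not part of the UniD3 repository.
    """
    root = Path(repo_root)
    missing = []
    for rel in EXPECTED_DIRS:
        path = root / rel
        # A directory counts as missing if it does not exist or holds no entries.
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(rel)
    return missing


if __name__ == "__main__":
    problems = missing_weight_dirs(".")
    if problems:
        print("Missing or empty:", ", ".join(problems))
    else:
        print("Both weight directories contain files.")
```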

Quick Inference

For simultaneous vision-language generation, run:

python ./UniDiff/dist_eval_sample.py --model CKPT_PATH --condition unconditional --log pair_samples

If the environment is set up correctly, this command generates results in the pair_samples folder.
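
For scripted runs, the same command can be assembled and launched from Python. The sketch below is not part of the repository: `build_sample_command` and `run_sampling` are hypothetical helpers that reuse only the script path and the three flags shown in the command above. It uses `sys.executable` instead of a bare `python` so the script runs under the currently active interpreter, and it assumes the working directory is the repository root.

```python
import subprocess
import sys


def build_sample_command(ckpt_path, log_dir="pair_samples"):
    """Assemble the unconditional sampling command as an argument list.

    Hypothetical helper; the script path and flags are copied from the README.
    """
    return [
        sys.executable,
        "./UniDiff/dist_eval_sample.py",
        "--model", ckpt_path,
        "--condition", "unconditional",
        "--log", log_dir,
    ]


def run_sampling(ckpt_path, log_dir="pair_samples"):
    """Run the sampling script; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_sample_command(ckpt_path, log_dir), check=True)
```

Calling `run_sampling("CKPT_PATH")` with CKPT_PATH replaced by the path to a downloaded checkpoint is then equivalent to the shell command above.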

Comments

BibTeX

@article{hu2022unified,
  title = {Unified Discrete Diffusion for Simultaneous Vision-Language Generation},
  author = {Hu, Minghui and Zheng, Chuanxia and Zheng, Heliang and Cham, Tat-Jen and Wang, Chaoyue and Yang, Zuopeng and Tao, Dacheng and Suganthan, Ponnuthurai N},
  journal = {arXiv},
  year = {2022},
}
