The recently developed discrete diffusion models perform extraordinarily well on the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both "modality translation" and "multi-modality generation" tasks using a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions on a variety of generation tasks.
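To make the "unified transition matrix" idea concrete, here is a minimal sketch of one forward-corruption step over a joint vocabulary. Everything in this snippet (the function name, the schedule values, and the exact block structure) is an illustrative assumption, not the UniD3 implementation: the text and image codebooks are concatenated into one discrete state space with a shared [MASK] absorbing state, and a single matrix governs how tokens of both modalities are corrupted.

```python
# Illustrative sketch only: a joint transition matrix for one discrete-diffusion
# step over a concatenated text + image token vocabulary with a shared [MASK]
# absorbing state. The names and schedule values are assumptions for exposition.
import torch

def unified_transition_matrix(n_text, n_image, alpha_t, gamma_t):
    """Q_t[i, j] = probability that token state i moves to state j at step t.

    alpha_t : probability of keeping the current token
    gamma_t : probability of jumping to the shared [MASK] state
    The leftover mass (1 - alpha_t - gamma_t) is spread uniformly over the
    codebook of the *same* modality, so text ids never turn into image ids.
    """
    n = n_text + n_image + 1                  # +1 for the shared [MASK] token
    mask_id = n - 1
    beta_t = 1.0 - alpha_t - gamma_t

    Q = torch.zeros(n, n)
    # Text block: stay, re-sample uniformly within the text vocabulary, or mask.
    Q[:n_text, :n_text] = beta_t / n_text
    Q[:n_text, :n_text] += alpha_t * torch.eye(n_text)
    Q[:n_text, mask_id] = gamma_t
    # Image block: identical structure over the image codebook.
    img = slice(n_text, n_text + n_image)
    Q[img, img] = beta_t / n_image
    Q[img, img] += alpha_t * torch.eye(n_image)
    Q[img, mask_id] = gamma_t
    # [MASK] is absorbing: once masked, always masked.
    Q[mask_id, mask_id] = 1.0
    return Q

# Toy codebook sizes so the matrix is easy to inspect; every row sums to one.
Q = unified_transition_matrix(n_text=8, n_image=16, alpha_t=0.9, gamma_t=0.05)
assert torch.allclose(Q.sum(dim=-1), torch.ones(Q.shape[0]))
```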
The code is compatible with Python 3.8 and PyTorch 1.9.
You can create an Anaconda environment called unid3 with the required dependencies by running:
git clone https://github.com/mhh0318/UniD3.git
cd UniD3
conda create -n unid3 python=3.8
conda activate unid3
pip install -r requirements.txt
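After installing, you can optionally verify that the interpreter and PyTorch match the supported versions. This check is just a convenience and is not part of the repository:

```python
# Convenience check that the environment matches the supported versions.
import sys
import torch

print("python:", sys.version.split()[0])    # expected 3.8.x
print("torch :", torch.__version__)         # expected 1.9.x
print("cuda  :", torch.cuda.is_available())
```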
Download the pretrained models from here, and save them to pretrained_models/.
Download the released Gumbel VQGAN model trained on OpenImages and put it under ./misc/taming_dvae/.
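A small sanity check that the checkpoints are where the sampling script expects them. The directory names follow the two steps above; nothing else about this snippet is taken from the repository:

```python
# Check that the downloaded checkpoints are in the expected directories.
from pathlib import Path

expected_dirs = [
    Path("pretrained_models"),   # UniD3 checkpoints
    Path("misc/taming_dvae"),    # Gumbel VQGAN weights
]
for d in expected_dirs:
    ok = d.is_dir() and any(d.iterdir())
    print(f"{d}: {'ok' if ok else 'missing or empty'}")
```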
For simultaneous vision-language generation, please run:
python ./UniDiff/dist_eval_sample.py --model CKPT_PATH --condition unconditional --log pair_samples
If the environment is set up correctly, this command should run without errors and generate some results in the folder ./pair_samples.
- Our codebase for the diffusion models builds heavily on https://github.com/lucidrains/denoising-diffusion-pytorch, VQ-Diffusion, and Multinomial Diffusion. Thanks for open-sourcing!
- The implementation of the transformer encoder is from x-transformers by lucidrains.
@article{hu2022unified,
title = {Unified Discrete Diffusion for Simultaneous Vision-Language Generation},
author = {Hu, Minghui and Zheng, Chuanxia and Zheng, Heliang and Cham, Tat-Jen and Wang, Chaoyue and Yang, Zuopeng and Tao, Dacheng and Suganthan, Ponnuthurai N},
journal = {arXiv},
year = {2022},
}