TL;DR: We introduce StyleT2I, a new framework for compositional and high-fidelity text-to-image synthesis.
Figure 1. When the text input contains a composition of attributes that is underrepresented in the dataset, e.g., (he, wearing lipstick), previous methods [1-3] generate the attributes incorrectly and with poor image quality. In contrast, StyleT2I achieves better compositionality and higher-fidelity text-to-image synthesis results.
StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis
Zhiheng Li, Martin Renqiang Min, Kai Li, Chenliang Xu
NEC Laboratories America, University of Rochester
Contact: Zhiheng Li (email: zhiheng.li@rochester.edu, homepage: https://zhiheng.li)
Install the following dependencies: pytorch torchvision torchtext pandas ninja
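A minimal installation sketch, assuming a pip-based environment (on pip the PyTorch package is named torch; for GPU support, follow the official PyTorch install instructions for your CUDA version instead):

# install PyTorch and the remaining dependencies
pip install torch torchvision torchtext pandas ninja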
Put each dataset in a folder under the data directory as follows:
data
├── celebahq
├── cub
├── ffhq
└── nabirds
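For example, the layout above can be created from the repository root with the following command (the datasets themselves are downloaded in the steps below):

mkdir -p data/celebahq data/cub data/ffhq data/nabirds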
CelebA-HQ
Download CelebAMask-HQ from here and unzip it to data/celebahq/CelebAMask-HQ.
CUB
Download CUB from here and unzip it to data/cub/CUB_200_2011.
NABirds
Download the NABirds dataset from here and unzip it to data/nabirds.
Download the pretrained StyleGAN2 models to exp/pretrained_stylegan2 from here.
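For example, from the repository root:

mkdir -p exp/pretrained_stylegan2
# place the downloaded StyleGAN2 checkpoint(s) inside exp/pretrained_stylegan2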
The following commands are the bash scripts for training on the CelebA-HQ dataset. For other datasets, simply replace /celebahq/ in the script paths with the corresponding dataset folder, e.g., /cub/, /ffhq/, or /nabirds/ (see the CUB example after the CelebA-HQ commands below).
Our StyleGAN2 code is based on the implementation at https://github.com/rosinality/stylegan2-pytorch.
If you prefer to pretrain StyleGAN2 yourself, use the following command. Otherwise, use the pretrained model provided above.
bash scripts/celebahq/pretrain_stylegan2.sh
Finetune the CLIP text encoder:
bash scripts/celebahq/ft_clip_text.sh
Note that finetuning CLIP is only available for the CelebA-HQ and CUB datasets; it is not available for FFHQ and NABirds because those datasets do not have text annotations. However, StyleT2I can still perform cross-dataset generation on them, i.e., StyleT2I-XD. More details are in the paper.
Train StyleT2I:
bash scripts/celebahq/train.sh
Synthesize images from text:
bash scripts/celebahq/synthesize.sh
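As noted above, the same pipeline runs on the other datasets by switching the script folder. For example, for CUB (a sketch, assuming the same script layout under scripts/cub/):

bash scripts/cub/pretrain_stylegan2.sh   # optional if using the provided pretrained model
bash scripts/cub/ft_clip_text.sh
bash scripts/cub/train.sh
bash scripts/cub/synthesize.sh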
[1] B. Li, X. Qi, T. Lukasiewicz, and P. Torr, “Controllable Text-to-Image Generation,” in NeurIPS, 2019.
[2] S. Ruan et al., “DAE-GAN: Dynamic Aspect-aware GAN for Text-to-Image Synthesis,” in ICCV, 2021.
[3] W. Xia, Y. Yang, J.-H. Xue, and B. Wu, “TediGAN: Text-Guided Diverse Face Image Generation and Manipulation,” in CVPR, 2021.
@InProceedings{Li_2022_CVPR,
author = {Li, Zhiheng and Min, Martin Renqiang and Li, Kai and Xu, Chenliang},
title = {StyleT2I: Toward Compositional and High-Fidelity Text-to-Image Synthesis},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year = {2022}
}