TextDiffuser: Diffusion Models as Text Painters (NeurIPS 2023)

TextDiffuser generates images with visually appealing text that is coherent with backgrounds. It is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text.

🌟 Highlights

  • We propose TextDiffuser, a two-stage diffusion-based framework for text rendering. It generates accurate and coherent text images from text prompts, optionally together with template images, and performs text inpainting to reconstruct incomplete images.

  • We release MARIO-10M, containing large-scale image-text pairs with OCR annotations, including text recognition, detection, and character-level segmentation masks.

  • We construct MARIO-Eval, a comprehensive text rendering benchmark containing 10k prompts at link.

  • We release the demo at link. Feel free to try it and share feedback 🤗.

⏱️ News

  • [2023.09.22]: 🎉 TextDiffuser is accepted to NeurIPS 2023.
  • [2023.06.22]: Evaluation script is released.
  • [2023.06.15]: 🙌 🙌 🙌 The demo of TextDiffuser pre-trained with SD v2.1 is released at this link. A Google Colab notebook is also available at this link.
  • [2023.06.08]: Training script is released.
  • [2023.06.07]: MARIO-LAION is released.
  • [2023.06.02]: 🙌 🙌 🙌 Demo is available in this link.
  • [2023.05.26]: Upload the inference code and checkpoint.
  • [2023.05.19]: The paper is available at link.

🛠️ Installation

Clone this repo:

git clone github_path_to/TextDiffuser
cd TextDiffuser

Create a new environment and install the packages as follows:

conda create -n textdiffuser python=3.8
conda activate textdiffuser
pip install -r requirements.txt

Also install the torch and torchvision builds that match your system and CUDA version (refer to this link).

Install Hugging Face Diffusers and replace some files:

git clone https://github.com/JingyeChen/diffusers
cp ./assets/files/scheduling_ddpm.py ./diffusers/src/diffusers/schedulers/scheduling_ddpm.py
cp ./assets/files/unet_2d_condition.py ./diffusers/src/diffusers/models/unet_2d_condition.py
cp ./assets/files/modeling_utils.py ./diffusers/src/diffusers/models/modeling_utils.py
cd diffusers && pip install -e .
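
The `cp` commands above overwrite three files inside the cloned diffusers tree, so it can be worth confirming that they took effect. The following is a minimal sketch, assuming it is run from the TextDiffuser repository root; the helper name `unpatched_files` is hypothetical and not part of the repo.

```python
import filecmp
from pathlib import Path

# (source in this repo, destination inside the cloned diffusers tree),
# copied from the cp commands above.
PATCHES = [
    ("assets/files/scheduling_ddpm.py",
     "diffusers/src/diffusers/schedulers/scheduling_ddpm.py"),
    ("assets/files/unet_2d_condition.py",
     "diffusers/src/diffusers/models/unet_2d_condition.py"),
    ("assets/files/modeling_utils.py",
     "diffusers/src/diffusers/models/modeling_utils.py"),
]


def unpatched_files(root="."):
    """Return destinations that are missing or differ from their source file."""
    root = Path(root)
    bad = []
    for src, dst in PATCHES:
        s, d = root / src, root / dst
        # shallow=False forces a byte-by-byte comparison of the two files.
        if not (s.is_file() and d.is_file() and filecmp.cmp(s, d, shallow=False)):
            bad.append(dst)
    return bad


if __name__ == "__main__":
    for path in unpatched_files():
        print("not patched:", path)
```

An empty result suggests the three files were replaced as intended.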

In addition, a font file is needed for layout generation. Please put your font in assets/font/. We recommend Arial.ttf.

💾 Checkpoint

The checkpoints are at HFLink (3.2GB). Please download and unzip the archive. The file structure should be as follows:

textdiffuser
├── textdiffuser-ckpt
│   ├── diffusion_backbone/             # for diffusion backbone
│   ├── character_aware_loss_unet.pth   # for character-aware loss
│   ├── layout_transformer.pth          # for layout transformer
│   └── text_segmenter.pth              # for character-level segmenter
├── README.md
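
A quick way to check the unzipped layout against the tree above is sketched below. This is an illustrative helper (the name `missing_checkpoint_entries` is hypothetical), assuming the checkpoint folder sits in the current working directory.

```python
from pathlib import Path

# Expected entries, copied from the tree above.
EXPECTED = {
    "diffusion_backbone": "dir",
    "character_aware_loss_unet.pth": "file",
    "layout_transformer.pth": "file",
    "text_segmenter.pth": "file",
}


def missing_checkpoint_entries(ckpt_dir="textdiffuser-ckpt"):
    """Return the names of expected entries that are absent from ckpt_dir."""
    ckpt_dir = Path(ckpt_dir)
    missing = []
    for name, kind in EXPECTED.items():
        path = ckpt_dir / name
        present = path.is_dir() if kind == "dir" else path.is_file()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    print(missing_checkpoint_entries() or "checkpoint layout looks complete")
```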

📚 Dataset

MARIO-LAION's meta information is at googledrive (40GB), containing 9,194,613 samples. Please download it and unzip it by running python data/maion-laion-unzip.py. Each folder should have the structure shown below; data/maion-laion-example is provided for reference. We also provide data/visualize_charseg.ipynb to visualize the character-level segmentation masks.

├── 28330/
│   ├── 283305839/
│   │   ├── caption.txt       # caption of the image
│   │   ├── charseg.npy       # character-level segmentation mask
│   │   ├── info.json         # more meta information given by laion, such as original height and width
│   │   └── ocr.txt           # ocr detection and recognition results
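
Reading one sample folder could look like the sketch below. It relies only on the file names in the tree above; the function name `load_sample` is hypothetical, and because the line format of ocr.txt is not described here, the lines are returned unparsed.

```python
import json
from pathlib import Path


def load_sample(sample_dir):
    """Read the text-based annotations of one MARIO-LAION sample folder."""
    sample_dir = Path(sample_dir)
    caption = (sample_dir / "caption.txt").read_text(encoding="utf-8").strip()
    info = json.loads((sample_dir / "info.json").read_text(encoding="utf-8"))
    # The ocr.txt format is not documented in this README, so keep raw lines.
    ocr_lines = (sample_dir / "ocr.txt").read_text(encoding="utf-8").splitlines()
    return {
        "caption": caption,
        "info": info,
        "ocr_lines": ocr_lines,
        # Path only; the mask itself can be loaded separately with numpy.
        "charseg_path": sample_dir / "charseg.npy",
    }
```

The segmentation mask can then be loaded with `numpy.load(sample["charseg_path"])` when NumPy is available.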

The URLs of the images are at googledrive (794.6MB). The file structure is as follows:

├── maion_laion_image_url/
│   ├── mario-laion-url.txt         # urls for downloading by img2dataset
│   ├── mario-laion-index-url.txt   # urls and indices for each image
│   └── mario-laion-test-index.txt  # all indices for test dataset

Please install img2dataset with pip install img2dataset, then download the images using the following command:

img2dataset --url_list=url.txt --output_folder=laion_ocr --thread_count=64 --resize_mode=no

After downloading, resize each image to 512x512, then follow mario-laion-index-url.txt to move each image into its corresponding folder. Images with indices in mario-laion-test-index.txt are used for testing. Please note that some links may be invalid because the owners have removed the images from their websites.
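
The sketch below illustrates the bookkeeping for this step. It assumes that an index such as 27197_271975251 corresponds to the folder 27197/271975251/, which is inferred from the folder tree above and the index format used for training, not confirmed by the repo. Both helper names are hypothetical.

```python
from pathlib import Path


def index_to_folder(dataset_root, index):
    """Map '27197_271975251' to <dataset_root>/27197/271975251 (assumed layout)."""
    parent, sample = index.split("_", 1)
    return Path(dataset_root) / parent / sample


def split_indices(all_indices, test_indices):
    """Split indices into (train, test), preserving the original order."""
    test = set(test_indices)
    train = [i for i in all_indices if i not in test]
    held_out = [i for i in all_indices if i in test]
    return train, held_out
```

For example, `index_to_folder("laion-ocr-unzip", "27197_271975251")` would give `laion-ocr-unzip/27197/271975251` under that assumption.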

🚂 Train

First, use accelerate config to configure your acceleration policy, then modify output_dir, dataset_path, and train_dataset_index_file in train.sh. The train_dataset_index_file should be a .txt file in which each line is the index of one training sample, for example:

06269_062690093
27197_271975251
27197_271978467
...
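
Before launching a long run, the index file can be sanity-checked. The sketch below assumes that every index has the form digits, underscore, digits, which is inferred from the three example lines above only; `parse_index_lines` is a hypothetical helper.

```python
import re

# Assumed shape of an index, based on the example lines above.
INDEX_PATTERN = re.compile(r"^\d+_\d+$")


def parse_index_lines(lines):
    """Return (valid, invalid) index strings, skipping blank lines."""
    valid, invalid = [], []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if INDEX_PATTERN.match(line):
            valid.append(line)
        else:
            invalid.append(line)
    return valid, invalid
```

It accepts any iterable of lines, so an open file handle can be passed directly: `valid, invalid = parse_index_lines(open("/path/to/index_file.txt"))`.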

Then use the following command to train TextDiffuser:

accelerate launch train.py \
    --train_batch_size=24 \
    --gradient_accumulation_steps=4 \
    --gradient_checkpointing \
    --mixed_precision="fp16" \
    --num_train_epochs=2 \
    --learning_rate=1e-5 \
    --max_grad_norm=1 \
    --lr_scheduler="constant" \
    --lr_warmup_steps=0 \
    --output_dir="experiment_name" \
    --enable_xformers_memory_efficient_attention \
    --dataloader_num_workers=4 \
    --character_aware_loss_lambda=0.01 \
    --resume_from_checkpoint="latest" \
    --drop_caption \
    --mask_all_ratio=0.5 \
    --segmentation_mask_aug \
    --dataset_path=/home/path/to/laion-ocr-unzip \
    --train_dataset_index_file=/path/to/index_file.txt \
    --vis_num=8

If you encounter an "out-of-memory" error, please consider reducing the batch size appropriately.
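
When lowering the batch size, raising --gradient_accumulation_steps by the same factor keeps the number of samples per optimizer step unchanged. The arithmetic is sketched below, assuming --train_batch_size is a per-device value as in the upstream diffusers training scripts; `effective_batch_size` is an illustrative helper.

```python
def effective_batch_size(train_batch_size, gradient_accumulation_steps, num_processes=1):
    """Samples that contribute to a single optimizer step."""
    return train_batch_size * gradient_accumulation_steps * num_processes


# Settings from the command above, on a single process.
print(effective_batch_size(24, 4))  # 96

# Half the per-device batch, twice the accumulation: same effective size.
print(effective_batch_size(12, 8))  # 96
```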

🧨 Inference

TextDiffuser supports three tasks: text-to-image, text-to-image-with-template, and text-inpainting.

Text-to-Image

This task is designed to generate images based on given prompts. Users are required to enclose the keywords to be drawn with single quotation marks.

CUDA_VISIBLE_DEVICES=0 python inference.py \
  --mode="text-to-image" \
  --resume_from_checkpoint="textdiffuser-ckpt/diffusion_backbone" \
  --prompt="A sign that says 'Hello'" \
  --output_dir="./output" \
  --vis_num=4
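
To preview which keywords a prompt marks for rendering, the quoted spans can be pulled out with a regular expression. This is an illustrative sketch, not the parser used by inference.py, and the function name `quoted_keywords` is hypothetical.

```python
import re


def quoted_keywords(prompt):
    """Return the substrings enclosed in single quotation marks."""
    return re.findall(r"'([^']+)'", prompt)


print(quoted_keywords("A sign that says 'Hello'"))  # ['Hello']
```

Note that an apostrophe elsewhere in the prompt (for example in "don't") would confuse this simple pattern, so such prompts are best rephrased.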

Text-to-Image-with-Template

This task aims to generate images based on given prompts and template images (can be printed, handwritten, or scene text images). A pre-trained character-level segmentation model is used to extract layout information from the template image.

CUDA_VISIBLE_DEVICES=0 python inference.py \
  --mode="text-to-image-with-template" \
  --resume_from_checkpoint="textdiffuser-ckpt/diffusion_backbone" \
  --prompt="a poster of monkey music festival" \
  --template_image="assets/examples/text-to-image-with-template/case2.jpg" \
  --output_dir="./output" \
  --vis_num=4

Text-Inpainting

This task aims to modify a given image in an inpainting manner. The provided text mask image should contain the inpainting region and the text to be drawn within the region.

CUDA_VISIBLE_DEVICES=0 python inference.py \
  --mode="text-inpainting" \
  --resume_from_checkpoint="textdiffuser-ckpt/diffusion_backbone" \
  --prompt="a boy draws good morning on a board" \
  --original_image="assets/examples/text-inpainting/case2.jpg" \
  --text_mask="assets/examples/text-inpainting/case2_mask.jpg" \
  --output_dir="./output" \
  --vis_num=4
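
The three modes differ only in the extra image arguments they need. The sketch below assembles the argument list for a given mode from the flags shown in the commands above and fails early when a required image is missing. The helper `build_inference_argv` is hypothetical and does not run inference itself.

```python
import shlex

# Extra arguments each mode needs, taken from the commands above.
REQUIRED_EXTRA = {
    "text-to-image": [],
    "text-to-image-with-template": ["template_image"],
    "text-inpainting": ["original_image", "text_mask"],
}


def build_inference_argv(mode, prompt, output_dir="./output", vis_num=4,
                         checkpoint="textdiffuser-ckpt/diffusion_backbone", **extra):
    """Build the argument list for inference.py, validating per-mode inputs."""
    if mode not in REQUIRED_EXTRA:
        raise ValueError(f"unknown mode: {mode}")
    missing = [key for key in REQUIRED_EXTRA[mode] if key not in extra]
    if missing:
        raise ValueError(f"mode {mode} also needs: {', '.join(missing)}")
    argv = [
        "python", "inference.py",
        f"--mode={mode}",
        f"--resume_from_checkpoint={checkpoint}",
        f"--prompt={prompt}",
        f"--output_dir={output_dir}",
        f"--vis_num={vis_num}",
    ]
    argv += [f"--{key}={extra[key]}" for key in REQUIRED_EXTRA[mode]]
    return argv


if __name__ == "__main__":
    argv = build_inference_argv("text-to-image", "A sign that says 'Hello'")
    print(shlex.join(argv))  # shell-quoted command line, for inspection
```

The returned list can be passed to `subprocess.run(argv, check=True)` from the repository root.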

📈 Evaluation

For evaluation, please download MARIOEval; the generation results of each method are at link for reference. MARIOEval contains 5,414 prompts for evaluation, including the following subsets:

Subset           #Sample   Subset             #Sample
LAIONEval4000    4,000     ChineseDrawText    175
TMDBEval500      500       DrawBenchText      21
OpenLibrary500   500       DrawTextCreative   218
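
As a quick consistency check, the six subset sizes listed above add up to the stated total of 5,414 prompts:

```python
# Subset sizes copied from the table above.
subset_sizes = {
    "LAIONEval4000": 4000,
    "TMDBEval500": 500,
    "OpenLibrary500": 500,
    "ChineseDrawText": 175,
    "DrawBenchText": 21,
    "DrawTextCreative": 218,
}

total = sum(subset_sizes.values())
print(total)  # 5414
```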

The structure of each folder is as follows:

├── LAIONEval4000/
│   ├── images/                       # ground truth images
│   ├── render/                       # layouts of keywords generated by Layout Transformer
│   ├── LAIONEval4000.txt             # prompts with keywords enclosed with quotes
│   └── LAIONEval4000_wo_quote.txt    # prompts without quotes

Please note that the ground truth images are only available for the LAIONEval4000, TMDBEval500, and OpenLibrary500 subsets. The render images are used for evaluating ControlNet. We manually enclose keywords with quotes according to the OCR results. Please refer to the _wo_quote.txt version for the original prompts.

To evaluate TextDiffuser, please use the following command for sampling:

CUDA_VISIBLE_DEVICES=0 python evaluate.py \
  --mode="text-to-image" \
  --resume_from_checkpoint="textdiffuser-ckpt/diffusion_backbone" \
  --prompt_list="/path/to/MARIOEval/TMDBEval500/TMDBEval500.txt" \
  --output_dir="/path/to/output_dir" \
  --vis_num=4
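
To sample every subset in turn, the --prompt_list path can be generated per subset. The sketch below assumes that all six subsets follow the `<Subset>/<Subset>.txt` naming shown for LAIONEval4000 and TMDBEval500, which the README shows for those two subsets only; `prompt_list_path` is a hypothetical helper.

```python
from pathlib import Path

SUBSETS = [
    "LAIONEval4000", "TMDBEval500", "OpenLibrary500",
    "ChineseDrawText", "DrawBenchText", "DrawTextCreative",
]


def prompt_list_path(root, subset, with_quotes=True):
    """Path of a subset's prompt file (naming convention is assumed)."""
    suffix = "" if with_quotes else "_wo_quote"
    return Path(root) / subset / f"{subset}{suffix}.txt"


if __name__ == "__main__":
    for subset in SUBSETS:
        print(prompt_list_path("/path/to/MARIOEval", subset))
```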

To sample from other baseline methods (e.g., Stable Diffusion, ControlNet, and DeepFloyd), use the scripts provided in the ./eval folder. We also provide scripts for calculating FID, CLIPScore, and the OCR metrics.

Metrics          Stable Diffusion   ControlNet   DeepFloyd   TextDiffuser (Ours)
FID↓             51.295             51.485       34.902      38.758
CLIPScore↑       0.3015             0.3424       0.3267      0.3436
OCR-Accuracy↑    0.0003             0.2390       0.0262      0.5609
OCR-Precision↑   0.0173             0.5211       0.1450      0.7846
OCR-Recall↑      0.0280             0.6707       0.2245      0.7802
OCR-Fmeasure↑    0.0214             0.5865       0.1762      0.7824
*OCR-Accuracy↑   0.0178             0.2705       0.0457      0.5712
*OCR-Precision↑  0.0192             0.5391       0.1738      0.7795
*OCR-Recall↑     0.0260             0.6438       0.2235      0.7498
*OCR-Fmeasure↑   0.0221             0.5868       0.1955      0.7643

Please note that OCR metrics marked with "*" are computed with the open-source MaskTextSpotterV3, while those without "*" are computed with the Microsoft OCR API. The table reports text-to-image performance on MARIO-Eval compared with existing methods. TextDiffuser performs best on CLIPScore and the OCR metrics while achieving comparable FID.
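
The OCR-Fmeasure rows are consistent with the standard harmonic mean of precision and recall, F = 2PR / (P + R). The sketch below recomputes it from the Microsoft OCR API rows of the table; that the evaluation scripts use exactly this formula is an assumption, and small differences can arise because the table values are rounded.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above (rows without "*").
rows = {
    "Stable Diffusion": (0.0173, 0.0280),
    "ControlNet": (0.5211, 0.6707),
    "DeepFloyd": (0.1450, 0.2245),
    "TextDiffuser": (0.7846, 0.7802),
}

for name, (p, r) in rows.items():
    print(f"{name}: {f_measure(p, r):.4f}")
```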

User studies for whole-image generation and part-image generation tasks. (a) For whole-image generation, our method clearly outperforms others in both aspects of text rendering quality and image-text matching. (b) For part-image generation, our method receives high scores from human evaluators in these two aspects.

🕹️ Demo

TextDiffuser has been deployed on Hugging Face. If you have advanced GPUs, you may deploy the demo locally as follows:

CUDA_VISIBLE_DEVICES=0 python gradio_app.py

Then open the demo in your local browser.

💌 Acknowledgement

We sincerely thank the following projects: Hugging Face Diffusers, LAION, DB, PARSeq, img2dataset.

Also, special thanks to the following open-source diffusion projects and available demos: DALLE, Stable Diffusion, Stable Diffusion XL, Midjourney, ControlNet, DeepFloyd.

❗ Disclaimer

Please note that the code is intended for academic and research purposes ONLY. Any use of the code for generating inappropriate content is strictly prohibited. The responsibility for any misuse or inappropriate use of the code lies solely with the users who generated such content, and this code shall not be held liable for any such use.

✉️ Contact

For help or issues using TextDiffuser, please email Jingye Chen (qwerty.chen@connect.ust.hk), Yupan Huang (huangyp28@mail2.sysu.edu.cn) or submit a GitHub issue.

For other communications related to TextDiffuser, please contact Lei Cui (lecu@microsoft.com) or Furu Wei (fuwei@microsoft.com).

🌿 Citation

If you find this code useful in your research, please consider citing:

@article{chen2023textdiffuser,
  title={TextDiffuser: Diffusion Models as Text Painters},
  author={Chen, Jingye and Huang, Yupan and Lv, Tengchao and Cui, Lei and Chen, Qifeng and Wei, Furu},
  journal={arXiv preprint arXiv:2305.10855},
  year={2023}
}