- Release ScenePair benchmark dataset and code of model;
- Release checkpoints and inference code;
- Release training pipeline;
- Provide demo link;
# Clone the repo
$ git clone https://github.com/weichaozeng/TextCtrl.git
$ cd TextCtrl/
# Install required packages
$ conda create --name textctrl python=3.8
$ conda activate textctrl
$ pip install torch==1.13.0+cu116 torchvision==0.14.0+cu116 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/cu116
$ pip install -r requirement.txt
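To confirm the pinned versions installed correctly, a quick check can be run before downloading any weights (a minimal sketch; it assumes only the packages installed above):

```python
# Sanity-check the environment against the pinned versions above.
import torch
import torchvision

print(torch.__version__)          # expected: 1.13.0+cu116
print(torchvision.__version__)    # expected: 0.14.0+cu116
print(torch.cuda.is_available())  # should be True on a CUDA 11.6 machine
```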
Download the checkpoints from Link_1 and Link_2. The file structure should be organized as follows:
TextCtrl/
├── weights/
│   ├── model.pth                      # weights of the style encoder and U-Net
│   ├── text_encoder.pth               # weights of the pretrained glyph encoder
│   ├── style_encoder.pth              # weights of the pretrained style encoder
│   ├── vision_model.pth               # monitor weights
│   ├── ocr_model.pth                  # OCR weights
│   ├── vgg19.pth                      # VGG weights
│   ├── vitstr_base_patch16_224.pth    # ViTSTR weights
│   └── sd/                            # pretrained weights of stable-diffusion-v1-5
│       ├── vae/
│       ├── unet/
│       └── scheduler/
├── README.md
├── ...
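Before running inference, a small script can confirm that every checkpoint is in place (a minimal sketch; the paths simply mirror the tree above):

```python
# Verify the checkpoint layout described above; paths mirror the tree.
from pathlib import Path

weights = Path("weights")
expected = [
    "model.pth", "text_encoder.pth", "style_encoder.pth",
    "vision_model.pth", "ocr_model.pth", "vgg19.pth",
    "vitstr_base_patch16_224.pth", "sd/vae", "sd/unet", "sd/scheduler",
]
missing = [name for name in expected if not (weights / name).exists()]
print("all checkpoints found" if not missing else f"missing: {missing}")
```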
The inference data should be organized as in example/:
TextCtrl/
├── example/
│   ├── i_s/       # source cropped text images
│   ├── i_s.txt    # filename and text label of the source images in i_s/
│   └── i_t.txt    # filename and target text label for each source image
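Assuming each line of i_s.txt and i_t.txt holds a filename followed by its text label (the exact file format is an assumption, not confirmed here), the inference pairs can be assembled like this:

```python
# Pair source crops with target texts; the "filename label" line format
# of i_s.txt / i_t.txt is an assumption.
from pathlib import Path

root = Path("example")

def read_labels(txt_path):
    labels = {}
    for line in txt_path.read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            labels[parts[0]] = parts[1]
    return labels

source = read_labels(root / "i_s.txt")  # filename -> source text
target = read_labels(root / "i_t.txt")  # filename -> target text
for name, text in source.items():
    print(root / "i_s" / name, repr(text), "->", repr(target.get(name)))
```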
Edit the arguments in inference.py, especially:
parser.add_argument("--ckpt_path", type=str, default="weights/model.pth")
parser.add_argument("--dataset_dir", type=str, default="example/")
parser.add_argument("--output_dir", type=str, default="example_result/")
The inference results can be found in example_result/ after running:
$ PYTHONPATH=.../TextCtrl/ python inference.py
*(Qualitative results table omitted: columns Source Images | Target Text | Infer Results | Reference GT, with target texts "Private", "First", "RECORDS", "Sunset", and "Network"; the cell images are not reproduced here.)*
The training relies on synthetic data generated by SRNet-Datagen, with some modifications to produce the required elements. The file structure should be set as follows:
Syn_data/
├── fonts/
│   ├── arial.ttf
│   └── ...
├── train/
│   ├── train-50k-1/
│   ├── train-50k-2/
│   ├── train-50k-3/
│   └── train-50k-4/       # each train-50k-*/ split shares the same layout:
│       ├── i_s/
│       ├── mask_s/
│       ├── i_s.txt
│       ├── t_f/
│       ├── mask_t/
│       ├── i_t.txt
│       ├── t_t/
│       ├── t_b/
│       └── font.txt
└── eval/
    └── eval-1k/
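A quick sanity check of the synthetic splits can catch missing or mismatched files before training starts (a minimal sketch; it assumes every train-50k-*/ folder carries the subdirectories listed above):

```python
# Check that every synthetic split carries the expected subdirectories
# and a consistent number of samples (layout assumed from the tree above).
from pathlib import Path

root = Path("Syn_data/train")
subdirs = ["i_s", "mask_s", "t_f", "mask_t", "t_t", "t_b"]
for split in sorted(root.glob("train-50k-*")):
    counts = {d: len(list((split / d).glob("*"))) for d in subdirs}
    print(split.name, counts)  # all counts within a split should match
```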
# Pretrain the style encoder
$ cd prestyle/
# Modify the dataset directory path in the config file
$ cd configs/
$ vi StyleTrain.yaml
# Start pretraining
$ cd ..
$ python train.py
# Pretrain the glyph encoder
$ cd ../preglyph/
# Modify the dataset directory path in the config file
$ cd configs/
$ vi GlyphTrain.yaml
# Start pretraining
$ cd ..
$ python pretrain.py
# Train the full model from the TextCtrl/ repository root
$ cd ../
# Modify the dataset directory path in the config file
$ cd configs/
$ vi train.yaml
# Start training
$ cd ..
$ python train.py
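If the YAML configs are OmegaConf-compatible, the dataset path can also be set programmatically instead of with vi; in the sketch below, data_dir is a hypothetical key name standing in for whatever path key the config actually uses:

```python
# Point a training config at the synthetic data without opening an editor.
# OmegaConf compatibility and the "data_dir" key name are assumptions.
from omegaconf import OmegaConf

cfg = OmegaConf.load("configs/train.yaml")
cfg.data_dir = "/path/to/Syn_data/train"  # hypothetical key name
OmegaConf.save(cfg, "configs/train.yaml")
```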
Download the ScenePair dataset from Link and unzip the files. The structure of each folder is as follows:
ScenePair/
├── i_s/             # source cropped text images
├── t_f/             # target cropped text images
├── i_full/          # full-size images
├── i_s.txt          # filename and text label of images in i_s/
├── i_t.txt          # filename and text label of images in t_f/
├── i_s_full.txt     # filename, text label, corresponding full-size image name and location of images in i_s/
└── i_t_full.txt     # filename, text label, corresponding full-size image name and location of images in t_f/
Before evaluation, edited images should be generated for a given method on the ScenePair dataset and saved in a '.../result_folder/' with the same filenames as the ground truth. Results of some methods on the ScenePair dataset are provided here.
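Since the metrics compare files by name, it is worth confirming that a method's result folder aligns one-to-one with the ground truth before scoring (a minimal sketch; result_folder is a placeholder path):

```python
# Check that edited results match the ScenePair ground truth filenames;
# "result_folder" is a placeholder for the method's output directory.
from pathlib import Path

gt = {p.name for p in Path("ScenePair/t_f").iterdir()}
results = {p.name for p in Path("result_folder").iterdir()}
print("missing results:", sorted(gt - results)[:5])
print("unexpected files:", sorted(results - gt)[:5])
```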
SSIM, PSNR, MSE and FID are used to evaluate the style fidelity of the edited results, following qqqyd/MOSTEL.
$ cd evaluation/
$ python evaluation.py --target_path .../result_folder/ --gt_path .../ScenePair/t_f/
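For reference, the per-image style metrics can be reproduced with scikit-image as below (a sketch of the standard SSIM/PSNR/MSE definitions, not the repo's exact implementation; FID requires a separate tool such as pytorch-fid):

```python
# Standard per-image style metrics (a sketch, not evaluation.py itself).
import numpy as np
from PIL import Image
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def style_metrics(pred_path, gt_path):
    pred_img = Image.open(pred_path).convert("RGB")
    gt_img = Image.open(gt_path).convert("RGB").resize(pred_img.size)
    pred = np.asarray(pred_img, dtype=np.float64)
    gt = np.asarray(gt_img, dtype=np.float64)
    mse = float(np.mean((pred - gt) ** 2))
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    return ssim, psnr, mse
```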
ACC and NED are used to evaluate the text accuracy of the edited results, using the official code and checkpoint from clovaai/deep-text-recognition-benchmark.
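ACC here is exact-match word accuracy between the recognized and target strings, and NED is commonly defined as one minus the Levenshtein distance normalized by the longer string's length. A small reference implementation of these standard definitions (not the benchmark's own code):

```python
# Text-accuracy metrics under their standard definitions
# (not the deep-text-recognition-benchmark code itself).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ned(pred: str, gt: str) -> float:
    if not pred and not gt:
        return 1.0
    return 1.0 - levenshtein(pred, gt) / max(len(pred), len(gt))

def acc(preds, gts):
    return sum(p == g for p, g in zip(preds, gts)) / len(gts)
```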
Many thanks to these great projects: lksshw/SRNet, youdao-ai/SRNet-Datagen, qqqyd/MOSTEL, UCSB-NLP-Chang/DiffSTE, ZYM-PKU/UDiffText, TencentARC/MasaCtrl, unilm/textdiffuser and tyxsspa/AnyText.
@article{zeng2024textctrl,
title={TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control},
author={Zeng, Weichao and Shu, Yan and Li, Zhenhang and Yang, Dongbao and Zhou, Yu},
journal={arXiv preprint arXiv:2410.10133},
year={2024}
}