InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model

Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap Peng Tan, Weipeng Hu

Existing methods lack the ability to control the interactions between objects in the generated content.
We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models so that they can be better conditioned on interactions.

News

[2024.3.13] Diffusers code is available here.
[2024.3.8] Demo is available at Huggingface Spaces.
[2024.3.6] Code is released.
[2024.2.27] InteractDiffusion paper is accepted at CVPR 2024.
[2023.12.12] InteractDiffusion paper is released. WebUI of InteractDiffusion is available as an alpha version.

Results

Model	Interaction Controllability (Tiny)	Interaction Controllability (Large)	FID	KID
v1.0	29.53	31.56	18.69	0.00676
v1.1	30.20	31.96	17.90	0.00635
v1.2	30.73	33.10	17.32	0.00585

Interaction Controllability is measured using the FGAHOI detection score. In this table, we measure the Full subset in the Default setting on the Swin-Tiny and Swin-Large backbones. More details on the protocol are in the paper.

Download InteractDiffusion models

We provide three checkpoints with different training strategies.

Version	Dataset	SD	Download
v1.0	HICO-DET	v1.4	HF Hub
v1.1	HICO-DET	v1.5	HF Hub
v1.2	HICO-DET + VisualGenome	v1.5	HF Hub

Note that the experimental results in our paper refer to v1.0.

v1.0 is based on Stable Diffusion v1.4 and GLIGEN. We train with a batch size of 16 for 250k steps on HICO-DET. Our paper is based on this.
v1.1 is based on Stable Diffusion v1.5 and GLIGEN. We train with a batch size of 32 for 250k steps on HICO-DET.
v1.2 is based on InteractDiffusion v1.1. We train further with a batch size of 32 for 172.5k steps on HICO-DET and VisualGenome.

Extension for AUTOMATIC1111's Stable Diffusion WebUI

We developed an extension for AUTOMATIC1111's Stable Diffusion WebUI that allows InteractDiffusion to be used on top of existing SD models. Check out the plugin at sd-webui-interactdiffusion. Note that it is still an alpha version.

Gallery

Some examples generated with InteractDiffusion, together with other DreamBooth and LoRA models.

Diffusers

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "interactdiffusion/diffusers-v1-2",
    trust_remote_code=True,
    variant="fp16", torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")

images = pipeline(
    prompt="a person is feeding a cat",
    interactdiffusion_subject_phrases=["person"],
    interactdiffusion_object_phrases=["cat"],
    interactdiffusion_action_phrases=["feeding"],
    interactdiffusion_subject_boxes=[[0.0332, 0.1660, 0.3359, 0.7305]],
    interactdiffusion_object_boxes=[[0.2891, 0.4766, 0.6680, 0.7930]],
    interactdiffusion_scheduled_sampling_beta=1,
    output_type="pil",
    num_inference_steps=50,
    ).images

images[0].save('out.jpg')
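
The box arguments in the call above appear to be fractions of the image size rather than pixel coordinates. Below is a minimal sketch of converting pixel boxes into that form. It assumes the `[x_min, y_min, x_max, y_max]` ordering used by GLIGEN-style grounding inputs. The helper name `normalize_box`, the pixel values, and the 512x512 canvas are illustrative and not part of the pipeline.

```python
from typing import List, Sequence


def normalize_box(box_px: Sequence[float], width: int, height: int) -> List[float]:
    """Convert a pixel box [x_min, y_min, x_max, y_max] to fractions of the image size."""
    x0, y0, x1, y1 = box_px
    # Reject boxes that are empty, inverted, or outside the canvas.
    if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
        raise ValueError(f"box {list(box_px)} does not fit in a {width}x{height} image")
    return [round(x0 / width, 4), round(y0 / height, 4),
            round(x1 / width, 4), round(y1 / height, 4)]


# Hypothetical pixel boxes drawn on a 512x512 canvas (assumed values).
subject_box = normalize_box([17, 85, 172, 374], 512, 512)
object_box = normalize_box([148, 244, 342, 406], 512, 512)
print(subject_box)
print(object_box)
```

With these assumed pixel values, the two results match the subject and object boxes passed to the pipeline above, so the lists can be dropped into `interactdiffusion_subject_boxes` and `interactdiffusion_object_boxes` as single-element lists.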

Reproduce & Evaluate

Change ckpt.pth in inference_batch.py to the selected checkpoint.

Run inference with InteractDiffusion to synthesize the test set of HICO-DET based on the ground truth.

python inference_batch.py --batch_size 1 --folder generated_output --seed 489 --scheduled-sampling 1.0 --half

python inference.py

Set up FGAHOI at ../FGAHOI. See the FGAHOI repo for how to set up FGAHOI and the HICO-DET dataset in data/hico_20160224_det.
Prepare for evaluation on FGAHOI. See id_prepare_inference.ipynb.

Evaluate on FGAHOI.

python main.py --backbone swin_tiny --dataset_file hico --resume weights/FGAHOI_Tiny.pth --num_verb_classes 117 --num_obj_classes 80 --merge --hierarchical_merge --task_merge --eval --hoi_path data/id_generated_output --pretrain_model_path "" --output_dir logs/id-generated-output-t

Evaluate FID and KID. For a fair comparison, we recommend resizing the HICO-DET dataset to 512x512 before performing image quality evaluation. We use torch-fidelity.
```
fidelity --gpu 0 --fid --isc --kid --input2 ~/data/hico_det_test_resize  --input1 ~/FGAHOI/data/data/id_generated_output/images/test2015
```
This should provide a brief overview of how the evaluation process works.
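
The resizing step recommended above can be sketched as follows. This is only one possible way to do it, not the authors' script. The plain (non-cropping) resize, the bicubic resampling, the function names, and the source directory are all assumptions. Only the destination directory name is taken from the fidelity command above. Pillow is required for the actual resizing.

```python
from pathlib import Path
from typing import List, Tuple

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def plan_resize(src_dir: str, dst_dir: str) -> List[Tuple[Path, Path]]:
    """Pair every image file in src_dir with its output path in dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    return [(p, dst / p.name) for p in sorted(src.iterdir())
            if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES]


def resize_all(src_dir: str, dst_dir: str, size: int = 512) -> int:
    """Resize every image to size x size and return the number of files written."""
    from PIL import Image  # Pillow; imported lazily so plan_resize works without it

    Path(dst_dir).mkdir(parents=True, exist_ok=True)
    pairs = plan_resize(src_dir, dst_dir)
    for src_path, dst_path in pairs:
        with Image.open(src_path) as img:
            # Plain resize without cropping; bicubic resampling is an assumption.
            img.convert("RGB").resize((size, size), Image.BICUBIC).save(dst_path)
    return len(pairs)


# Example call (the source path is hypothetical):
# resize_all(str(Path.home() / "data/hico_det_test"),
#            str(Path.home() / "data/hico_det_test_resize"))
```

Whichever resizing choice is used, applying the same one to every compared set keeps the FID and KID numbers comparable.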

Training

Prepare the necessary dataset and pretrained models, see DATA

Run the following command:

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt <existing_gligen_checkpoint> --name test --batch_size=4 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name <existing SD v1.4/v1.5 checkpoint>
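
As a quick sanity check on the flags in the command above, the effective batch size per optimizer update can be computed as below. This sketch assumes `--batch_size` is the per-process (per-GPU) batch size, as is typical for data-parallel training launched with torchrun. The function name is illustrative.

```python
def effective_batch_size(num_processes: int, per_process_batch: int,
                         accumulation_steps: int) -> int:
    """Samples contributing to one optimizer update under data-parallel training."""
    return num_processes * per_process_batch * accumulation_steps


# Values from the command above:
# --nproc_per_node=2, --batch_size=4, --gradient_accumulation_step 2
print(effective_batch_size(2, 4, 2))
```

Under that assumption the command yields 2 x 4 x 2 = 16 samples per update, which agrees with the batch size of 16 reported for v1.0. When changing the number of GPUs, adjust the other two flags to keep this product constant.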

My settings:

  ```bash
  CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt ./interact-diffusion-v1.pth --name test --batch_size=4 --gradient_accumulation_step 1 --total_iters 500000 --amp true --disable_inference_in_training true
  
  CUDA_VISIBLE_DEVICES=1 torchrun --nproc_per_node=1 main.py --yaml_file configs/hoi_hico_text.yaml --name test --batch_size=4 --gradient_accumulation_step 1 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt
  
  CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt ./interact-diffusion-v1.pth --name test --batch_size=4 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true
  ```

Train script (latest version):

  ```bash
  CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node=2 --master_port=1122 main.py --yaml_file configs/E2VG_stable_diffusion_config.yaml --name test_stablediffu_baseline_2gpus5batchsize --batch_size=5 --gradient_accumulation_step 1 --total_iters 1000000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt

  CUDA_VISIBLE_DEVICES=1 torchrun --nproc_per_node=1 --master_port=8199 main.py --yaml_file configs/E2VG_stable_diffusion_config.yaml --name test_stablediffu_baseline_1gpus5batchsize_2gradientaccum --batch_size=5 --gradient_accumulation_step 2 --total_iters 2000000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt

  # Debug mode: shows distributed error info
  TORCH_DISTRIBUTED_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=12345 main.py --yaml_file configs/hoi_hico_text.yaml --name test_stablediffu_20BS --batch_size=20 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt
  ```
  
  ```bash
  # =========================================== E2VG TRAIN =========================================
  conda activate FGT_ENV310_Diffusion
  
  CUDA_VISIBLE_DEVICES=1,2 torchrun --nproc_per_node=2 --master_port=1111 main.py --yaml_file configs/E2VG_stable_diffusion_config.yaml --name test_stablediffu_baseline_2gpus5batchsize --batch_size=5 --gradient_accumulation_step 1 --total_iters 1000000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt
  
  # =========================================== E2VG TEST ==========================================
  python inference_BE2VG_MultipleGPUs_MultipleWorker.py

  # =========================================== E2VG inference good seeds ==========================
  # seeds: 13798, 555
  
  # ================================= training log with tensorboard ================================
  # (1) View the loss for the BE2VG_stable_diffusion_baseline code
  pip install tensorboard --upgrade
  tensorboard --logdir ./ --bind_all     # may raise an error
  tensorboard --logdir ./ --bind_all --load_fast=false
  # (2) To view from Windows, open a command prompt and enter:
  #     VS Code ==> Ports ==> Add Port ==> 6006 ==> right-click ==> Open in Browser
  ssh  jianhuili@c6000.dynip.ntu.edu.sg -L 6006:localhost:6006
  # (3) View the loss in a browser on Windows
  #     http://localhost:6006/

  ```

TODO

Code Release
HuggingFace demo
WebUI extension
Diffusers

Citation

@InProceedings{Hoe_2024_CVPR,
    author    = {Hoe, Jiun Tian and Jiang, Xudong and Chan, Chee Seng and Tan, Yap-Peng and Hu, Weipeng},
    title     = {InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {6180-6189}
}

Acknowledgement

This work is developed based on the codebase of GLIGEN and LDM.
