Jiun Tian Hoe, Xudong Jiang, Chee Seng Chan, Yap Peng Tan, Weipeng Hu
Project Page | paper | arXiv | WebUI | Demo | Video | Diffuser | Colab
- Existing methods lack the ability to control the interactions between objects in the generated content.
- We propose a pluggable interaction control model, called InteractDiffusion, that extends existing pre-trained T2I diffusion models so they can be better conditioned on interactions.
- [2024.3.13] Diffusers code is available here.
- [2024.3.8] Demo is available at Huggingface Spaces.
- [2024.3.6] Code is released.
- [2024.2.27] InteractDiffusion paper is accepted at CVPR 2024.
- [2023.12.12] InteractDiffusion paper is released. The WebUI for InteractDiffusion is available as an alpha version.
| Model | Interaction Controllability (Tiny) | Interaction Controllability (Large) | FID | KID |
|---|---|---|---|---|
| v1.0 | 29.53 | 31.56 | 18.69 | 0.00676 |
| v1.1 | 30.20 | 31.96 | 17.90 | 0.00635 |
| v1.2 | 30.73 | 33.10 | 17.32 | 0.00585 |
Interaction Controllability is measured using the FGAHOI detection score. In this table, we report the Full subset in the Default setting with the Swin-Tiny and Swin-Large backbones. More details on the protocol are in the paper.
We provide three checkpoints with different training strategies.
| Version | Dataset | SD | Download |
|---|---|---|---|
| v1.0 | HICO-DET | v1.4 | HF Hub |
| v1.1 | HICO-DET | v1.5 | HF Hub |
| v1.2 | HICO-DET + VisualGenome | v1.5 | HF Hub |
Note that the experimental results in our paper refer to v1.0.
- v1.0 is based on Stable Diffusion v1.4 and GLIGEN. We train at a batch size of 16 for 250k steps on HICO-DET. Our paper is based on this.
- v1.1 is based on Stable Diffusion v1.5 and GLIGEN. We train at a batch size of 32 for 250k steps on HICO-DET.
- v1.2 is based on InteractDiffusion v1.1. We train further at a batch size of 32 for 172.5k steps on HICO-DET and VisualGenome.
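To compare the three training schedules above, the number of training samples processed can be estimated as batch size times steps. This is only a rough sketch: it assumes the stated batch sizes are global batch sizes and that each step consumes one full batch, neither of which is spelled out here. The `Schedule` class is purely illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Schedule:
    """Illustrative container for one training run (not part of the codebase)."""
    name: str
    batch_size: int  # assumed to be the global batch size
    steps: int

    @property
    def samples_seen(self) -> int:
        # One full batch per step is an assumption.
        return self.batch_size * self.steps


schedules = [
    Schedule("v1.0", 16, 250_000),
    Schedule("v1.1", 32, 250_000),
    Schedule("v1.2 (continued from v1.1)", 32, 172_500),
]

for s in schedules:
    print(f"{s.name}: {s.samples_seen:,} samples")

# v1.2 starts from the v1.1 weights, so its cumulative total includes v1.1's samples.
cumulative_v12 = schedules[1].samples_seen + schedules[2].samples_seen
print(f"v1.2 cumulative: {cumulative_v12:,} samples")
```

Under these assumptions, v1.1 processes twice as many samples as v1.0, and v1.2 adds roughly 5.5 million more on top of v1.1.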
We developed an extension for AUTOMATIC1111's Stable Diffusion WebUI that allows InteractDiffusion to be used with existing SD models. Check out the plugin at sd-webui-interactdiffusion. Note that it is still in alpha.
Some examples generated with InteractDiffusion, together with other DreamBooth and LoRA models.
```python
from diffusers import DiffusionPipeline
import torch

# Load the InteractDiffusion v1.2 pipeline; its custom pipeline code requires trust_remote_code.
pipeline = DiffusionPipeline.from_pretrained(
    "interactdiffusion/diffusers-v1-2",
    trust_remote_code=True,
    variant="fp16", torch_dtype=torch.float16
)
pipeline = pipeline.to("cuda")

# One interaction: subject, action, and object phrases, with a box for the subject and the object.
images = pipeline(
    prompt="a person is feeding a cat",
    interactdiffusion_subject_phrases=["person"],
    interactdiffusion_object_phrases=["cat"],
    interactdiffusion_action_phrases=["feeding"],
    interactdiffusion_subject_boxes=[[0.0332, 0.1660, 0.3359, 0.7305]],
    interactdiffusion_object_boxes=[[0.2891, 0.4766, 0.6680, 0.7930]],
    interactdiffusion_scheduled_sampling_beta=1,
    output_type="pil",
    num_inference_steps=50,
).images

images[0].save('out.jpg')
```
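The box values in the example above look like normalized `[x_min, y_min, x_max, y_max]` coordinates in the range 0 to 1. That format is an assumption inferred from the numbers, not something this README states. Under that assumption, a small helper can convert pixel boxes into the expected form. `normalize_box` is a hypothetical helper, and the pixel boxes and 512x512 canvas below were chosen only because they reproduce the values in the example.

```python
from typing import List, Sequence


def normalize_box(box_px: Sequence[float], width: int, height: int) -> List[float]:
    """Convert a pixel box (x_min, y_min, x_max, y_max) to normalized coordinates.

    Assumes the pipeline expects normalized xyxy boxes (not confirmed by the README).
    """
    x0, y0, x1, y1 = box_px
    if not (0 <= x0 < x1 <= width and 0 <= y0 < y1 <= height):
        raise ValueError(f"box {tuple(box_px)} is empty or outside a {width}x{height} image")
    return [
        round(x0 / width, 4),
        round(y0 / height, 4),
        round(x1 / width, 4),
        round(y1 / height, 4),
    ]


# Pixel boxes on an assumed 512x512 canvas.
subject_box = normalize_box((17, 85, 172, 374), 512, 512)
object_box = normalize_box((148, 244, 342, 406), 512, 512)
print(subject_box)
print(object_box)
```

The results can be passed as `interactdiffusion_subject_boxes=[subject_box]` and `interactdiffusion_object_boxes=[object_box]`.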
- Change `ckpt.pth` in inference_batch.py to the selected checkpoint.
- Run inference with InteractDiffusion to synthesize the test set of HICO-DET based on the ground truth.
  ```bash
  python inference_batch.py --batch_size 1 --folder generated_output --seed 489 --scheduled-sampling 1.0 --half
  python inference.py
  ```
- Set up FGAHOI at `../FGAHOI`. See the FGAHOI repo for how to set up FGAHOI and the HICO-DET dataset in `data/hico_20160224_det`.
- Prepare for evaluation on FGAHOI. See `id_prepare_inference.ipynb`.
- Evaluate on FGAHOI.
  ```bash
  python main.py --backbone swin_tiny --dataset_file hico --resume weights/FGAHOI_Tiny.pth --num_verb_classes 117 --num_obj_classes 80 --output_dir logs --merge --hierarchical_merge --task_merge --eval --hoi_path data/id_generated_output --pretrain_model_path "" --output_dir logs/id-generated-output-t
  ```
- Evaluate FID and KID. For a fair comparison, we recommend resizing the hico_det dataset to 512x512 before performing image quality evaluation. We use torch-fidelity.
  ```bash
  fidelity --gpu 0 --fid --isc --kid --input2 ~/data/hico_det_test_resize --input1 ~/FGAHOI/data/data/id_generated_output/images/test2015
  ```
- This should provide a brief overview of how the evaluation process works.
- Prepare the necessary dataset and pretrained models; see DATA.
- Run the following command:
  ```bash
  CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt <existing_gligen_checkpoint> --name test --batch_size=4 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name <existing SD v1.4/v1.5 checkpoint>
  ```
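The flags in the training command above interact with each other. The following is a rough sketch of the arithmetic, assuming `--batch_size` is the per-process batch size and `--total_iters` counts forward/backward iterations rather than optimizer updates. Neither assumption is confirmed here, but under them the command gives an effective batch size of 16 and 250k optimizer updates, which is consistent with the v1.0 schedule described earlier. The function names are illustrative only.

```python
def effective_batch_size(nproc_per_node: int, batch_size: int, grad_accum_steps: int) -> int:
    """Samples contributing to one optimizer update (assumes per-process batch size)."""
    return nproc_per_node * batch_size * grad_accum_steps


def optimizer_updates(total_iters: int, grad_accum_steps: int) -> int:
    """Optimizer updates, assuming total_iters counts forward/backward iterations."""
    return total_iters // grad_accum_steps


# Values taken from the torchrun command: 2 processes, batch size 4, accumulation 2.
eff = effective_batch_size(nproc_per_node=2, batch_size=4, grad_accum_steps=2)
updates = optimizer_updates(total_iters=500_000, grad_accum_steps=2)
print(f"effective batch size: {eff}, optimizer updates: {updates:,}")
```

When changing the number of GPUs, keeping this product constant (for example, 1 GPU with `--batch_size=4` and an accumulation step of 4) should preserve the effective batch size under the same assumptions.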
=============================My settings================================
```bash
CUDA_VISIBLE_DEVICES=2 torchrun --nproc_per_node=1 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt ./interact-diffusion-v1.pth --name test --batch_size=4 --gradient_accumulation_step 1 --total_iters 500000 --amp true --disable_inference_in_training true
CUDA_VISIBLE_DEVICES=1 torchrun --nproc_per_node=1 main.py --yaml_file configs/hoi_hico_text.yaml --name test --batch_size=4 --gradient_accumulation_step 1 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt ./interact-diffusion-v1.pth --name test --batch_size=4 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true
```
=============================Train script (latest version)================================
```bash
CUDA_VISIBLE_DEVICES=2,3 torchrun --nproc_per_node=2 --master_port=1122 main.py --yaml_file configs/E2VG_stable_diffusion_config.yaml --name test_stablediffu_baseline_2gpus5batchsize --batch_size=5 --gradient_accumulation_step 1 --total_iters 1000000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt
CUDA_VISIBLE_DEVICES=1 torchrun --nproc_per_node=1 --master_port=8199 main.py --yaml_file configs/E2VG_stable_diffusion_config.yaml --name test_stablediffu_baseline_1gpus5batchsize_2gradientaccum --batch_size=5 --gradient_accumulation_step 2 --total_iters 2000000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt
# Debug mode: shows distributed error info
TORCH_DISTRIBUTED_DEBUG=INFO CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 --master_port=12345 main.py --yaml_file configs/hoi_hico_text.yaml --name test_stablediffu_20BS --batch_size=20 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt
```
```bash
# =========================================== E2VG TRAIN =========================================
conda activate FGT_ENV310_Diffusion
CUDA_VISIBLE_DEVICES=1,2 torchrun --nproc_per_node=2 --master_port=1111 main.py --yaml_file configs/E2VG_stable_diffusion_config.yaml --name test_stablediffu_baseline_2gpus5batchsize --batch_size=5 --gradient_accumulation_step 1 --total_iters 1000000 --amp true --disable_inference_in_training true --official_ckpt_name ./v1-5-pruned-emaonly.ckpt
# =========================================== E2VG TEST ==========================================
python inference_BE2VG_MultipleGPUs_MultipleWorker.py
# =========================================== E2VG inference good seeds ==========================================
# seed: 13798, 555
# ================================= training log with tensorboard ================================
# (1) View the loss for the BE2VG_stable_diffusion_baseline code
pip install tensorboard --upgrade
tensorboard --logdir ./ --bind_all  # may raise an error
tensorboard --logdir ./ --bind_all --load_fast=false
# (2) To view from Windows, open a command line and enter:
# VS Code ==> Ports ==> Add Port ==> 6006 ==> right-click, Open in Browser
ssh jianhuili@c6000.dynip.ntu.edu.sg -L 6006:localhost:6006
# (3) View the loss in a browser on Windows
# http://localhost:6006/
```
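Before opening the browser, it can help to confirm that the forwarded TensorBoard port is actually reachable. The following is a minimal sketch using only the Python standard library. `port_is_open` is a hypothetical helper, and port 6006 is simply the one used in the notes above.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open("localhost", 6006):
        print("TensorBoard port is reachable at http://localhost:6006/")
    else:
        print("Nothing is listening on localhost:6006; check the SSH tunnel and TensorBoard")
```

If the check fails while TensorBoard is running on the server, the SSH tunnel is the most likely place to look.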
- Code Release
- HuggingFace demo
- WebUI extension
- Diffuser
```bibtex
@InProceedings{Hoe_2024_CVPR,
    author    = {Hoe, Jiun Tian and Jiang, Xudong and Chan, Chee Seng and Tan, Yap-Peng and Hu, Weipeng},
    title     = {InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {6180-6189}
}
```

This work is developed based on the codebase of GLIGEN and LDM.












