
EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM

¹CUHK MMLab, ²SenseTime Research, ³Shanghai AI Laboratory


EasyRef is capable of modeling the consistent visual elements of various group image references with a single generalist multimodal LLM in a zero-shot setting.

Release

Installation

# install requirements
git clone https://github.com/TempleX98/EasyRef.git
cd EasyRef
pip install -r requirements.txt

# download the models
git lfs install
git clone https://huggingface.co/zongzhuofan/EasyRef
mv EasyRef checkpoints

# download the basemodel and the multimodal llm
git clone https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
git clone https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct

# then you can run the demo notebook (see Inference below)

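As a quick sanity check, the SDXL base weights downloaded above should load directly with diffusers (a minimal sketch, assuming diffusers is installed via the requirements; EasyRef's own loading code lives in the demo notebook and may differ):

import torch
from diffusers import StableDiffusionXLPipeline

# load the cloned SDXL base weights in half precision to save memory
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
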
Demos

More visualization examples are available on our project page.

Comparison with IP-Adapter

Compatibility with ControlNet

Training

EasyRef adopts a progressive training scheme: (1) an alignment pretraining stage, which facilitates the adaptation of the MLLM's visual signals to the diffusion model; (2) a single-reference finetuning stage, which enhances the MLLM's capacity for fine-grained visual perception and identity recognition; (3) a multi-reference finetuning stage, which enables the MLLM to accurately comprehend the common elements across multiple image references.

Data Format

You should first organize your training data in the following format and save your dataset as a JSON file. In this example, the image 1.jpg serves as both the reconstruction target and the reference image.

[{
    "image_file": [
        "1.jpg"
    ],
    "text": [
        "A fantasy character."
    ],
    "target": "1.jpg"
}]

In this example, we crop the face region of the reconstruction target 1.jpg using the face detection result, then use the cropped image as the reference image.

[{
    "face": [
        [[555.5458374023438, 71.08999633789062, 668.3660278320312, 233.39280700683594, 0.9999279975891113]]
    ],    
    "text": [
        "A fantasy character."
    ],
    "target": "1.jpg"   
}]
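
Each face entry appears to be a detection box of the form [x1, y1, x2, y2, confidence]. A minimal sketch of how such a box could be turned into a cropped reference image (the crop_face helper below is illustrative, not the repository's code):

from PIL import Image

# illustrative helper: crop a face region from the target image,
# assuming the box format [x1, y1, x2, y2, confidence]
def crop_face(image_path, box):
    x1, y1, x2, y2, _score = box
    return Image.open(image_path).crop((int(x1), int(y1), int(x2), int(y2)))

reference = crop_face("1.jpg", [555.55, 71.09, 668.37, 233.39, 0.9999])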

In this example, we select the image 1.jpg from the group as the reconstruction target, and the remaining images serve as reference images.

[{
    "image_file": [
        "2.jpg",
        "3.jpg",
        "4.jpg",
        "5.jpg"
    ],
    "text": [
        "A fantasy character."
    ],
    "target": "1.jpg"
}]
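
Taken together, a record always names a target image, with references supplied either as separate image_file paths or as face boxes on the target. Below is a minimal sketch of a loader for this format, inferred from the three examples above (not the repository's actual data pipeline):

import json
from PIL import Image

def load_records(json_path):
    # yields (captions, reference images, target image) per record
    with open(json_path) as f:
        records = json.load(f)
    for rec in records:
        target = Image.open(rec["target"])
        if "image_file" in rec:
            # references are separate image files
            refs = [Image.open(p) for p in rec["image_file"]]
        else:
            # references are face crops of the target; each box is
            # assumed to be [x1, y1, x2, y2, confidence]
            refs = [target.crop(tuple(int(v) for v in box[:4]))
                    for boxes in rec["face"] for box in boxes]
        yield rec["text"], refs, target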

Alignment Pretraining

During pretraining, we only train the final layer of the MLLM, the projection layer, and the cross-attention adapters. We provide the training script here.

Single-reference Finetuning

During this stage, we jointly train the MLLM, its final layer, the projection layer, the newly added LoRA layers, and the cross-attention adapters. The model is initialized from the checkpoint produced by alignment pretraining. We provide the training script here.

Multi-reference Finetuning

We train the same components as in the previous stage but with different training data and augmentations. EasyRef is trained on 32 A100 GPUs with 80GB memory using DeepSpeed ZeRO-2. To train on fewer GPUs, reduce num_processes in the script and increase gradient_accumulation_steps accordingly; for example, halving num_processes while doubling gradient_accumulation_steps keeps the effective batch size unchanged. We provide the training script with DeepSpeed here.

Inference

We provide the inference code of EasyRef with SDXL in easyref_demo.
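
For orientation, the sketch below shows what inference could look like; the EasyRefPipeline name and all of its arguments are hypothetical stand-ins modeled on the paths from the Installation section, so consult easyref_demo for the actual interface:

from PIL import Image
from easyref import EasyRefPipeline  # hypothetical import; see easyref_demo

# hypothetical construction: paths follow the Installation section above
pipe = EasyRefPipeline(
    base_model_path="stable-diffusion-xl-base-1.0",   # SDXL base
    mllm_path="Qwen2-VL-2B-Instruct",                 # multimodal LLM
    easyref_ckpt="checkpoints",                       # EasyRef weights
)

# multiple references plus a non-empty text prompt work best (see Usage Tips)
references = [Image.open(f"ref_{i}.jpg") for i in range(4)]
images = pipe.generate(
    pil_images=references,
    prompt="A fantasy character.",
    scale=1.0,   # default; lower it for more diverse, less consistent results
)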

Usage Tips

  • EasyRef performs best when provided with multiple reference images (more than two).
  • For better identity preservation, we strongly recommend uploading multiple square face images in which the face occupies the majority of each image.
  • Multimodal prompts (reference images combined with a non-empty text prompt) achieve better results.
  • We set scale=1.0 by default. Lowering the scale yields more diverse but less consistent generation results.

Acknowledgements

Disclaimer

This project is released under the Apache License 2.0. We release our checkpoints for research purposes only. Users are free to create images with this tool, but they are expected to comply with local laws and use it responsibly. The developers assume no responsibility for potential misuse.

Cite

If you find EasyRef useful for your research and applications, please cite us using this BibTeX:

@article{easyref,
  title={EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM},
  author={Zong, Zhuofan and Jiang, Dongzhi and Ma, Bingqi and Song, Guanglu and Shao, Hao and Shen, Dazhong and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2412.09618},  
  year={2024}
}
