- [2024/12/13] 🔥 We release the training code, inference code, and pre-trained model!
```bash
# install requirements
git clone https://github.com/TempleX98/EasyRef.git
cd EasyRef
pip install -r requirements.txt

# download the EasyRef checkpoints
git lfs install
git clone https://huggingface.co/zongzhuofan/EasyRef
mv EasyRef checkpoints

# download the base model and the multimodal LLM
git clone https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0
git clone https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct
```
Then you can use the demo notebook.
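If you want a quick sanity check that the downloaded weights load correctly, the sketch below works (it only loads the SDXL base model and Qwen2-VL; it does not build the EasyRef pipeline itself, which is assembled in the demo notebook):

```python
# Minimal load check for the downloaded weights; paths match the clone
# locations above. This does NOT construct the EasyRef pipeline.
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
mllm = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen2-VL-2B-Instruct", torch_dtype=torch.float16
)
processor = AutoProcessor.from_pretrained("Qwen2-VL-2B-Instruct")
print("SDXL base and Qwen2-VL-2B-Instruct loaded successfully.")
```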
More visualization examples are available on our project page.
EasyRef adopts a progressive training scheme: (1) alignment pretraining stage: facilitate the adaptation of the MLLM's visual signals to the diffusion model; (2) single-reference finetuning stage: enhance the MLLM's capacity for fine-grained visual perception and identity recognition; (3) multi-reference finetuning stage: enable the MLLM to accurately comprehend the common elements across multiple reference images.
You should first organize your training data according to the following format and save your dataset as a JSON file.
In this example, the image `1.jpg` serves as both the reconstruction target and the reference image:
```json
[{
    "image_file": [
        "1.jpg"
    ],
    "text": [
        "A fantasy character."
    ],
    "target": "1.jpg"
}]
```
In this example, we crop the face region of the reconstruction target `1.jpg` using the face detection result, then use the cropped image as the reference image:
```json
[{
    "face": [
        [[555.5458374023438, 71.08999633789062, 668.3660278320312, 233.39280700683594, 0.9999279975891113]]
    ],
    "text": [
        "A fantasy character."
    ],
    "target": "1.jpg"
}]
```
In this example, we select the image `1.jpg` from the group as the reconstruction target, and the remaining images serve as reference images:
```json
[{
    "image_file": [
        "2.jpg",
        "3.jpg",
        "4.jpg",
        "5.jpg"
    ],
    "text": [
        "A fantasy character."
    ],
    "target": "1.jpg"
}]
```
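For reference, a minimal reader for the JSON formats above might look like the sketch below. This is only illustrative (the actual dataset class lives in the training code); the field semantics follow the three examples:

```python
# Illustrative reader for the JSON formats above (sketch only).
#   "image_file": reference image paths
#   "face":       face boxes [x1, y1, x2, y2, score] used to crop the target
#   "text":       text prompts
#   "target":     the reconstruction target image
import json
from PIL import Image

def iter_samples(json_path, image_root="."):
    with open(json_path) as f:
        entries = json.load(f)
    for entry in entries:
        target = Image.open(f"{image_root}/{entry['target']}")
        prompts = entry["text"]
        references = [Image.open(f"{image_root}/{p}") for p in entry.get("image_file", [])]
        # If face boxes are given instead of reference files, crop the target's
        # face region and use the crop as the reference image.
        for boxes in entry.get("face", []):
            for x1, y1, x2, y2, score in boxes:
                references.append(target.crop((x1, y1, x2, y2)))
        yield {"target": target, "references": references, "prompts": prompts}

for sample in iter_samples("train.json"):
    print(sample["prompts"][0], len(sample["references"]))
```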
During alignment pretraining, we only train the final layer of the MLLM, the projection layer, and the cross-attention adapters. We provide the training script here.
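As a rough sketch of which parameters are unfrozen in this stage (the module names below are placeholders, not the repository's actual attributes):

```python
# Hypothetical parameter-freezing sketch for alignment pretraining; the
# attribute names (final_layer, projector, adapters) are placeholders.
def configure_pretraining_trainables(mllm, projector, adapters):
    for p in mllm.parameters():            # freeze the MLLM backbone
        p.requires_grad_(False)
    for module in (mllm.final_layer, projector, adapters):
        for p in module.parameters():      # train final layer, projection, cross-attention adapters
            p.requires_grad_(True)
```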
During the single-reference finetuning stage, we jointly train the MLLM's final layer, the projection layer, the newly-added LoRA layers, and the cross-attention adapters. The model is initialized from the checkpoint obtained by alignment pretraining. We provide the training script here.
In the multi-reference finetuning stage, we train the same components as in the previous stage but use different training data and augmentations. EasyRef is trained on 32 A100 GPUs with 80GB memory using DeepSpeed ZeRO-2. To train on fewer GPUs, you can reduce `num_processes` in the script and increase `gradient_accumulation_steps` accordingly.
We provide the training script with DeepSpeed here.
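For example (illustrative numbers; the per-GPU batch size below is a placeholder, the formula holds regardless), the effective batch size is preserved like this:

```python
# Effective batch size = num_processes * per-GPU batch size * gradient_accumulation_steps.
# Illustrative check: 8 GPUs with 4 accumulation steps match the default 32-GPU setup.
def effective_batch_size(num_processes, per_gpu_batch_size, gradient_accumulation_steps):
    return num_processes * per_gpu_batch_size * gradient_accumulation_steps

assert effective_batch_size(32, 1, 1) == effective_batch_size(8, 1, 4)
```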
We provide the inference code for EasyRef with SDXL in `easyref_demo`.
- EasyRef performs best when provided with multiple reference images (more than 2).
- To ensure better identity preservation, we strongly recommend that users upload multiple square face images, ensuring the face occupies the majority of each image.
- Using multimodal prompts (reference images together with a non-empty text prompt) can achieve better results.
- We set `scale=1.0` by default. Lowering the `scale` value leads to more diverse but less consistent generation results.
- Our codebase is built upon the IP-Adapter repository.
- EasyRef is inspired by LI-DiT, IP-Adapter, ControlNet, InstantID, InstantStyle, PhotoMaker and other image personalization methods. Thanks for their great work!
This project is released under Apache License 2.0. We release our checkpoints for research purposes only. Users are granted the freedom to create images using this tool, but they are expected to comply with local laws and utilize it in a responsible manner. The developers do not assume any responsibility for potential misuse by users.
If you find EasyRef useful for your research and applications, please cite us using this BibTeX:
```bibtex
@article{easyref,
  title={EasyRef: Omni-Generalized Group Image Reference for Diffusion Models via Multimodal LLM},
  author={Zong, Zhuofan and Jiang, Dongzhi and Ma, Bingqi and Song, Guanglu and Shao, Hao and Shen, Dazhong and Liu, Yu and Li, Hongsheng},
  journal={arXiv preprint arXiv:2412.09618},
  year={2024}
}
```