Yunxin Li, Haoyuan Shi, Baotian Hu*, Longyue Wang*,
Jiashun Zhu, Jinyi Xu, Zhen Zhao, and Min Zhang
(* Corresponding Authors)
Harbin Institute of Technology, Shenzhen
🚀 Welcome to the repo of Anim-Director.
If you appreciate our project, please consider giving us a star ⭐ on GitHub to stay updated with the latest developments.
TL;DR: Anim-Director is an autonomous animation-making agent in which an LMM interacts seamlessly with generative tools to create detailed animated videos from simple narratives.
Given a narrative, Anim-Director first polishes the narrative and generates the director's scripts using GPT-4. GPT-4 then interacts with the image generation tools to produce the scene images via Image + Text → Image. Subsequently, Anim-Director produces videos from the generated scene images and textual prompts, i.e., Image + Text → Video. To improve the quality of images and videos, we realize deep interaction between LMMs and generative tools, enabling GPT-4 to refine, evaluate, and select the best candidate through a self-reflection reasoning pathway.

A visual example of Anim-Director.

Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages. First, Anim-Director generates a coherent storyline from user input, followed by a detailed director's script that encompasses character profiles, interior/exterior settings, and context-coherent scene descriptions covering the appearing characters, interiors or exteriors, and scene events. Second, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions with images of the appearing characters and settings. Third, the scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous, without manual intervention: the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best candidate to optimize the final output. To assess the effectiveness of our framework, we collect varied short narratives and adopt various image/video evaluation metrics, including visual consistency and video quality. The experimental results and case studies demonstrate Anim-Director's versatility and significant potential to streamline animation creation.
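The sketch below summarizes this three-stage loop in plain Python. It is only an illustration of the workflow described above, not the repo's actual code: the `llm`, `t2i`, and `i2v` callables and the prompt wording are hypothetical placeholders for GPT-4 and the generative tools.

```python
from typing import Callable, List

def anim_director(
    narrative: str,
    llm: Callable[[str], str],      # e.g. a GPT-4 chat call, text in -> text out
    t2i: Callable[[str], str],      # Image + Text -> Image tool, returns an image path
    i2v: Callable[[str, str], str], # Image + Text -> Video tool, returns a video path
    num_candidates: int = 3,
) -> List[str]:
    # Stage 1: polish the narrative and draft a director's script, one scene per line.
    script = llm("Polish this narrative and write a director's script with character "
                 "profiles, settings, and one scene description per line:\n" + narrative)
    scenes = [line for line in script.splitlines() if line.strip()]

    videos = []
    for scene in scenes:
        # Stage 2: generate several scene-image candidates, then let the LMM
        # evaluate them and select the best one (self-reflection).
        image_prompt = llm("Write a consistent image prompt for this scene: " + scene)
        candidates = [t2i(image_prompt) for _ in range(num_candidates)]
        reply = llm("Return only the index of the best image among: " + ", ".join(candidates))
        best_image = candidates[int(reply) % num_candidates]

        # Stage 3: animate the selected scene image with an LMM-written motion prompt.
        motion_prompt = llm("Describe the motion in this scene for a video model: " + scene)
        videos.append(i2v(best_image, motion_prompt))
    return videos
```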
demo1_compressed.mp4
demo2_compressed.mp4
demo3_compressed.mp4
Midjourney and Pika are paid services, while Stable Diffusion 3 and PIA are free. To reproduce the animation quality shown in our paper and demos, choose Midjourney for T2I and Pika for (T+I)2V.
Feel free to contact us for more details (including how to integrate Pika into our agent).
conda create -n AnimDirector python==3.10.11
conda activate AnimDirector
pip install -r requirements.txt
To use Stable Diffusion 3 for T2I, you need to upgrade your torch version along with all related packages.
To use PIA for (T+I)2V, you need to prepare the following checkpoints:
conda install git-lfs
git lfs install
git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 models/StableDiffusion/
git clone https://huggingface.co/Leoxing/PIA models/PIA/
bash download_bashscripts/2-RcnzCartoon.sh
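After the downloads finish, a quick sanity check can save a failed run later. The snippet below only verifies the two clone targets from the commands above; adjust the paths if you placed the checkpoints elsewhere.

```python
from pathlib import Path

# Confirm the checkpoints cloned above are in place and non-empty
# (paths mirror the git clone targets in the commands above).
REQUIRED_DIRS = ["models/StableDiffusion", "models/PIA"]

for d in REQUIRED_DIRS:
    path = Path(d)
    files = [p for p in path.rglob("*") if p.is_file()] if path.exists() else []
    status = f"{len(files)} files" if files else "MISSING or empty"
    print(f"{d}: {status}")
```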
To use MJ for T2I, you need to set up stable-diffusion-webui by following the instructions here.
After that, run:
bash code/StableDiffusion/webui.sh --nowebui
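With --nowebui, stable-diffusion-webui runs in API-only mode (by default on port 7861). As a rough smoke test, and assuming that default host/port and the standard /sdapi/v1/txt2img route, you can query it from Python like this:

```python
import base64
import requests

# Assumes stable-diffusion-webui was started with --nowebui (API-only mode),
# which serves the HTTP API on http://127.0.0.1:7861 by default.
API_URL = "http://127.0.0.1:7861"

payload = {
    "prompt": "a cartoon fox walking through a sunny forest",
    "steps": 30,
    "width": 512,
    "height": 512,
}
resp = requests.post(f"{API_URL}/sdapi/v1/txt2img", json=payload, timeout=300)
resp.raise_for_status()

# The API returns images as base64 strings; save the first one to disk.
image_b64 = resp.json()["images"][0]
with open("txt2img_test.png", "wb") as f:
    f.write(base64.b64decode(image_b64))
print("saved txt2img_test.png")
```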
Sign up for an Imgur account.
Obtain your Imgur client_id, client_secret, access_token, and refresh_token by following the instructions here.
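The Imgur credentials let the agent host intermediate images at publicly reachable URLs (useful when a tool such as Midjourney expects an image URL rather than a local file). A minimal, illustrative upload using Imgur's v3 endpoint and only the client_id looks like this; the full token set is needed only if you upload to your own account.

```python
import base64
import requests

# Minimal anonymous Imgur upload sketch; CLIENT_ID is a placeholder for the
# client_id obtained above.
CLIENT_ID = "YOUR_IMGUR_CLIENT_ID"

def upload_to_imgur(image_path: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        "https://api.imgur.com/3/image",
        headers={"Authorization": f"Client-ID {CLIENT_ID}"},
        data={"image": image_b64, "type": "base64"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["data"]["link"]  # public URL of the uploaded image

print(upload_to_imgur("txt2img_test.png"))
```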
Run the following command to get the scripts:
python code/script_gen.py
- The generated scripts will be saved as code/result/scripts.json.
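The exact schema of scripts.json is determined by the prompts in code/script_gen.py. As a hedged illustration (the "scenes" and "description" keys below are assumptions, not the repo's schema), you can inspect the output like this:

```python
import json

# Peek at the generated director's script; print the raw object first to see
# the actual schema, since the keys used below are only illustrative.
with open("code/result/scripts.json", encoding="utf-8") as f:
    scripts = json.load(f)

print(json.dumps(scripts, ensure_ascii=False, indent=2)[:1000])

if isinstance(scripts, dict) and "scenes" in scripts:
    for i, scene in enumerate(scripts["scenes"], start=1):
        print(i, scene.get("description", scene) if isinstance(scene, dict) else scene)
```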
Since Midjourney does not provide an official API, we use the third-party API platform GoAPI to obtain the mj_api_key.
Run the following command to get the T2I results:
python code/image_gen_mj.py
- The generated images will be saved in code/result/image/mj.
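If you want to call GoAPI yourself, the flow is typically submit-then-poll. The sketch below only shows that shape: the endpoint paths, headers, and field names are placeholder assumptions, so take the real ones from GoAPI's current documentation and from code/image_gen_mj.py.

```python
import time
import requests

# Rough submit-then-poll shape for a GoAPI Midjourney imagine task.
# Endpoint paths and payload fields are PLACEHOLDER ASSUMPTIONS; use the
# values from GoAPI's docs / code/image_gen_mj.py instead.
MJ_API_KEY = "YOUR_GOAPI_KEY"
BASE_URL = "https://api.goapi.ai"  # assumption: check GoAPI docs

def mj_imagine(prompt: str) -> str:
    resp = requests.post(
        f"{BASE_URL}/mj/v2/imagine",  # assumption: check GoAPI docs
        headers={"X-API-KEY": MJ_API_KEY},
        json={"prompt": prompt},
        timeout=60,
    )
    resp.raise_for_status()
    task_id = resp.json()["task_id"]

    while True:  # poll until the task finishes
        status = requests.post(
            f"{BASE_URL}/mj/v2/fetch",  # assumption: check GoAPI docs
            headers={"X-API-KEY": MJ_API_KEY},
            json={"task_id": task_id},
            timeout=60,
        ).json()
        if status.get("status") == "finished":
            return status["task_result"]["image_url"]
        time.sleep(10)
```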
Run the following command to get the T2I results:
python code/image_gen_pia.py
- The generated images will be saved in code/result/image/sd3.
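If you take the free route, SD3 text-to-image can also be driven directly from diffusers as a quick smoke test, separate from the repo's image-generation script. This requires a recent torch/diffusers (per the note above) and access to the gated stabilityai/stable-diffusion-3-medium-diffusers weights on Hugging Face.

```python
import torch
from diffusers import StableDiffusion3Pipeline

# Minimal SD3 text-to-image smoke test; the repo's own script handles the
# full Anim-Director prompting, this only checks the model runs locally.
pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers",
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    prompt="a cartoon fox walking through a sunny forest, storybook style",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("sd3_test.png")
```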
Run the following command to get the (T+I)2V results:
python code/video_gen.py
- The generated videos will be saved in code/result/video.
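code/video_gen.py drives the PIA checkpoints prepared above. As a stand-alone alternative for a quick (I+T)2V check, PIA is also exposed through diffusers' PIAPipeline; the sketch below uses the base model and adapter IDs from the diffusers example, which are not necessarily what video_gen.py uses.

```python
import torch
from diffusers import MotionAdapter, PIAPipeline
from diffusers.utils import export_to_gif, load_image

# Stand-alone PIA (Image + Text -> Video) smoke test via diffusers; the repo's
# code/video_gen.py uses the PIA codebase and checkpoints cloned above instead.
adapter = MotionAdapter.from_pretrained("openmmlab/PIA-condition-adapter")
pipe = PIAPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V6.0_B1_noVAE",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

image = load_image("sd3_test.png").resize((512, 512))  # any scene image works here
frames = pipe(
    image=image,
    prompt="the fox walks forward, gentle camera pan, cartoon style",
    num_inference_steps=25,
).frames[0]
export_to_gif(frames, "pia_test.gif")
```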
@article{li2024anim,
title={Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation},
author={Li, Yunxin and Shi, Haoyuan and Hu, Baotian and Wang, Longyue and Zhu, Jiashun and Xu, Jinyi and Zhao, Zhen and Zhang, Min},
journal={arXiv preprint arXiv:2408.09787},
year={2024}
}