The code of the SIGGRAPH Asia 2024 paper "Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation"


SIGGRAPH Asia 2024

Yunxin Li, Haoyuan Shi, Baotian Hu*, Longyue Wang*,
Jiashun Zhu, Jinyi Xu, Zhen Zhao, and Min Zhang

(* Corresponding Authors)

Harbin Institute of Technology, Shenzhen

🚀 Welcome to the repo of Anim-Director.

If you appreciate our project, please consider giving us a star ⭐ on GitHub to stay updated with the latest developments.

🎏 Abstract

TL;DR: Anim-Director is an autonomous animation-making agent in which an LMM interacts seamlessly with generative tools to create detailed animated videos from simple narratives.

Traditional animation generation methods depend on training generative models with human-labelled data, entailing a sophisticated multi-stage pipeline that demands substantial human effort and incurs high training costs. Due to limited prompting plans, these methods typically produce brief, information-poor, and context-incoherent animations. To overcome these limitations and automate the animation process, we pioneer the introduction of large multimodal models (LMMs) as the core processor to build an autonomous animation-making agent, named Anim-Director. This agent mainly harnesses the advanced understanding and reasoning capabilities of LMMs and generative AI tools to create animated videos from concise narratives or simple instructions. Specifically, it operates in three main stages: Firstly, the Anim-Director generates a coherent storyline from user inputs, followed by a detailed director’s script that encompasses settings of character profiles and interior/exterior descriptions, and context-coherent scene descriptions that include appearing characters, interiors or exteriors, and scene events. Secondly, we employ LMMs with the image generation tool to produce visual images of settings and scenes. These images are designed to maintain visual consistency across different scenes using a visual-language prompting method that combines scene descriptions and images of the appearing character and setting. Thirdly, scene images serve as the foundation for producing animated videos, with LMMs generating prompts to guide this process. The whole process is notably autonomous without manual intervention, as the LMMs interact seamlessly with generative tools to generate prompts, evaluate visual quality, and select the best one to optimize the final output. To assess the effectiveness of our framework, we collect varied short narratives and incorporate various image/video evaluation metrics, including visual consistency and video quality. The experimental results and case studies demonstrate the Anim-Director’s versatility and significant potential to streamline animation creation.

⚔️ Overview

Given a narrative, Anim-Director first polishes the narrative and generates the director’s scripts using GPT-4. GPT-4 interacts with the image generation tools to produce the scene images through Image + Text → Image. Subsequently, the Anim-Director produces videos based on the generated scene images and textual prompts, i.e., Image + Text → Video. To improve the quality of images and videos, we realize deep interaction between LMMs and generative tools, enabling GPT-4 to refine, evaluate, and select the best candidate through a self-reflection reasoning pathway.
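The loop above can be summarized in a short sketch. This is a minimal illustration of the three stages and the refine/evaluate/select interaction, not the repository's implementation; the objects and method names (lmm, t2i_tool, i2v_tool, write_script, critique, etc.) are assumptions made for exposition only.

# Illustrative sketch of the Anim-Director pipeline described above.
# The lmm/t2i_tool/i2v_tool interfaces are hypothetical stand-ins for GPT-4,
# Midjourney/SD3, and Pika/PIA; the real agent calls their APIs directly.

def self_reflective_generate(lmm, tool, prompt, context, n_candidates=3):
    """Generate several candidates, let the LMM critique them, keep the best."""
    candidates = [tool.generate(prompt, context) for _ in range(n_candidates)]
    scores = [lmm.critique(candidate, prompt, context) for candidate in candidates]
    best = max(range(n_candidates), key=lambda i: scores[i])
    return candidates[best]

def anim_director(lmm, t2i_tool, i2v_tool, narrative):
    # Stage 1: narrative -> director's script (characters, settings, scenes).
    script = lmm.write_script(narrative)

    videos = []
    for scene in script.scenes:
        # Stage 2: Image + Text -> Image, conditioned on character/setting images
        # so that appearance stays consistent across scenes.
        image_prompt = lmm.compose_image_prompt(scene, script.characters, script.settings)
        scene_image = self_reflective_generate(lmm, t2i_tool, image_prompt, scene)

        # Stage 3: Image + Text -> Video, guided by an LMM-written motion prompt.
        video_prompt = lmm.compose_video_prompt(scene, scene_image)
        videos.append(self_reflective_generate(lmm, i2v_tool, video_prompt, scene_image))
    return videos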

🌈 Visual Example

A visual example of Anim-Director.

🎨 Comparison

Comparison of DPT-T2I, CustomDiffusion, and Anim-Director (Ours) on ten consecutive scenes:
Scene #1: Tim stands with an earnest look, facing Tim's mother who is kneeling and focused on her gardening.
Scene #2: Tim is holding a red round ball with a smile under a tree, surrounded by vibrant green grass.
Scene #3: Tim sets the red round ball aside and looks onwards, the big oak's wide shadow covering him.
Scene #4: Tim stands amidst dazzling flowers and looks around, holding a green rectangular shovel.
Scene #5: Tim puts down the rectangular shovel and continues his search around the colorful flowers.
Scene #6: Tim walks from the colorful flowers to the old swing set.
Scene #7: Tim carefully navigates through thick grass around the faded old swing set.
Scene #8: Tim finds the blue toy car under leaves near the old swing set, his face lighting up with joy.
Scene #9: Tim is preparing to return to his yard with the blue toys he found.
Scene #10: Tim is immersed in play in the cluttered yard, the blue toy car is put into the yard.

🌰 More Examples

demo1_compressed.mp4
A compressed version of the generated Demo 1.
demo2_compressed.mp4
A compressed version of the generated Demo 2.
demo3_compressed.mp4
A compressed version of the generated Demo 3.

⚡️ Usage

Attention

Midjourney and Pika are paid services, while Stable Diffusion 3 and PIA are free. To reproduce the animation quality shown in our paper and demos, choose Midjourney for T2I and Pika for (T+I)2V.
Feel free to contact us for more details (including how to integrate Pika into our agent).

Setup

Prepare Environment

conda create -n AnimDirector python==3.10.11
conda activate AnimDirector
pip install -r requirements.txt

To use Stable Diffusion 3 for T2I, you need to upgrade your torch version along with all related packages.
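If you take the free SD3 route, T2I typically goes through the StableDiffusion3Pipeline in diffusers, which is what requires the newer torch. Below is a minimal sketch assuming the stabilityai/stable-diffusion-3-medium-diffusers checkpoint and a CUDA GPU; the repository's own image-generation script may be wired differently.

# Minimal SD3 text-to-image example via diffusers (needs a recent torch/diffusers).
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "Tim holds a red round ball with a smile under a tree, vibrant green grass around him",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("scene_sd3.png")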

Prepare Checkpoints For PIA

To use PIA for (T+I)2V, you need to prepare the following checkpoints.

  • Download Stable Diffusion v1-5:

    conda install git-lfs
    git lfs install
    git clone https://huggingface.co/runwayml/stable-diffusion-v1-5 models/StableDiffusion/

  • Download PIA:

    git clone https://huggingface.co/Leoxing/PIA models/PIA/

  • Download the personalized model:

    bash download_bashscripts/2-RcnzCartoon.sh
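As a quick sanity check that the clones above landed where expected, you can verify the model folders before running inference. This is a minimal sketch; the path of the personalized model depends on what 2-RcnzCartoon.sh downloads, so only the two cloned folders are checked here.

# Verify that the PIA-related checkpoints were downloaded.
from pathlib import Path

for repo in ("models/StableDiffusion", "models/PIA"):
    path = Path(repo)
    status = "found" if path.is_dir() and any(path.iterdir()) else "MISSING"
    print(f"{repo}: {status}")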

Prepare stable-diffusion-webui

To use MJ for T2I, you need to prepare stable-diffusion-webui following the instructions here.
After that, run:

bash code/StableDiffusion/webui.sh --nowebui
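Started with --nowebui, the webui exposes only its HTTP API (by default on port 7861; adjust if your setup differs). A minimal call to its txt2img endpoint, independent of the repository's own code, looks like this:

# Minimal request to the stable-diffusion-webui API launched with --nowebui.
import base64
import requests

resp = requests.post(
    "http://127.0.0.1:7861/sdapi/v1/txt2img",  # default --nowebui port; change if needed
    json={"prompt": "a cartoon boy in a sunny backyard", "steps": 20},
    timeout=300,
)
resp.raise_for_status()
with open("webui_test.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["images"][0]))  # images are base64-encoded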

Prepare Imgur API

Sign up for an Imgur account, then obtain your Imgur client_id, client_secret, access_token, and refresh_token following the instructions here.
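These credentials are used to host local images at a publicly reachable URL so they can be referenced by the generation pipeline. A minimal upload against the standard Imgur REST API, assuming the access_token obtained above, looks like this:

# Upload a local image to Imgur and print its public link.
import requests

def upload_to_imgur(image_path, access_token):
    with open(image_path, "rb") as f:
        resp = requests.post(
            "https://api.imgur.com/3/image",
            headers={"Authorization": f"Bearer {access_token}"},
            files={"image": f},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()["data"]["link"]

print(upload_to_imgur("scene_sd3.png", "YOUR_ACCESS_TOKEN"))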

Inference for script generation

Run the following command to generate the scripts:

python code/script_gen.py

  • The generated scripts will be saved as code/result/scripts.json.
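To inspect what the agent wrote before moving on to image generation, you can load the file directly. This is a minimal sketch; the exact schema is whatever script_gen.py emits.

# Peek at the generated director's scripts.
import json

with open("code/result/scripts.json", encoding="utf-8") as f:
    scripts = json.load(f)

print(type(scripts).__name__)
if isinstance(scripts, dict):
    print(list(scripts.keys()))
elif isinstance(scripts, list):
    print(f"{len(scripts)} entries")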

Inference for T2I with Midjourney

Since Midjourney does not provide an official API, we use the third-party API platform GoAPI to obtain an mj_api_key.
Run the following command to get the T2I results:

python code/image_gen_mj.py

  • The generated images will be saved in code/result/image/mj.

Inference for T2I with Stable Diffusion 3

Run the following command to get the T2I results:

python code/image_gen_pia.py

  • The generated images will be saved in code/result/image/sd3.

Inference for (T+I)2V with PIA

Run the following command to get the (T+I)2V results:

python code/video_gen.py

  • The generated videos will be saved in code/result/video.

Citation

@article{li2024anim,
  title={Anim-Director: A Large Multimodal Model Powered Agent for Controllable Animation Video Generation},
  author={Li, Yunxin and Shi, Haoyuan and Hu, Baotian and Wang, Longyue and Zhu, Jiashun and Xu, Jinyi and Zhao, Zhen and Zhang, Min},
  journal={arXiv preprint arXiv:2408.09787},
  year={2024}
}
