DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
Haoyu Zhao, Zhongang Qi, Cong Wang, Qingqing Zheng, Guansong Lu, Fei Chen, Hang Xu and Zuxuan Wu
video1.mov |
video2.mov |
Prompt: "The person in the image is wearing a traditional outfit with intricate embroidery and embellishments. The outfit features a blue and gold color scheme with detailed floral patterns. The background is dark and blurred, which helps to highlight the person and their attire. The lighting is soft and warm, creating a serene and elegant atmosphere." |
Prompt: "The person in the image is a woman with long, blonde hair styled in loose waves. She is wearing a form-fitting, sleeveless top with a high neckline and a small cutout at the chest. The top is beige and has a strap across her chest. She is also wearing a black belt with a pouch attached to it. Around her neck, she has a turquoise pendant necklace. The background appears to be a dimly lit, urban environment with a warm, golden glow." |
video3.mov |
video4.mov |
Prompt: "The person in the image is wearing a black, form-fitting one-piece outfit and a pair of VR goggles. They are walking down a busy street with numerous people and colorful neon signs in the background. The street appears to be a bustling urban area, possibly in a city known for its vibrant nightlife and entertainment. The lighting and signage suggest a lively atmosphere, typical of a cityscape at night." |
Prompt: "The image depicts a stylized, animated character standing amidst a chaotic and dynamic background. The character is dressed in a blue suit with a red cape, featuring a prominent "S" emblem on the chest. The suit has a belt with pouches and a utility belt. The character has spiky hair and is standing on a pile of debris and rubble, suggesting a scene of destruction or battle. The background is filled with glowing, fiery elements and a sense of motion, adding to the dramatic and intense atmosphere of the scene." |
TL;DR: DynamiCtrl is the first framework to introduce text into the human image animation task and to achieve pose control within the MM-DiT architecture.
CLICK for the full abstract
Human image animation has recently gained significant attention due to advancements in generative models. However, existing methods still face two major challenges: (1) architectural limitations: most models rely on U-Net, which underperforms compared to MM-DiT; and (2) the neglect of textual information, which can enhance controllability. In this work, we introduce DynamiCtrl, a novel framework that not only explores different pose-guided control structures in MM-DiT, but also reemphasizes the crucial role of text in this task. Specifically, we employ a Shared VAE encoder for both reference images and driving pose videos, eliminating the need for an additional pose encoder and simplifying the overall framework. To incorporate pose features into the full attention blocks, we propose Pose-adaptive Layer Norm (PadaLN), which utilizes adaptive layer normalization to encode sparse pose features. The encoded features are directly added to the visual input, preserving the spatiotemporal consistency of the backbone while effectively introducing pose control into MM-DiT. Furthermore, within the full attention mechanism, we align textual and visual features to enhance controllability. By leveraging text, we not only enable fine-grained control over the generated content, but also, for the first time, achieve simultaneous control over both background and motion. Experimental results verify the superiority of DynamiCtrl on benchmark datasets, demonstrating its strong identity preservation, heterogeneous character driving, background controllability, and high-quality synthesis. The source code will be made publicly available soon.
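To make the PadaLN idea above concrete, here is a minimal NumPy sketch of an adaptive layer norm that normalizes sparse pose features, modulates them with a scale and shift predicted from a conditioning embedding, and adds the result residually to the visual tokens. All shapes, projection weights, and the conditioning source are illustrative assumptions, not the released implementation.

```python
import numpy as np

def pose_adaptive_layer_norm(visual_tokens, pose_tokens, cond, W_scale, W_shift, eps=1e-5):
    """Illustrative sketch of Pose-adaptive Layer Norm (PadaLN).

    Normalizes the pose features, applies an adaptive scale/shift
    predicted from a conditioning embedding `cond`, and adds the result
    to the visual tokens, leaving their spatiotemporal layout untouched.
    Shapes and the linear projections here are assumptions.
    """
    # Standard layer norm over the channel dimension of the pose features.
    mean = pose_tokens.mean(axis=-1, keepdims=True)
    var = pose_tokens.var(axis=-1, keepdims=True)
    normed = (pose_tokens - mean) / np.sqrt(var + eps)

    # Adaptive modulation: scale and shift are linear functions of `cond`.
    scale = cond @ W_scale  # (batch, dim)
    shift = cond @ W_shift
    modulated = normed * (1.0 + scale[:, None, :]) + shift[:, None, :]

    # Residual addition keeps the visual token shape unchanged.
    return visual_tokens + modulated

rng = np.random.default_rng(0)
B, N, D = 2, 16, 8                      # batch, tokens, channels (toy sizes)
visual = rng.standard_normal((B, N, D))
pose = rng.standard_normal((B, N, D))
cond = rng.standard_normal((B, 4))
out = pose_adaptive_layer_norm(visual, pose, cond,
                               W_scale=rng.standard_normal((4, D)) * 0.01,
                               W_shift=rng.standard_normal((4, D)) * 0.01)
assert out.shape == visual.shape
```

Because the pose signal enters additively after normalization, the backbone's full attention blocks see inputs of the same shape with or without pose conditioning, which is what lets control be injected without restructuring MM-DiT.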
CLICK for previous TODOs
- Release the project page and demos
- Paper on Arxiv
- Release inference code
- Release model
- Release training code
Code coming soon!
- 2025.03.30 Project page and demos released!
- 2025.03.10 Project Online!
We first refocus on the role of text for this task and find that fine-grained textual information helps improve video quality. In particular, we can achieve background controllability using different prompts.
bg_video1.mp4 |
bg_video2.mp4 |
Case (a) | "->Green trees" | Case (b) | "->Beautiful ocean" |
bg_video5.mov |
Case (c) | "Across seven different backgrounds (long video, over 200 frames)" |
CLICK for the full prompts used in Case (c).
Scene 1: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a bustling futuristic city at night, with neon lights reflecting off the wet streets and flying cars zooming above.
Scene 2: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a vibrant market street in a Middle Eastern bazaar, filled with colorful fabrics, exotic spices, and merchants calling out to customers.
Scene 3: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a mystical ancient temple hidden deep in the jungle, covered in vines, with glowing runes carved into the stone walls.
Scene 4: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a sunny beach with golden sand, gentle ocean waves rolling onto the shore, and palm trees swaying in the breeze.
Scene 5: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows an abandoned industrial warehouse with broken windows, scattered debris, and rusted machinery covered in dust.
Scene 6: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a high-tech research lab with sleek metallic walls, glowing holographic screens, and robotic arms assembling futuristic devices.
Scene 7: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a serene snowy forest with tall pine trees, soft snowflakes falling gently, and a frozen river winding through the landscape.
re_video1.mov |
re_video2.mov |
re_video3.mov |
re_video4.mov |
Showcase: a 12-second long video, driven by the same audio.
person1_output.mp4 |
person2_output.mp4 |
The digital human identities were generated with vivo's BlueLM model (image generation).
Two steps to generate a digital human:
1. Prepare a human image and a guided pose video, and generate the video materials using our DynamiCtrl.
2. Use the output video and an audio file, and apply MuseTalk to generate the correct lip movements.
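The two-step pipeline above can be sketched as a small driver that assembles the command line for each stage. The script names and flags below are illustrative placeholders, not the actual CLIs shipped with DynamiCtrl or MuseTalk; substitute the real entry points once the inference code is released.

```python
def build_pipeline_cmds(image, pose_video, audio, workdir="outputs"):
    """Build the two commands for the digital-human pipeline.

    All script names and flags are hypothetical placeholders used
    only to show the data flow between the two stages.
    """
    animated = f"{workdir}/animated.mp4"
    final = f"{workdir}/talking.mp4"
    # Step 1: animate the reference image with the guided pose video.
    step1 = ["python", "inference_dynamictrl.py",
             "--image", image, "--pose_video", pose_video, "--output", animated]
    # Step 2: re-sync the lips to the audio track with MuseTalk.
    step2 = ["python", "inference_musetalk.py",
             "--video", animated, "--audio", audio, "--output", final]
    return step1, step2

if __name__ == "__main__":
    for cmd in build_pipeline_cmds("person.png", "pose.mp4", "speech.wav"):
        print(" ".join(cmd))  # e.g. run with subprocess.run(cmd, check=True)
```

The key point is the dependency: stage 2 consumes the video produced by stage 1, so the pose-driven animation must finish before lip-sync begins.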
@article{zhao2025dynamictrl,
title={DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation},
author={Zhao, Haoyu and Qi, Zhongang and Wang, Cong and Zheng, Qingqing and Lu, Guansong and Chen, Fei and Xu, Hang and Wu, Zuxuan},
year={2025},
journal={arXiv preprint arXiv:2503.21246},
}
This repository borrows heavily from CogVideoX. Thanks to the authors for sharing their code and models.
This is the codebase for our research work. We are still working hard to update this repo, and more details will be added in the coming days.