DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation
Haoyu Zhao, Zhongang Qi, Cong Wang, Qingqing Zheng, Guansong Lu, Fei Chen, Hang Xu and Zuxuan Wu
video1.mov |
video2.mov |
Prompt: "The person in the image is wearing a traditional outfit with intricate embroidery and embellishments. The outfit features a blue and gold color scheme with detailed floral patterns. The background is dark and blurred, which helps to highlight the person and their attire. The lighting is soft and warm, creating a serene and elegant atmosphere." |
Prompt: "The person in the image is a woman with long, blonde hair styled in loose waves. She is wearing a form-fitting, sleeveless top with a high neckline and a small cutout at the chest. The top is beige and has a strap across her chest. She is also wearing a black belt with a pouch attached to it. Around her neck, she has a turquoise pendant necklace. The background appears to be a dimly lit, urban environment with a warm, golden glow." |
video3.mov |
video4.mov |
Prompt: "The person in the image is wearing a black, form-fitting one-piece outfit and a pair of VR goggles. They are walking down a busy street with numerous people and colorful neon signs in the background. The street appears to be a bustling urban area, possibly in a city known for its vibrant nightlife and entertainment. The lighting and signage suggest a lively atmosphere, typical of a cityscape at night." |
Prompt: "The image depicts a stylized, animated character standing amidst a chaotic and dynamic background. The character is dressed in a blue suit with a red cape, featuring a prominent "S" emblem on the chest. The suit has a belt with pouches and a utility belt. The character has spiky hair and is standing on a pile of debris and rubble, suggesting a scene of destruction or battle. The background is filled with glowing, fiery elements and a sense of motion, adding to the dramatic and intense atmosphere of the scene." |
TL;DR: DynamiCtrl is the first framework to introduce text into the human image animation task and to achieve pose control within the MM-DiT architecture.
CLICK for the full abstract
Human image animation has recently gained significant attention due to advancements in generative models. However, existing methods still face two major challenges: (1) architectural limitations: most models rely on U-Net, which underperforms compared to MM-DiT; and (2) the neglect of textual information, which can enhance controllability. In this work, we introduce DynamiCtrl, a novel framework that not only explores different pose-guided control structures in MM-DiT, but also reemphasizes the crucial role of text in this task. Specifically, we employ a Shared VAE encoder for both reference images and driving pose videos, eliminating the need for an additional pose encoder and simplifying the overall framework. To incorporate pose features into the full attention blocks, we propose Pose-adaptive Layer Norm (PadaLN), which utilizes adaptive layer normalization to encode sparse pose features. The encoded features are directly added to the visual input, preserving the spatiotemporal consistency of the backbone while effectively introducing pose control into MM-DiT. Furthermore, within the full attention mechanism, we align textual and visual features to enhance controllability. By leveraging text, we not only enable fine-grained control over the generated content, but also, for the first time, achieve simultaneous control over both background and motion. Experimental results verify the superiority of DynamiCtrl on benchmark datasets, demonstrating its strong identity preservation, heterogeneous character driving, background controllability, and high-quality synthesis. The source code will be made publicly available soon.
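To make the PadaLN idea above concrete, here is a minimal NumPy sketch of an adaptive layer norm that normalizes sparse pose features, modulates them with a scale and shift predicted from a conditioning embedding, and adds the result residually to the visual tokens. All shapes, projection weights, and the conditioning source are illustrative assumptions, not the released implementation.

```python
import numpy as np

def pose_adaptive_layer_norm(visual_tokens, pose_tokens, cond, W_scale, W_shift, eps=1e-5):
    """Illustrative sketch of Pose-adaptive Layer Norm (PadaLN).

    Normalizes the pose features, applies an adaptive scale/shift
    predicted from a conditioning embedding `cond`, and adds the result
    to the visual tokens, leaving their spatiotemporal layout untouched.
    Shapes and the linear projections here are assumptions.
    """
    # Standard layer norm over the channel dimension of the pose features.
    mean = pose_tokens.mean(axis=-1, keepdims=True)
    var = pose_tokens.var(axis=-1, keepdims=True)
    normed = (pose_tokens - mean) / np.sqrt(var + eps)

    # Adaptive modulation: scale and shift are linear functions of `cond`.
    scale = cond @ W_scale  # (batch, dim)
    shift = cond @ W_shift
    modulated = normed * (1.0 + scale[:, None, :]) + shift[:, None, :]

    # Residual addition keeps the visual token shape unchanged.
    return visual_tokens + modulated

rng = np.random.default_rng(0)
B, N, D = 2, 16, 8                      # batch, tokens, channels (toy sizes)
visual = rng.standard_normal((B, N, D))
pose = rng.standard_normal((B, N, D))
cond = rng.standard_normal((B, 4))
out = pose_adaptive_layer_norm(visual, pose, cond,
                               W_scale=rng.standard_normal((4, D)) * 0.01,
                               W_shift=rng.standard_normal((4, D)) * 0.01)
assert out.shape == visual.shape
```

Because the pose signal enters additively after normalization, the backbone's full attention blocks see inputs of the same shape with or without pose conditioning, which is what lets control be injected without restructuring MM-DiT.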
CLICK for previous TODOs
- Release the project page and demos
- Paper on Arxiv
- Release inference code
- Release model
- Release training code
Code coming soon!
- 2025.03.30 Project page and demos released!
- 2025.03.10 Project Online!
We first refocus on the role of text for this task and find that fine-grained textual information helps improve video quality. In particular, we can achieve background controllability using different prompts.
bg_video1.mp4 |
bg_video2.mp4 |
Case (a) | "->Green trees" | Case (b) | "->Beautiful ocean" |
bg_video5.mov |
Case (c) | "Across seven different backgrounds (long video, over 200 frames)" |
CLICK for the full prompts used in Case (c).
Scene 1: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a bustling futuristic city at night, with neon lights reflecting off the wet streets and flying cars zooming above.
Scene 2: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a vibrant market street in a Middle Eastern bazaar, filled with colorful fabrics, exotic spices, and merchants calling out to customers.
Scene 3: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a mystical ancient temple hidden deep in the jungle, covered in vines, with glowing runes carved into the stone walls.
Scene 4: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a sunny beach with golden sand, gentle ocean waves rolling onto the shore, and palm trees swaying in the breeze.
Scene 5: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows an abandoned industrial warehouse with broken windows, scattered debris, and rusted machinery covered in dust.
Scene 6: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a high-tech research lab with sleek metallic walls, glowing holographic screens, and robotic arms assembling futuristic devices.
Scene 7: The person in the image is wearing a white, knee-length dress with short sleeves and a square neckline. The dress features lace detailing and a ruffled hem. The person is also wearing clear, open-toed sandals. The background shows a serene snowy forest with tall pine trees, soft snowflakes falling gently, and a frozen river winding through the landscape.
re_video1.mov |
re_video2.mov |
re_video3.mov |
re_video4.mov |
Showcase: a 12-second long video, driven by the same audio.
person1_output.mp4 |
person2_output.mp4 |
The digital human identities were generated with vivo's BlueLM model (image generation).
Two steps to generate a digital human:
1. Prepare a human image and a guided pose video, and generate the video materials using our DynamiCtrl.
2. Use the output video and an audio file, and apply MuseTalk to generate the correct lip movements.
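The two-step pipeline above can be sketched as a small driver that assembles the command line for each stage. The script names and flags below are illustrative placeholders, not the actual CLIs shipped with DynamiCtrl or MuseTalk; substitute the real entry points once the inference code is released.

```python
def build_pipeline_cmds(image, pose_video, audio, workdir="outputs"):
    """Build the two commands for the digital-human pipeline.

    All script names and flags are hypothetical placeholders used
    only to show the data flow between the two stages.
    """
    animated = f"{workdir}/animated.mp4"
    final = f"{workdir}/talking.mp4"
    # Step 1: animate the reference image with the guided pose video.
    step1 = ["python", "inference_dynamictrl.py",
             "--image", image, "--pose_video", pose_video, "--output", animated]
    # Step 2: re-sync the lips to the audio track with MuseTalk.
    step2 = ["python", "inference_musetalk.py",
             "--video", animated, "--audio", audio, "--output", final]
    return step1, step2

if __name__ == "__main__":
    for cmd in build_pipeline_cmds("person.png", "pose.mp4", "speech.wav"):
        print(" ".join(cmd))  # e.g. run with subprocess.run(cmd, check=True)
```

The key point is the dependency: stage 2 consumes the video produced by stage 1, so the pose-driven animation must finish before lip-sync begins.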
@article{zhao2025dynamictrl,
title={DynamiCtrl: Rethinking the Basic Structure and the Role of Text for High-quality Human Image Animation},
author={Zhao, Haoyu and Qi, Zhongang and Wang, Cong and Zheng, Qingqing and Lu, Guansong and Chen, Fei and Xu, Hang and Wu, Zuxuan},
year={2025},
journal={arXiv preprint arXiv:2503.21246},
}
This repository borrows heavily from CogVideoX. Thanks to the authors for sharing their code and models.
This is the codebase for our research work. We are still working hard to update this repo, and more details will be added in the coming days.