[WIP] Add Wan2.2 Animate Pipeline (Continuation of #12442 by tolgacangoz) #12526
base: main
Conversation
Commits:
- Introduced `WanAnimateTransformer3DModel` and `WanAnimatePipeline`; updated `get_transformer_config` to handle the new model type; modified `convert_transformer` to instantiate the correct transformer based on the model type; adjusted the main execution logic to accommodate the new Animate model type.
- …prove error handling for undefined parameters.
- …work for character animation and replacement: added the Wan 2.2 Animate 14B model to the documentation; introduced the Wan-Animate framework, detailing its capabilities for character animation and replacement; included example usage for the `WanAnimatePipeline` with preprocessing steps and guidance on input requirements.
- Introduced `WanAnimateGGUFSingleFileTests` to validate functionality; added dummy input generation for testing model behavior.
- Introduced `EncoderApp`, `Encoder`, `Direction`, `Synthesis`, and `Generator` classes for enhanced motion and appearance encoding; added `FaceEncoder`, `FaceBlock`, and `FaceAdapter` classes to integrate facial motion processing; updated `WanTimeTextImageMotionEmbedding` to use the new `Generator` for motion embedding; enhanced `WanAnimateTransformer3DModel` with an additional face adapter and pose patch embedding.
- Introduced a `pad_video` method to pad video frames to a target length; updated the video processing logic to use the new padding method for `pose_video`, `face_video`, and conditionally for `background_video` and `mask_video`; ensured compatibility with existing preprocessing steps for video inputs.
- …roved video processing: added optional parameters `conditioning_pixel_values`, `refer_pixel_values`, `refer_t_pixel_values`, `bg_pixel_values`, and `mask_pixel_values` to the `prepare_latents` method; updated the logic in the denoising loop to accommodate the new parameters, improving the flexibility of the pipeline.
- …eneration: updated the calculation of `num_latent_frames` and adjusted the shapes of latent tensors to accommodate changes in frame processing; enhanced the `get_i2v_mask` method for better mask generation, ensuring compatibility with the new tensor shapes; improved handling of pixel values and device management for better performance and clarity in the video processing pipeline.
- …and mask generation: consolidated the handling of `pose_latents_no_ref` to improve clarity and efficiency in latent tensor calculations; updated the `get_i2v_mask` method to accept a batch size and adjusted tensor shapes accordingly; enhanced the logic for mask pixel values in replacement mode, ensuring consistent processing across scenarios.
- …nced processing: introduced custom QR decomposition and fused leaky ReLU functions for improved tensor operations; implemented upsampling and downsampling functions with native support for better performance; added new classes `FusedLeakyReLU`, `Blur`, `ScaledLeakyReLU`, `EqualConv2d`, `EqualLinear`, and `RMSNorm`; refactored the `EncoderApp`, `Generator`, and `FaceBlock` classes to integrate the new functionality and improve modularity; updated the attention mechanism to use `dispatch_attention_fn` for more flexible processing.
- …annotations: removed over-abstracted functions such as `custom_qr`, `fused_leaky_relu`, and `make_kernel` to streamline the codebase; updated class constructors and method signatures to include type hints for better clarity and type checking; refactored the `FusedLeakyReLU`, `Blur`, `EqualConv2d`, and `EqualLinear` classes to enhance readability and maintainability; simplified the `Generator` and `Encoder` classes by removing redundant parameters and improving initialization logic.
- …rmer tests passing.
- …ifferent preprocessing logic than the conditioning videos.
Here are some results:
- Animation: wan_animate_video_20_step.mp4
- Replacement: wan_animate_video_replace_20_step.mp4
yiyixuxu left a comment
I left some comments.
It was a pleasure to review! Really awesome work! @dg845
    VAE scale factor. If `do_resize` is `True`, the image is automatically resized to multiples of this factor.
resample (`str`, *optional*, defaults to `lanczos`):
    Resampling filter to use when resizing the image.
resample (`str`, *optional*, defaults to `"lanczos"`):
can we add a new `WanVaeImageProcessor(VaeImageProcessor)` and put it in the wan folder, under a utils.py file I think?
(we are starting to see more and more custom preprocess methods, almost every model has one and they don't really get reused across models; I think moving forward let's just do this for all new models)
cc @DN6 here too, let me know what you think
I think the changes which make _resize_and_fill and _resize_and_crop respect self.config.resample should be added to the base VaeImageProcessor class; this could also be spun off into its own PR. I agree with moving the other (Wan Animate-specific logic) into its own class.
sounds good
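To make this concrete, here is a minimal sketch of what such a subclass might look like (the class name, file location, and the sizing helper are assumptions drawn from this thread, not the PR's actual code); the `_resize_and_fill`/`_resize_and_crop` resample fix would then go into the base `VaeImageProcessor` as discussed above:

from diffusers.image_processor import VaeImageProcessor


class WanVaeImageProcessor(VaeImageProcessor):
    """Hypothetical Wan-specific processor kept under the wan pipeline folder (e.g. utils.py)."""

    def calculate_default_height_width(self, height: int, width: int, max_area: int = 480 * 832, mod_value: int = 16):
        # Rescale so the frame area is close to `max_area` while keeping the aspect ratio,
        # then snap both sides down to a multiple of `mod_value` so the result satisfies
        # the VAE / patch-size constraints shown in the pipeline docstring example.
        aspect_ratio = height / width
        height = int((max_area * aspect_ratio) ** 0.5) // mod_value * mod_value
        width = int((max_area / aspect_ratio) ** 0.5) // mod_value * mod_value
        return height, width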
def __repr__(self):
    return (
        f"{self.__class__.__name__}({self.weight.shape[1]}, {self.weight.shape[0]},"
        f" kernel_size={self.weight.shape[2]}, stride={self.stride}, padding={self.padding})"
    )
Suggested change: remove this `__repr__` method.
I kind of like having this because if you print(model) then MotionConv2d will print out info similar to torch.nn.Conv2d (same with MotionLinear below). But I'm also fine with removing it.
ok if you feel strongly about keeping it!
def __repr__(self):
    return (
        f"{self.__class__.__name__}(in_features={self.weight.shape[1]}, out_features={self.weight.shape[0]},"
        f" bias={self.bias is not None})"
    )
Suggested change: remove this `__repr__` method.
hidden_states = hidden_states.flatten(2).transpose(1, 2)

# 3. Condition embeddings (time, text, image)
# timestep shape: batch_size, or batch_size, seq_len (wan 2.2 ti2v)
I think we can move one of these conditions for animate, no?
        self.gradient_checkpointing = False

    def motion_batch_encode(
can we move this to `forward`? All the layers (`motion_encoder` here) should be visible in `forward`.
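As a rough illustration of that suggestion (a heavily simplified stand-in with assumed signatures, not the real model's forward):

import torch
import torch.nn as nn


class WanAnimateTransformer3DModel(nn.Module):  # simplified stand-in for illustration only
    def __init__(self, motion_encoder: nn.Module):
        super().__init__()
        self.motion_encoder = motion_encoder

    def forward(self, hidden_states: torch.Tensor, face_pixel_values: torch.Tensor) -> torch.Tensor:
        # Encode the face frames inline rather than through a separate `motion_batch_encode`
        # helper, so `self.motion_encoder` is reached from forward (useful for hooks,
        # offloading, and for reading the forward pass top to bottom).
        motion_vec = self.motion_encoder(face_pixel_values)
        # ...`motion_vec` would then feed the face-adapter cross-attention blocks...
        return hidden_states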
hidden_states = block(hidden_states, encoder_hidden_states, timestep_proj, rotary_emb)

# Face adapter integration: apply after every 5th block (0, 5, 10, 15, ...)
if block_idx % self.config.inject_face_latents_blocks == 0:
ohh does it make sense to create a WanAnimateTransformerBlock that includes face_adapter? e.g.

# In __init__
self.blocks = nn.ModuleList(
    [
        WanAnimateTransformerBlock(
            ...,
            add_face_adapter=(i % inject_face_latents_blocks == 0),
        )
        for i in range(num_layers)
    ]
)
@yiyixuxu my understanding is that the main advantage of creating a WanAnimateTransformerBlock would be that if there is an associated face adapter block, it will always stay on the same device as the main transformer block (since we would put WanAnimateTransformerBlock in _no_split_modules). I think the code could look something like:

class WanAnimateTransformerBlock(nn.Module):
    def __init__(self, ..., add_face_adapter: bool = False):
        super().__init__()
        self.transformer_block = WanTransformerBlock(...)
        self.face_cross_attn = None
        if add_face_adapter:
            self.face_cross_attn = WanAnimateFaceBlockCrossAttention(...)

    def forward(self, hidden_states, motion_vec, ...):
        hidden_states = self.transformer_block(hidden_states, ...)
        if self.face_cross_attn is not None:
            face_adapter_output = self.face_cross_attn(hidden_states, motion_vec)
            hidden_states = face_adapter_output + hidden_states
        return hidden_states

I think it is clearest to reuse WanTransformerBlock here so that it is easy to tell which logic is inherited from the base Wan 2.1 model and which is Wan Animate-specific. WDYT?
ohh so we are moving toward a single-file format for models, so basically let's assume people who look through the wan animate transformer do not need to know anything about the wan transformer
I left comments here too: https://github.com/huggingface/diffusers/pull/12526/files#r2512642810
ok if you want to re-use the block! but let's not import from the wan transformer file
hidden_states_original_dtype = hidden_states.dtype
hidden_states = self.norm_out(hidden_states.float())
# Move the shift and scale tensors to the same device as hidden_states.
ohh let's try to fix it here.
I think all we need to do is to pack shift and scale into the same layer and add that layer to the `_no_split_modules` attribute.
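For illustration, one way to do that (module and attribute names are assumptions here, not the PR's code) is to keep the final norm, projection, and scale/shift table in a single wrapper module and list that wrapper in `_no_split_modules`, so the manual `.to(hidden_states.device)` moves become unnecessary:

import torch
import torch.nn as nn


class WanAnimateOutputHead(nn.Module):
    # Hypothetical wrapper: with the class name listed in the model's `_no_split_modules`,
    # a device_map can never place the scale/shift table on a different device than norm_out.
    def __init__(self, inner_dim: int, out_dim: int, eps: float = 1e-6):
        super().__init__()
        self.norm_out = nn.LayerNorm(inner_dim, eps=eps, elementwise_affine=False)
        self.proj_out = nn.Linear(inner_dim, out_dim)
        self.scale_shift_table = nn.Parameter(torch.randn(1, 2, inner_dim) / inner_dim**0.5)

    def forward(self, hidden_states: torch.Tensor, temb: torch.Tensor) -> torch.Tensor:
        original_dtype = hidden_states.dtype
        shift, scale = (self.scale_shift_table + temb.unsqueeze(1)).chunk(2, dim=1)
        hidden_states = self.norm_out(hidden_states.float()) * (1 + scale) + shift
        return self.proj_out(hidden_states.to(original_dtype))

The transformer would then carry something like `_no_split_modules = ["WanAnimateTransformerBlock", "WanAnimateOutputHead"]` (names assumed).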
>>> face_video = load_video("path/to/face_video.mp4")
>>> # Calculate optimal dimensions based on VAE constraints
>>> max_area = 480 * 832
if we make a VaeImageProcessor for Wan, this can be added there too
    latents = (latents - latents_mean) * latents_recip_std
    return latents

def destandardize_latents(self, latents: torch.Tensor) -> torch.Tensor:
can we put this back into the pipeline? it is always part of the pipeline
| f" {type(prev_segment_conditioning_frames)} and value is {prev_segment_conditioning_frames}" | ||
| ) | ||
|
|
||
| def standardize_latents(self, latents: torch.Tensor) -> torch.Tensor: |
can we put this back into the pipeline? Let's try not to have too many small methods, and the normalize/denormalize step is always part of the pipeline code.
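For reference, the inlined version could look roughly like the fragment below (assumed to sit inside the pipeline's methods, mirroring how other Wan pipelines read these constants from the VAE config; the attribute names are taken from those pipelines, not from this PR):

# Fragment: `self` and `latents` come from the surrounding pipeline code.
latents_mean = (
    torch.tensor(self.vae.config.latents_mean)
    .view(1, self.vae.config.z_dim, 1, 1, 1)
    .to(latents.device, latents.dtype)
)
latents_std = (
    torch.tensor(self.vae.config.latents_std)
    .view(1, self.vae.config.z_dim, 1, 1, 1)
    .to(latents.device, latents.dtype)
)

# In prepare_latents: standardize the VAE-encoded conditioning latents.
latents = (latents - latents_mean) / latents_std

# At the end of __call__, right before decoding: undo the standardization.
latents = latents * latents_std + latents_mean
video = self.vae.decode(latents, return_dict=False)[0]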
from ..modeling_outputs import Transformer2DModelOutput
from ..modeling_utils import ModelMixin
from ..normalization import FP32LayerNorm
from .transformer_wan import (
I missed a few imports here. I'm ok if you want to re-use `WanTransformerBlock`, but let's copy it here instead of importing from a different model file.
What does this PR do?
This PR is a continuation of #12442 by @tolgacangoz. It adds a pipeline for the Wan2.2-Animate-14B model (project page, paper, code, weights), a SOTA character animation and replacement video model.
Fixes #12441 (the original requesting issue).
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@yiyixuxu
@sayakpaul
@tolgacangoz