Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
```python
def patchify(x, patch_size):
    # YiYi TODO: refactor this
    from einops import rearrange
```
Hi, I think einops might work with torch.compile on newer versions of torch: https://github.com/arogozhnikov/einops/wiki/Using-torch.compile-with-einops
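For anyone who would rather drop the einops dependency entirely, the same kind of patchify can be written with native reshape/permute calls. This is only a sketch under assumed shapes - a `(batch, channels, height, width)` input and square patches - not the actual Wan implementation:

```python
import torch

def patchify(x: torch.Tensor, patch_size: int) -> torch.Tensor:
    # Assumed layout: x is (batch, channels, height, width),
    # with height and width divisible by patch_size.
    b, c, h, w = x.shape
    p = patch_size
    # Split each spatial dim into (num_patches, patch) pairs.
    x = x.reshape(b, c, h // p, p, w // p, p)
    # Move patch contents last: (b, h//p, w//p, c, p, p).
    x = x.permute(0, 2, 4, 1, 3, 5)
    # Flatten into a sequence of patch tokens: (b, num_patches, c * p * p).
    return x.reshape(b, (h // p) * (w // p), c * p * p)
```

With `patch_size=1` this reduces to a plain channels-last flatten, which is an easy sanity check against the einops version.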
@yiyixuxu thanks for releasing this so quickly! We are having some issues trying to get 5B I2V to work. As far as I understand, 5B covers both T2V and I2V. I tried a naive hack of copying the model.index.json from the 14B I2V, but it didn't quite help.

@okaris 5B I2V is not supported yet - will look to add it today

@yiyixuxu thanks for the quick reply. Happy to contribute if you can point me in the right direction.
Co-authored-by: bagheera <59658056+bghira@users.noreply.github.com>
a-r-r-o-w left a comment:
Thanks YiYi! Just nits. Will add docs in follow-up as discussed. I think we should remove the changes to the test files here (Wan2.2 dual transformer should be tested separately instead of combining with Wan2.1 tests, such that both are fully tested).
```python
CACHE_T = 2
```

```python
class AvgDown3D(nn.Module):
```
Maybe prefix these classes with Wan to follow the same naming convention.
```python
        2.8251,
        1.9160,
    ],
```

```python
is_residual: bool = False,
```
LGTM for now, but ideally we should make a separate AutoencoderKLWan2_2, because the structure and internal blocks are different, and try to standardize on single-file implementations per model type, similar to transformers. All the if-branching makes things a little harder to reverse engineer and raises the barrier to entry for someone wanting to study the implementation, IMO.
```python
shift_msa, scale_msa, gate_msa, c_shift_msa, c_scale_msa, c_gate_msa = (
    self.scale_shift_table + temb.float()
).chunk(6, dim=1)
```

```python
if temb.ndim == 4:
```
Same comment as for the VAE: ideally this should live in a separate transformer implementation, transformer_wan_2_2.py, if we want to adopt single-file properly.
sounds good
I think the VAE can have its own class - feel free to refactor it if you prefer!

The transformer change is really minimal, and we could refactor further so there is only a single code path, i.e. we just always expand the timestep inputs to be 2D. (I did not have time to test that, so I kept the if/else here.)
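To make the single-code-path idea concrete, here is a rough sketch. The shapes and the helper name are assumptions for illustration, not the actual Wan tensor layout: lifting the lower-rank timestep embedding to the per-frame 4D layout up front removes the branch around the chunking.

```python
import torch

def get_modulation(scale_shift_table: torch.Tensor, temb: torch.Tensor):
    # Hypothetical single-path version of the branched block above.
    # Assumed shapes:
    #   scale_shift_table: (1, 6, dim)
    #   temb: (batch, 6, dim), or (batch, frames, 6, dim) for per-frame modulation
    if temb.ndim == 3:
        # Lift to (batch, 1, 6, dim) so both cases share one code path.
        temb = temb.unsqueeze(1)
    # scale_shift_table broadcasts against (batch, frames, 6, dim).
    params = (scale_shift_table + temb.float()).chunk(6, dim=2)
    # Each chunk: (batch, frames, 1, dim) -> (batch, frames, dim).
    return [p.squeeze(2) for p in params]
```

The non-per-frame case then just comes out with `frames == 1`, which downstream code can broadcast over instead of special-casing.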
Hello @yiyixuxu, I generated a video (https://github.com/user-attachments/assets/ce6ebaf1-8478-4c29-9170-57d5ae854a7d) using the code below and noticed a slight grainy texture. Is this expected behavior, and does it match the results you observed during your testing?

```python
import torch

dtype = torch.bfloat16
model_id = "Wan-AI/Wan2.2-TI2V-5B-Diffusers"
height = 704
prompt = "Two anthropomorphic cats in comfy boxing gear and bright gloves fight intensely on a spotlighted stage."
output = pipe(
```
Hi, it does not seem to work. Although the checklist mentions multi-GPU support, I'm not sure if that applies to the diffusers version?
* support wan 2.2 i2v
* add t2v + vae2.2
* add conversion script for vae 2.2
* add
* add 5b t2v
* conversion script
* refactor out reearrange
* remove a copied from in skyreels
* Apply suggestions from code review

Co-authored-by: bagheera <59658056+bghira@users.noreply.github.com>

* Update src/diffusers/models/transformers/transformer_wan.py
* fix fast tests
* style

---------

Co-authored-by: bagheera <59658056+bghira@users.noreply.github.com>
Install from PR:
- TI2V (only text-to-video is supported for now, adding I2V soon)
- 14B T2V
- 14B I2V