WanImageToVideo, WanFirstLastFrameToVideo: Add vae_tile_size optional arg#10238

Open
alexheretic wants to merge 1 commit into Comfy-Org:master from alexheretic:wan-vae-tiled-encode

Conversation


@alexheretic alexheretic commented Oct 6, 2025

I experience slow VAE performance on my AMD RX 7900 GRE GPU and can usually improve it by opting for the tiled VAE nodes. However, WanImageToVideo performs VAE encoding internally and is currently not configurable, which leaves Wan workflows slow for me; see the benchmarks below.

I propose adding a vae_tile_size optional argument to WanImageToVideo (and similar nodes). It defaults to 0, meaning untiled, i.e. the existing behaviour. If set, the value is used as both the x and y tile size. This gives users like me a way to work around poor untiled Wan VAE encode performance.

As the default behaviour is unchanged, this should be backward compatible.
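The dispatch could look roughly like this. A minimal, self-contained sketch with a stand-in encoder: everything except the vae_tile_size name and its 0-means-untiled default is illustrative, and the real node would call ComfyUI's VAE encode paths rather than a callback.

```python
def tile_spans(size, tile, overlap):
    """Overlapping 1-D spans [start, end) covering the full axis."""
    if tile <= overlap:
        raise ValueError("tile size must exceed overlap")
    spans, start = [], 0
    while start < size:
        end = min(start + tile, size)
        spans.append((start, end))
        if end == size:
            break
        start += tile - overlap
    return spans

def encode_pixels(encode_fn, height, width, vae_tile_size=0, overlap=32):
    """Dispatch between untiled and tiled encoding.

    vae_tile_size == 0 keeps the old behaviour (one full-frame pass),
    matching the proposed backward-compatible default; any other value
    is used as both the x and y tile size.
    """
    if vae_tile_size == 0:
        return [encode_fn(0, height, 0, width)]
    return [
        encode_fn(y0, y1, x0, x1)
        for (y0, y1) in tile_spans(height, vae_tile_size, overlap)
        for (x0, x1) in tile_spans(width, vae_tile_size, overlap)
    ]
```

With 480x832 frames and 256x256 tiles this produces 2x4 = 8 overlapping tiles per frame instead of one full-frame pass.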

wan-vae-tile-size-screen

Alternatives

  • Add new "tiled" variant nodes for wan, e.g. TiledWanImageToVideo.
  • Automatically pick tiled encoding for certain GPUs, e.g. my GPU -> 256x256 tiled encoding.

Wan 2.1 VAE benchmarks (480x832 * 81 frames)

System info

MIOPEN_FIND_MODE=FAST

Total VRAM 16368 MB, total RAM 64217 MB
pytorch version: 2.9.0.dev20250827+rocm6.4
AMD arch: gfx1100
ROCm version: (6, 4)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7900 GRE : native
Using Flash Attention
Python version: 3.12.11 (main, Jun  4 2025, 10:32:37) [GCC 15.1.1 20250425]
ComfyUI version: 0.3.62
ComfyUI frontend version: 1.27.7
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.float16

VAE Encode

Benches show a significant improvement from tiled VAE encoding. On my setup 256x256 performed best: 589s -> 25s.

Untiled vs 512 vs 384 vs 256 vs 128

2 runs each.

untiled

Yes really 10 minutes 😞

[WanImageToVideo]: 608.79s
[WanImageToVideo]: 588.72s

tiled 512,512,32,256,8

[WanImageToVideo]: 41.86s
[WanImageToVideo]: 43.68s

tiled 384,384,32,256,8

[WanImageToVideo]: 30.41s
[WanImageToVideo]: 28.89s

tiled 256,256,32,256,8

[WanImageToVideo]: 25.00s
[WanImageToVideo]: 25.35s

tiled 128,128,32,256,8

[WanImageToVideo]: 45.57s
[WanImageToVideo]: 45.31s
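For intuition on why 128x128 is slower again despite smaller tiles: each tile redoes its overlap region, so smaller tiles re-encode proportionally more pixels. A rough sliding-window estimate (my own back-of-envelope arithmetic, not ComfyUI's exact tiler, which clamps edge tiles):

```python
import math

def tile_count(size, tile, overlap):
    """Number of overlapping tiles needed to cover one axis."""
    if size <= tile:
        return 1
    return 1 + math.ceil((size - tile) / (tile - overlap))

def work_ratio(h, w, tile, overlap):
    """Approximate ratio of pixels encoded (tiled) to pixels in the frame."""
    tiles = tile_count(h, tile, overlap) * tile_count(w, tile, overlap)
    return tiles * tile * tile / (h * w)
```

For 480x832 frames with a 32px overlap this gives roughly 1.3x the untiled pixel count at 256x256 tiles but about 1.8x at 128x128, consistent with 128 losing ground to 256 here.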

VAE Decode

Benches also show a significant improvement from tiled VAE decoding. On my setup 256x256 performed best.
Note: decoding is already a separate node, so no code changes are required; this is just related and perhaps interesting.

Untiled vs 512 vs 384 vs 256 vs 128

4 runs each (where possible).

untiled

OOM 😢

tiled 512,512,32,124,8

OOM 😢

tiled 384,384,32,124,8

[VAEDecodeTiled]: 73.94s
[VAEDecodeTiled]: 99.03s
[VAEDecodeTiled]: 62.71s
[VAEDecodeTiled]: 66.34s

tiled 256,256,32,124,8

[VAEDecodeTiled]: 60.79s
[VAEDecodeTiled]: 61.21s
[VAEDecodeTiled]: 54.53s
[VAEDecodeTiled]: 47.72s

tiled 128,128,32,124,8

[VAEDecodeTiled]: 72.18s
[VAEDecodeTiled]: 71.70s
[VAEDecodeTiled]: 71.47s
[VAEDecodeTiled]: 71.29s

@alexheretic (Contributor Author)

@comfyanonymous this provides quite a big improvement for me (589s -> 25s encode time), and perhaps for other AMD users too. wdyt?

@reneleonhardt

Looks like a good improvement. Could you add a test, in case the maintainers are open to merging it?

@comfy-pr-bot (Member)

Test Evidence Check

⚠️ Warning: Visual Documentation Missing

If this PR changes user-facing behavior, visual proof (screen recording or screenshot) is required. PRs without applicable visual documentation may not be reviewed until provided.

You can add it by:

  • GitHub: Drag & drop media directly into the PR description
  • YouTube: Include a link to a short demo

@alexheretic (Contributor Author)

I'd appreciate some guidance on this. Is this something that could get merged? Is one of the alternative approaches mentioned more attractive?

@rattus128 (Contributor)

Hey sorry about the long delays.

From a node design point of view this is probably pointing out a flaw in the way the tiler nodes are designed (being limited to separate encode and decode nodes). Could you cover everyone's tiling needs with a VAE-in, VAE-out tiler node where you set the tiler config once and all consumers of that VAE use that tile config? That could then feed this node, the usual Wan I2V node, and all the other video models that bring in a VAE.

Regarding the Wan VAE specifically, it's one of the gentler video VAEs for VRAM, so it's a weird one to have trouble with. The Wan VAE probably has significant scope for straight-up VRAM reduction using the recursive rolling upscale algorithm just implemented in the LTX VAE, which can reduce the VRAM consumption to as low as a 3-frame window (Wan is at 6 today); there's a sweet spot at 4, last time I did the math on it.
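That VAE-in, VAE-out idea could be sketched as a wrapper whose tile settings travel with the VAE object, so every downstream consumer picks them up. Names like TiledVAE and the encode_tiled signature are illustrative here, not a confirmed ComfyUI API:

```python
class TiledVAE:
    """Hypothetical tiler node output: wraps a VAE so every consumer
    encodes with the tile config set once, upstream of all I2V nodes."""

    def __init__(self, vae, tile_x, tile_y, overlap):
        self._vae = vae
        self._tile = (tile_x, tile_y, overlap)

    def encode(self, pixels):
        # Route all encodes through the wrapped VAE's tiled path.
        tile_x, tile_y, overlap = self._tile
        return self._vae.encode_tiled(
            pixels, tile_x=tile_x, tile_y=tile_y, overlap=overlap
        )

    def __getattr__(self, name):
        # Everything else (decode, dtype, devices, ...) falls through
        # to the wrapped VAE unchanged.
        return getattr(self._vae, name)
```

A downstream node would then call vae.encode(...) as usual and get tiling for free whenever the tiler node sits in between.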

@rattus128 rattus128 self-assigned this Feb 10, 2026
@alexheretic (Contributor Author)

Thanks for the feedback, that approach makes sense I think. I'll take a look.

