
speed up and reduce VRAM of QWEN VAE and WAN (less so)#12036

Merged
comfyanonymous merged 3 commits into Comfy-Org:master from rattus128:prs/qwen-vae-2d on Jan 24, 2026
Conversation

@rattus128
Contributor

This is a fast path for the QWEN VAE that takes advantage of the fact that a zero-padded 3D convolution applied to a single frame is mathematically equivalent to a 2D convolution with a slice of the weight.

I don't have a good explanation for why it is this much faster. I was expecting VRAM savings and hoping for some speed, but I got less VRAM and a ton of speedup. PyTorch works in mysterious ways.
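The equivalence itself is easy to check. The sketch below is my own illustration (not code from this PR), assuming a causally zero-padded 3x3x3 convolution: with a single input frame, every temporal tap of the kernel except the last multiplies zeros, so the 3D conv collapses to a 2D conv using only the last temporal slice of the weight.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Single-frame input: (N, C_in, T=1, H, W)
x = torch.randn(1, 4, 1, 8, 8)

# 3D conv weight with temporal kernel size 3: (C_out, C_in, kT, kH, kW)
w = torch.randn(6, 4, 3, 3, 3)
b = torch.randn(6)

# Causal temporal padding: kT - 1 = 2 zero frames in front of the clip,
# plus ordinary spatial padding of 1 on each side.
# F.pad pad order is (W_left, W_right, H_top, H_bottom, T_front, T_back).
x_pad = F.pad(x, (1, 1, 1, 1, 2, 0))
y_3d = F.conv3d(x_pad, w, b)

# With T=1 and only zeros in front, the first two temporal weight slices
# only ever multiply zeros, so slicing the LAST temporal weight slice and
# running a plain 2D conv gives the same result.
y_2d = F.conv2d(x[:, :, 0], w[:, :, -1], b, padding=1)

assert torch.allclose(y_3d[:, :, 0], y_2d, atol=1e-5)
```

The sliced 2D path never materializes the padded zero frames, which is where the VRAM (and apparently a lot of the compute) goes.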

Example test conditions:

RTX5090

QWEN VAE Encode -> Decode (3840x2160)

Before:

Requested to load WanVAE
loaded completely; 7324.16 MB usable, 242.03 MB loaded, full load: True
0 models unloaded.
Unloaded partially: 242.03 MB freed, 0.00 MB remains loaded, 22.78 MB buffer reserved, lowvram patches: 0
Prompt executed in 36.82 seconds

After:

Requested to load WanVAE
loaded completely; 7324.16 MB usable, 242.03 MB loaded, full load: True
0 models unloaded.
Unloaded partially: 242.03 MB freed, 0.00 MB remains loaded, 22.78 MB buffer reserved, lowvram patches: 0
Prompt executed in 2.42 seconds

WAN2.2 VAE Encode -> Decode (1920x1088x81f)

Before (31.8 GB):

Requested to load WanVAE
loaded completely; 7148.38 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 51.00 seconds

After (30.8 GB):

Requested to load WanVAE
loaded completely; 7148.38 MB usable, 242.03 MB loaded, full load: True
Prompt executed in 41.56 seconds

This WAN test point is probably thrashing the PyTorch and cuda-malloc-async allocators to stay under the VRAM ceiling, which is why it shows up more as a speedup than as a VRAM saving.

@rattus128 rattus128 marked this pull request as draft January 23, 2026 02:47
This works around PyTorch's missing ability to causal-pad as part of the
kernel and avoids massive weight duplications for padding.

This currently uses F.pad, which takes a full deep copy and is liable to
be the VRAM peak. Instead, kick spatial padding back to the op and
consolidate the temporal padding with the cat for the cache.

The WAN VAE is also the QWEN VAE, where it is used single-image. These
convolutions are, however, zero-padded 3D convolutions, which means the
VAE is actually just 2D down the last element of the conv weight in
the temporal dimension. Fast-path this to avoid adding zeros that
then just evaporate in convolution math but cost computation.
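The F.pad point in the commit messages above can be sketched as follows. This is my own illustration with made-up shapes, not the PR's actual code: spatially padding via F.pad materializes a full padded copy of the activation, while passing padding= to conv3d lets the op pad internally, and the temporal "padding" frames can ride along in the torch.cat that the frame cache already requires.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(1, 4, 5, 8, 8)       # current clip: (N, C, T, H, W)
cache = torch.randn(1, 4, 2, 8, 8)   # previous frames kept for the causal conv
w = torch.randn(6, 4, 3, 3, 3)

# Naive path: F.pad builds a full padded copy of the concatenated tensor,
# which is liable to be the VRAM peak.
x_pad = F.pad(torch.cat([cache, x], dim=2), (1, 1, 1, 1, 0, 0))
y_a = F.conv3d(x_pad, w)

# Leaner path: let conv3d apply the spatial padding itself and fold the
# temporal padding into the cat with the cache (no extra padded copy).
y_b = F.conv3d(torch.cat([cache, x], dim=2), w, padding=(0, 1, 1))

assert torch.allclose(y_a, y_b, atol=1e-5)
```

Both paths compute the same output; the second just skips one full-size intermediate tensor.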
@rattus128 rattus128 marked this pull request as ready for review January 23, 2026 07:41
@comfyanonymous comfyanonymous merged commit 4e6a1b6 into Comfy-Org:master Jan 24, 2026
12 checks passed
