
Propagate Qwen-Image attention mask to image generation.#11966

Merged
comfyanonymous merged 1 commit into Comfy-Org:master from maromri:feature/qwen-image/propagate-attention-mask on Jan 23, 2026

Conversation

@maromri
Contributor

@maromri maromri commented Jan 19, 2026

Why?

For optimization purposes, it is sometimes recommended to run model inference with inputs of a fixed size. This can be supported by padding the text tokens to a fixed length, with the padding information propagated to the diffusion model using an attention mask. This mechanism is implemented in ComfyUI for many models, but not for Qwen-Image.

What?

  1. Pass the attention mask from the text encoder to the model's forward function. This is similar to the implementation of many other models.
  2. Convert the mask from binary 1/0 format to 0/-∞ format, to be used as an additive mask where attention is calculated as softmax(scores + mask). This is similar to the implementation in hunyuan_video, LTX and Cosmos.
  3. Construct a joint attention mask covering both text and image tokens: the text portion is copied from the attention mask passed from the text encoder, while the image portion always attends (mask = 0). This is similar to the implementation in hunyuan_video.
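Steps 2 and 3 can be sketched roughly as follows. This is a hypothetical helper, not the actual ComfyUI code: the function name `build_joint_attention_mask`, the [text, image] token ordering, and the broadcastable output shape are all assumptions for illustration.

```python
import torch


def build_joint_attention_mask(text_mask: torch.Tensor,
                               num_image_tokens: int,
                               dtype: torch.dtype = torch.float32) -> torch.Tensor:
    """Hypothetical sketch of the joint additive mask described above.

    text_mask: binary tensor of shape [batch, text_len], 1 = real token,
    0 = padding. Returns an additive mask broadcastable over heads/queries.
    """
    batch, _ = text_mask.shape
    # Step 2: convert the binary 1/0 mask to 0/-inf form, so that
    # softmax(scores + mask) assigns zero weight to padding positions.
    additive_text = torch.where(
        text_mask.bool(),
        torch.zeros_like(text_mask, dtype=dtype),
        torch.full_like(text_mask, float("-inf"), dtype=dtype),
    )
    # Step 3: image tokens always attend, so their portion of the mask is 0.
    additive_image = torch.zeros(batch, num_image_tokens, dtype=dtype)
    # Joint mask over [text_tokens, image_tokens] (ordering assumed here).
    joint = torch.cat([additive_text, additive_image], dim=1)
    return joint.reshape(batch, 1, 1, -1)
```

A mask built this way can be passed as the additive `attn_mask` argument of an attention call such as `torch.nn.functional.scaled_dot_product_attention`.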

Running example

I used the template workflow for Qwen-Image-2512 (its lower part, with the 4-step lightning LoRA) and changed only the positive prompt and the seed:

  • prompt: "A wintery, cloudy, Christmassy, slightly snowy day in England"
  • seed: 89 (fixed)

To add text-token padding, I modified the initialization of Qwen25_7BVLITokenizer in text_encoders/qwen_image, changing the min_length argument from 1 to 256.
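The effect of a min_length setting can be sketched as fixed-length padding plus the matching binary mask. This is a simplified, hypothetical stand-in (`pad_tokens` is not a ComfyUI function; the pad token id is an assumption), not the tokenizer's actual implementation:

```python
def pad_tokens(tokens: list[int], pad_id: int, min_length: int) -> tuple[list[int], list[int]]:
    """Hypothetical sketch: pad token ids up to min_length and return the
    binary attention mask (1 = real token, 0 = padding) that the fix
    propagates to the diffusion model."""
    pad = max(0, min_length - len(tokens))
    padded = tokens + [pad_id] * pad
    mask = [1] * len(tokens) + [0] * pad
    return padded, mask
```

With min_length=1 almost no padding is added; with min_length=256 short prompts gain many padding tokens, which is exactly the case where the attention mask matters.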

The following grid presents the results with min_length=1 and min_length=256, without and with the proposed fix. With the existing implementation, which does not pass the attention mask, the model attends to the padding tokens and the content of the image shifts dramatically; with the fix, the change is much subtler.

|                  | without fix     | with fix     |
|------------------|-----------------|--------------|
| min_length = 1   | without_fix_1   | with_fix_1   |
| min_length = 256 | without_fix_256 | with_fix_256 |

@comfy-pr-bot
Member

Test Evidence Check

@comfyanonymous comfyanonymous merged commit d7f3241 into Comfy-Org:master Jan 23, 2026
12 checks passed