feat: Support SCAIL WanVideo model #12614

Open
kijai wants to merge 9 commits into Comfy-Org:master from kijai:scail

Conversation

kijai (Contributor) commented Feb 24, 2026

coderabbitai bot commented Feb 24, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds SCAIL support across the WAN video pipeline. Introduces SCAILWanModel (subclassing WanModel) with pose-aware patch embedding, pose-conditioned forward_orig/_forward, and pose-frequency concatenation in rope_encode. Adds WAN21_SCAIL model class and integrates pose_latents handling via extra_conds, extra_conds_shapes, and apply_model (including pose_start/pose_end gating and memory config updates). Extends detect_unet_config to detect SCAIL state dicts, registers WAN21_SCAIL in supported models, and adds a WanSCAILToVideo node to produce pose-conditioned video latents.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Title check — ✅ Passed: The title 'feat: Support SCAIL WanVideo model' accurately describes the main change: adding support for a new model variant called SCAIL for WanVideo.
  • Description check — ✅ Passed: The description is directly related to the changeset, providing context about SCAIL model support, compatible weights, preprocessor nodes, and examples.


@coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_extras/nodes_wan.py`:
- Around line 1492-1503: The code always creates and encodes a zero
reference_image, causing reference conditioning to be applied even when no image
was provided; change the logic so you only call comfy.utils.common_upscale,
vae.encode and node_helpers.conditioning_set_values when the original
reference_image input is non-None (i.e., remove or guard the torch.zeros((1,
height, width, 3)) fallback and the subsequent ref_latent path unless
reference_image was actually supplied), leaving an explicit “use blank
reference” toggle if you need to preserve the previous behavior; update
references to reference_image, ref_latent, vae.encode,
comfy.utils.common_upscale, and node_helpers.conditioning_set_values
accordingly.
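
The guard the review asks for can be sketched as follows. This is a hedged, stdlib-only illustration of the control flow, not the real node code: `encode` and `set_values` are stand-ins for `vae.encode` and `node_helpers.conditioning_set_values`, and the conditioning dict is simplified.

```python
# Sketch of the suggested guard: only build reference conditioning when a
# reference image was actually supplied, instead of unconditionally encoding
# a zero image. encode/set_values are hypothetical stand-ins for vae.encode
# and node_helpers.conditioning_set_values.
def apply_reference_conditioning(reference_image, encode, set_values, conds):
    if reference_image is None:
        return conds  # no blank-reference fallback: leave conditioning untouched
    ref_latent = encode(reference_image)
    return set_values(conds, {"reference_latents": [ref_latent]})
```

An explicit "use blank reference" toggle, as the comment suggests, would be a separate boolean input that re-enables the old zero-image path when set.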

In `@comfy/model_base.py`:
- Around line 1500-1503: The pose-window gating in apply_model uses tensor
truthiness in "if t >= self.model_sampling.percent_to_sigma(pose_start) or t <=
self.model_sampling.percent_to_sigma(pose_end):", which breaks for batched
tensors; convert t to a Python scalar before the comparisons (e.g. t_val =
t.item() or float(t)) or explicitly reduce the tensor (e.g. t_val =
t.mean().item()) and then compare against model_sampling.percent_to_sigma(...)
and call kwargs.pop("pose_latents", None) based on that scalar; update the
apply_model function to use t_val in the two comparisons to avoid tensor
truthiness errors.
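
The scalar-conversion idea can be sketched without torch. Here the batched timesteps arrive as a plain sequence of floats standing in for the sigma tensor, and `sigma_start`/`sigma_end` stand in for `model_sampling.percent_to_sigma(pose_start)` and `percent_to_sigma(pose_end)`; only the reduce-then-compare pattern is the point.

```python
# Sketch of the fix: reduce the (possibly batched) timesteps to one Python
# scalar before the window comparisons, so a multi-element tensor never hits
# ambiguous truthiness. With a real tensor this would be t.mean().item().
def gate_pose_latents(kwargs, t_values, sigma_start, sigma_end):
    t_val = sum(t_values) / len(t_values)  # scalar reduction of the batch
    if t_val >= sigma_start or t_val <= sigma_end:
        kwargs.pop("pose_latents", None)  # outside the pose window: drop it
    return kwargs
```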
- Around line 1510-1519: The reported shapes in extra_conds_shapes are incorrect
because extra_conds appends a 4‑channel mask to the reference latents; update
the computed shape for the 'reference_latent' key in extra_conds_shapes so the
channel count increases by 4 (i.e., take the existing channel value 16 and add 4
to it when building the list), leaving the sum(...) // 16 computation for the
spatial/flattened dimension unchanged; modify the logic in the
extra_conds_shapes function (reference_latent key) accordingly so VRAM estimates
include the added mask channels.
- Around line 1479-1491: The pose embedding channel count is mismatched: code in
process_latent_in/pose_video_latent concatenates a 4-channel mask producing 20
channels but SCAILWanModel.patch_embedding_pose expects in_dim=16; fix by either
(A) removing the mask concatenation for pose_latents in comfy/model_base.py
(stop creating pose_mask and stop torch.cat for pose_latents so
out['pose_latents'] remains 16-channel), or (B) update SCAILWanModel to accept
20 channels by changing the Conv3d initialization patch_embedding_pose's in_dim
from 16 to 20 (and any in_dim references/configs) so the layer shape matches the
20-channel pose_latents.
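
The channel arithmetic behind both of these findings (the extra_conds_shapes undercount and the patch-embedding mismatch) can be sketched in one place. Channel counts follow the review's description (16 latent channels plus a 4-channel mask); shapes are plain tuples standing in for tensor sizes.

```python
# Sketch of the channel bookkeeping: extra_conds concatenates a 4-channel mask
# along dim 1, so both the VRAM-estimate shape and the Conv3d in_dim of
# patch_embedding_pose must see 20 channels (or the mask concat must go away,
# option A in the review). mask_concat=False models option A.
LATENT_CHANNELS = 16
MASK_CHANNELS = 4

def pose_latents_shape(latent_shape, mask_concat=True):
    b, c, t, h, w = latent_shape
    extra = MASK_CHANNELS if mask_concat else 0
    return (b, c + extra, t, h, w)
```

Whichever option is chosen, the invariant is that `pose_latents_shape(...)[1]` equals the `in_dim` of the pose patch-embedding convolution.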

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11fefa5 and 15ddf5f.

📒 Files selected for processing (5)
  • comfy/ldm/wan/model.py
  • comfy/model_base.py
  • comfy/model_detection.py
  • comfy/supported_models.py
  • comfy_extras/nodes_wan.py

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_extras/nodes_wan.py`:
- Around line 1477-1478: The pose_start and pose_end inputs
(io.Float.Input("pose_start", ...) and io.Float.Input("pose_end", ...)) allow
values up to 10.0 but are later interpreted as schedule fractions passed to
percent_to_sigma(), which expects values in [0.0, 1.0]; update both input
definitions to use max=1.0 (keeping min=0.0 and step=0.01) so user-entered
values match percent_to_sigma()'s expected range and prevent out-of-range
behavior.
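
The constraint can be mirrored as a runtime check. This is a hypothetical sketch, not the node-input fix itself (which is simply `max=1.0` on both `io.Float.Input` definitions): `percent_to_sigma()` consumes schedule fractions, so both values must stay in [0.0, 1.0].

```python
# Sketch of the range the review requires: pose_start/pose_end are schedule
# fractions fed to percent_to_sigma(), which expects values in [0.0, 1.0].
def validate_pose_window(pose_start, pose_end):
    for name, value in (("pose_start", pose_start), ("pose_end", pose_end)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
    return pose_start, pose_end
```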

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 15ddf5f and 605b318.

📒 Files selected for processing (1)
  • comfy_extras/nodes_wan.py

@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/ldm/wan/model.py`:
- Around line 1726-1727: In _forward, don't blindly increment t_len when
"reference_latent" exists; instead check kwargs["reference_latent"] is not None
and determine the number of reference frames (e.g., if it's a tensor use
kwargs["reference_latent"].shape[1] or if list/tuple use len(...)) and add that
count to t_len so freqs aligns with actual reference frames; update the
conditional around t_len increment to handle None and multi-frame inputs (use
the symbols _forward, t_len, and reference_latent to locate and modify the
logic).
- Around line 1716-1729: In _forward, pad the pose_latents tensor to the model
patch size before using it in RoPE and before passing it to forward_orig: use
comfy.ldm.common_dit.pad_to_patch_size(pose_latents, self.patch_size) (similar
to how x and time_dim_concat are padded), assign the padded tensor back to
pose_latents (or a new var) and pass that padded tensor into
self.rope_encode(...) and into the call to forward_orig so the spatial token
counts match and Conv3d/patch_embedding_pose do not misalign on odd H/W.
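
The size math behind that padding can be sketched independently of torch. The real helper is `comfy.ldm.common_dit.pad_to_patch_size`; this only mirrors its arithmetic, right-padding each dim up to the next multiple of the patch size so Conv3d patching and the RoPE token count agree on odd H/W.

```python
# Sketch of pad_to_patch_size's size arithmetic (tuples stand in for tensors):
# each (T, H, W) dim is right-padded to the next multiple of the patch size.
def pad_amounts(dims, patch_size):
    return tuple((-d) % p for d, p in zip(dims, patch_size))

def padded_dims(dims, patch_size):
    return tuple(d + pad for d, pad in zip(dims, pad_amounts(dims, patch_size)))
```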

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 605b318 and 07f7365.

📒 Files selected for processing (1)
  • comfy/ldm/wan/model.py

@coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
comfy/ldm/wan/model.py (1)

1729-1730: ⚠️ Potential issue | 🟠 Major

Ensure t_len matches reference_latent frame count.

Line 1729 adds 1, but Line 1633 concatenates the full reference sequence. If reference_latent can contain multiple frames, t_len should add reference_latent.shape[2] to keep RoPE aligned. If it’s guaranteed to be a single frame, documenting that assumption would avoid future drift.

🔧 Suggested adjustment (if multi-frame is possible)
-        if "reference_latent" in kwargs:
-            t_len += 1
+        reference_latent = kwargs.get("reference_latent")
+        if reference_latent is not None:
+            t_len += reference_latent.shape[2]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/ldm/wan/model.py` around lines 1729 - 1730, The t_len increment when
handling reference_latent is incorrect for multi-frame sequences: instead of
adding a constant 1, update the logic in the block that checks
"reference_latent" to add the actual frame count (use reference_latent.shape[2])
so t_len reflects the full reference sequence length (or, if only a single frame
is truly guaranteed, add a comment/docstring near the reference_latent handling
and keep the +1 but document the single-frame assumption); adjust the code that
uses t_len (and any RoPE alignment logic) accordingly to ensure consistent
lengths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/ldm/wan/model.py`:
- Around line 1716-1731: The bug: when "reference_latent" is present in _forward
it’s concatenated to x without being padded, which can crash for odd H/W; fix by
calling comfy.ldm.common_dit.pad_to_patch_size on kwargs["reference_latent"]
(similar to x and pose_latents) before any concatenation and before the t_len +=
1 adjustment so the spatial dimensions match; update the handling in _forward to
replace kwargs["reference_latent"] with the padded tensor prior to
concatenation.

---

Duplicate comments:
In `@comfy/ldm/wan/model.py`:
- Around line 1729-1730: The t_len increment when handling reference_latent is
incorrect for multi-frame sequences: instead of adding a constant 1, update the
logic in the block that checks "reference_latent" to add the actual frame count
(use reference_latent.shape[2]) so t_len reflects the full reference sequence
length (or, if only a single frame is truly guaranteed, add a comment/docstring
near the reference_latent handling and keep the +1 but document the single-frame
assumption); adjust the code that uses t_len (and any RoPE alignment logic)
accordingly to ensure consistent lengths.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 07f7365 and f927863.

📒 Files selected for processing (2)
  • comfy/ldm/wan/model.py
  • comfy_extras/nodes_wan.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • comfy_extras/nodes_wan.py

@coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
comfy/ldm/wan/model.py (1)

1695-1713: Pose RoPE frequencies don't inherit temporal rope_options from transformer_options.

pose_transformer_options only sets shift_y/x and scale_y/x. If the caller supplies temporal rope options (scale_t, shift_t) in transformer_options — e.g. for native context-window scheduling — those options are applied to main-token frequencies but silently dropped for pose frequencies, creating a temporal position mismatch between main and pose tokens.

Consider merging temporal options from the outer transformer_options before overriding spatial ones:

♻️ Suggested fix
+        outer_rope_options = transformer_options.get("rope_options", {})
+        pose_rope_options = {
+            **{k: v for k, v in outer_rope_options.items() if k in ("scale_t", "shift_t")},
+            "shift_y": h_shift, "shift_x": 120.0 + w_shift,
+            "scale_y": h_scale, "scale_x": w_scale,
+        }
-        pose_transformer_options = {"rope_options": {"shift_y": h_shift, "shift_x": 120.0 + w_shift, "scale_y": h_scale, "scale_x": w_scale}}
+        pose_transformer_options = {"rope_options": pose_rope_options}
         pose_freqs = super().rope_encode(F_pose, H_pose, W_pose, t_start=t_start, device=device, dtype=dtype, transformer_options=pose_transformer_options)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/ldm/wan/model.py` around lines 1695 - 1713, The pose RoPE frequencies
in rope_encode ignore temporal rope options passed in transformer_options,
causing temporal misalignment; modify the construction of
pose_transformer_options inside rope_encode so it merges any existing temporal
entries (e.g., "scale_t", "shift_t" under "rope_options") from
transformer_options into the new pose_transformer_options before overriding
spatial keys ("scale_y", "scale_x", "shift_y", "shift_x"), ensuring pose_freqs
are computed with the same temporal rope settings as main_freqs; update
references to transformer_options and pose_transformer_options in rope_encode
accordingly.
comfy/model_base.py (1)

1476-1498: super().extra_conds() redundantly computes reference_latent before SCAIL overwrites it.

WAN21.extra_conds (the super call at line 1477) processes reference_latents[-1] and stores a 2D frame slice in out['reference_latent']. Lines 1479–1484 then immediately overwrite it with the 3D tensor including the mask. The work done by the parent is discarded every call.

This is not a bug — the overwrite is correct — but extracting only the needed parts from super() (cross-attn, clip_fea) and handling reference_latent inline would eliminate the wasted allocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/model_base.py` around lines 1476 - 1498, super().extra_conds currently
receives the full kwargs so the parent (WAN21.extra_conds) computes and stores
out['reference_latent'], which you then immediately overwrite; to avoid that
wasted allocation, call super().extra_conds with a copy of kwargs that has
'reference_latents' removed (e.g. kwargs_no_ref = dict(kwargs);
kwargs_no_ref.pop('reference_latents', None)) so the parent returns only the
other conds (cross-attn, clip_fea), then handle reference_latents inline in
WAN21.extra_conds by processing reference_latents via
self.process_latent_in(...) and creating out['reference_latent'] =
comfy.conds.CONDRegular(...) as you already do.
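
The kwargs-trimming step the prompt describes is a one-liner worth isolating. This is a sketch of just that step, with a shallow copy so the caller's kwargs are untouched; the inline `reference_latent` handling itself stays in `extra_conds` as the prompt describes.

```python
# Sketch of the suggested refactor: strip reference_latents from the kwargs
# forwarded to super().extra_conds(), so the parent never allocates an
# out['reference_latent'] that the SCAIL subclass immediately overwrites.
def kwargs_without_reference_latents(kwargs):
    trimmed = dict(kwargs)  # shallow copy; caller's dict is left intact
    trimmed.pop("reference_latents", None)
    return trimmed
```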
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/model_base.py`:
- Around line 1510-1514: The reported shape in extra_conds_shapes over-sums all
tensors in reference_latents while extra_conds only uses the last one; change
the computation to use only ref_latents[-1] (e.g., compute
math.prod(ref_latents[-1].size()) // 16) and keep the same list format, and also
guard for empty or None reference_latents the same way extra_conds does so you
don't raise on missing entries; update the logic in extra_conds_shapes to mirror
extra_conds's usage of reference_latents.
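
The corrected estimate math can be sketched as below. Plain tuples stand in for the tensor `.size()` results; the `// 16` divisor follows the existing computation in the source, and the empty/None guard mirrors `extra_conds`.

```python
import math

# Sketch of the fix: only the last entry of reference_latents is consumed by
# extra_conds, so the VRAM estimate should measure ref_latents[-1] alone
# rather than summing every tensor in the list.
def reference_latent_flat_size(reference_latents):
    if not reference_latents:
        return None  # mirror extra_conds: nothing supplied, nothing to report
    return math.prod(reference_latents[-1]) // 16
```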

---

Nitpick comments:
In `@comfy/ldm/wan/model.py`:
- Around line 1695-1713: The pose RoPE frequencies in rope_encode ignore
temporal rope options passed in transformer_options, causing temporal
misalignment; modify the construction of pose_transformer_options inside
rope_encode so it merges any existing temporal entries (e.g., "scale_t",
"shift_t" under "rope_options") from transformer_options into the new
pose_transformer_options before overriding spatial keys ("scale_y", "scale_x",
"shift_y", "shift_x"), ensuring pose_freqs are computed with the same temporal
rope settings as main_freqs; update references to transformer_options and
pose_transformer_options in rope_encode accordingly.

In `@comfy/model_base.py`:
- Around line 1476-1498: super().extra_conds currently receives the full kwargs
so the parent (WAN21.extra_conds) computes and stores out['reference_latent'],
which you then immediately overwrite; to avoid that wasted allocation, call
super().extra_conds with a copy of kwargs that has 'reference_latents' removed
(e.g. kwargs_no_ref = dict(kwargs); kwargs_no_ref.pop('reference_latents',
None)) so the parent returns only the other conds (cross-attn, clip_fea), then
handle reference_latents inline in WAN21.extra_conds by processing
reference_latents via self.process_latent_in(...) and creating
out['reference_latent'] = comfy.conds.CONDRegular(...) as you already do.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f927863 and e5bb1fe.

📒 Files selected for processing (2)
  • comfy/ldm/wan/model.py
  • comfy/model_base.py
