> **Note: Reviews paused.** It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings.
📝 Walkthrough

Adds SCAIL support across the WAN video pipeline. Introduces `SCAILWanModel` (subclassing `WanModel`) with pose-aware patch embedding, pose-conditioned `forward_orig`/`_forward`, and pose-frequency concatenation in `rope_encode`. Adds a `WAN21_SCAIL` model class and integrates `pose_latents` handling via `extra_conds`, `extra_conds_shapes`, and `apply_model` (including `pose_start`/`pose_end` gating and memory config updates). Extends `detect_unet_config` to detect SCAIL state dicts, registers `WAN21_SCAIL` in supported models, and adds a `WanSCAILToVideo` node to produce pose-conditioned video latents.

🚥 Pre-merge checks: ✅ 2 passed, ❌ 1 failed (1 warning)
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_extras/nodes_wan.py`:
- Around line 1492-1503: The code always creates and encodes a zero
reference_image, causing reference conditioning to be applied even when no image
was provided; change the logic so you only call comfy.utils.common_upscale,
vae.encode and node_helpers.conditioning_set_values when the original
reference_image input is non-None (i.e., remove or guard the torch.zeros((1,
height, width, 3)) fallback and the subsequent ref_latent path unless
reference_image was actually supplied), leaving an explicit “use blank
reference” toggle if you need to preserve the previous behavior; update
references to reference_image, ref_latent, vae.encode,
comfy.utils.common_upscale, and node_helpers.conditioning_set_values
accordingly.
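  A minimal sketch of the guard this comment describes; the helper name is hypothetical, and `upscale` / `vae_encode` are stand-ins for the node's actual `comfy.utils.common_upscale` and `vae.encode` calls:

  ```python
  def maybe_encode_reference(reference_image, upscale, vae_encode):
      # Only run the ref_latent path when a reference image was actually
      # supplied; a None input must not silently become a zero image.
      if reference_image is None:
          return None
      return vae_encode(upscale(reference_image))
  ```

  With this shape, the `conditioning_set_values` call is simply skipped whenever the helper returns `None`.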
In `@comfy/model_base.py`:
- Around line 1500-1503: The pose-window gating in apply_model uses tensor
truthiness in "if t >= self.model_sampling.percent_to_sigma(pose_start) or t <=
self.model_sampling.percent_to_sigma(pose_end):", which breaks for batched
tensors; convert t to a Python scalar before the comparisons (e.g. t_val =
t.item() or float(t)) or explicitly reduce the tensor (e.g. t_val =
t.mean().item()) and then compare against model_sampling.percent_to_sigma(...)
and call kwargs.pop("pose_latents", None) based on that scalar; update the
apply_model function to use t_val in the two comparisons to avoid tensor
truthiness errors.
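  The scalar-reduction fix can be illustrated in isolation. In this hypothetical sketch, `sigma_start` and `sigma_end` stand for the precomputed `model_sampling.percent_to_sigma(pose_start)` and `...(pose_end)` values, and `t` may be a plain float or any tensor-like object exposing `.mean()`:

  ```python
  def should_drop_pose(t, sigma_start, sigma_end):
      # t can arrive as a batched tensor in apply_model; reduce it to a
      # Python scalar first so the `or` comparison never evaluates
      # tensor truthiness (which raises for multi-element tensors).
      t_val = t if isinstance(t, float) else float(t.mean())
      return t_val >= sigma_start or t_val <= sigma_end
  ```

  `apply_model` would then call `kwargs.pop("pose_latents", None)` when this returns true.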
- Around line 1510-1519: The reported shapes in extra_conds_shapes are incorrect
because extra_conds appends a 4‑channel mask to the reference latents; update
the computed shape for the 'reference_latent' key in extra_conds_shapes so the
channel count increases by 4 (i.e., take the existing channel value 16 and add 4
to it when building the list), leaving the sum(...) // 16 computation for the
spatial/flattened dimension unchanged; modify the logic in the
extra_conds_shapes function (reference_latent key) accordingly so VRAM estimates
include the added mask channels.
- Around line 1479-1491: The pose embedding channel count is mismatched: code in
process_latent_in/pose_video_latent concatenates a 4-channel mask producing 20
channels but SCAILWanModel.patch_embedding_pose expects in_dim=16; fix by either
(A) removing the mask concatenation for pose_latents in comfy/model_base.py
(stop creating pose_mask and stop torch.cat for pose_latents so
out['pose_latents'] remains 16-channel), or (B) update SCAILWanModel to accept
20 channels by changing the Conv3d initialization patch_embedding_pose's in_dim
from 16 to 20 (and any in_dim references/configs) so the layer shape matches the
20-channel pose_latents.
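  Under either option the invariant is the same: the `Conv3d`'s input channel count must equal the channel count of `pose_latents` after the optional mask concat. A tiny arithmetic sketch (helper name hypothetical; channel numbers taken from the comment):

  ```python
  def pose_embed_in_dim(latent_channels=16, mask_channels=4, concat_mask=True):
      # Option (A) drops the mask concat, so in_dim stays at 16;
      # option (B) keeps it, so patch_embedding_pose must take 16 + 4 = 20.
      return latent_channels + (mask_channels if concat_mask else 0)
  ```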
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- comfy/ldm/wan/model.py
- comfy/model_base.py
- comfy/model_detection.py
- comfy/supported_models.py
- comfy_extras/nodes_wan.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_extras/nodes_wan.py`:
- Around line 1477-1478: The pose_start and pose_end inputs
(io.Float.Input("pose_start", ...) and io.Float.Input("pose_end", ...)) allow
values up to 10.0 but are later interpreted as schedule fractions passed to
percent_to_sigma(), which expects values in [0.0, 1.0]; update both input
definitions to use max=1.0 (keeping min=0.0 and step=0.01) so user-entered
values match percent_to_sigma()'s expected range and prevent out-of-range
behavior.
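  The real fix is simply `max=1.0` on both `io.Float.Input` definitions; as a hypothetical runtime illustration of the intended range:

  ```python
  def validate_schedule_fraction(name, value):
      # pose_start / pose_end are passed to percent_to_sigma(), which
      # expects schedule fractions in [0.0, 1.0]; clamping the inputs to
      # max=1.0 makes out-of-range values unrepresentable at the UI level.
      if not 0.0 <= value <= 1.0:
          raise ValueError(f"{name} must be in [0.0, 1.0], got {value}")
      return value
  ```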
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ldm/wan/model.py`:
- Around line 1726-1727: In _forward, don't blindly increment t_len when
"reference_latent" exists; instead check kwargs["reference_latent"] is not None
and determine the number of reference frames (e.g., if it's a tensor use
kwargs["reference_latent"].shape[1] or if list/tuple use len(...)) and add that
count to t_len so freqs aligns with actual reference frames; update the
conditional around t_len increment to handle None and multi-frame inputs (use
the symbols _forward, t_len, and reference_latent to locate and modify the
logic).
- Around line 1716-1729: In _forward, pad the pose_latents tensor to the model
patch size before using it in RoPE and before passing it to forward_orig: use
comfy.ldm.common_dit.pad_to_patch_size(pose_latents, self.patch_size) (similar
to how x and time_dim_concat are padded), assign the padded tensor back to
pose_latents (or a new var) and pass that padded tensor into
self.rope_encode(...) and into the call to forward_orig so the spatial token
counts match and Conv3d/patch_embedding_pose do not misalign on odd H/W.
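  The padding arithmetic behind `pad_to_patch_size` can be sketched without torch. Assuming WAN's usual latent `patch_size` of `(1, 2, 2)` (an assumption for this illustration), an odd H or W needs one row/column of right-padding; the helper name is hypothetical:

  ```python
  def pad_amounts(dims, patch_size):
      # Right-padding needed so each of (T, H, W) becomes a multiple of
      # the corresponding patch_size entry, mirroring what
      # comfy.ldm.common_dit.pad_to_patch_size does before patch embedding.
      return tuple((p - d % p) % p for d, p in zip(dims, patch_size))
  ```

  Applying the same padding to `pose_latents` as to `x` keeps the spatial token counts aligned between the two streams.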
Actionable comments posted: 1
♻️ Duplicate comments (1)
comfy/ldm/wan/model.py (1)
1729-1730: ⚠️ Potential issue | 🟠 Major

Ensure `t_len` matches the `reference_latent` frame count. Line 1729 adds 1, but line 1633 concatenates the full reference sequence. If `reference_latent` can contain multiple frames, `t_len` should add `reference_latent.shape[2]` to keep RoPE aligned. If it's guaranteed to be a single frame, documenting that assumption would avoid future drift.

🔧 Suggested adjustment (if multi-frame is possible):

```diff
-        if "reference_latent" in kwargs:
-            t_len += 1
+        reference_latent = kwargs.get("reference_latent")
+        if reference_latent is not None:
+            t_len += reference_latent.shape[2]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/ldm/wan/model.py` around lines 1729 - 1730, The t_len increment when handling reference_latent is incorrect for multi-frame sequences: instead of adding a constant 1, update the logic in the block that checks "reference_latent" to add the actual frame count (use reference_latent.shape[2]) so t_len reflects the full reference sequence length (or, if only a single frame is truly guaranteed, add a comment/docstring near the reference_latent handling and keep the +1 but document the single-frame assumption); adjust the code that uses t_len (and any RoPE alignment logic) accordingly to ensure consistent lengths.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/ldm/wan/model.py`:
- Around line 1716-1731: The bug: when "reference_latent" is present in _forward
it’s concatenated to x without being padded, which can crash for odd H/W; fix by
calling comfy.ldm.common_dit.pad_to_patch_size on kwargs["reference_latent"]
(similar to x and pose_latents) before any concatenation and before the t_len +=
1 adjustment so the spatial dimensions match; update the handling in _forward to
replace kwargs["reference_latent"] with the padded tensor prior to
concatenation.
---
Duplicate comments:
In `@comfy/ldm/wan/model.py`:
- Around line 1729-1730: The t_len increment when handling reference_latent is
incorrect for multi-frame sequences: instead of adding a constant 1, update the
logic in the block that checks "reference_latent" to add the actual frame count
(use reference_latent.shape[2]) so t_len reflects the full reference sequence
length (or, if only a single frame is truly guaranteed, add a comment/docstring
near the reference_latent handling and keep the +1 but document the single-frame
assumption); adjust the code that uses t_len (and any RoPE alignment logic)
accordingly to ensure consistent lengths.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- comfy/ldm/wan/model.py
- comfy_extras/nodes_wan.py
🚧 Files skipped from review as they are similar to previous changes (1)
- comfy_extras/nodes_wan.py
Actionable comments posted: 1
🧹 Nitpick comments (2)
comfy/ldm/wan/model.py (1)
1695-1713: Pose RoPE frequencies don't inherit temporal `rope_options` from `transformer_options`.

`pose_transformer_options` only sets `shift_y`/`shift_x` and `scale_y`/`scale_x`. If the caller supplies temporal rope options (`scale_t`, `shift_t`) in `transformer_options`, e.g. for native context-window scheduling, those options are applied to main-token frequencies but silently dropped for pose frequencies, creating a temporal position mismatch between main and pose tokens.

Consider merging temporal options from the outer `transformer_options` before overriding spatial ones:

♻️ Suggested fix

```diff
+        outer_rope_options = transformer_options.get("rope_options", {})
+        pose_rope_options = {
+            **{k: v for k, v in outer_rope_options.items() if k in ("scale_t", "shift_t")},
+            "shift_y": h_shift, "shift_x": 120.0 + w_shift,
+            "scale_y": h_scale, "scale_x": w_scale,
+        }
-        pose_transformer_options = {"rope_options": {"shift_y": h_shift, "shift_x": 120.0 + w_shift, "scale_y": h_scale, "scale_x": w_scale}}
+        pose_transformer_options = {"rope_options": pose_rope_options}
         pose_freqs = super().rope_encode(F_pose, H_pose, W_pose, t_start=t_start, device=device, dtype=dtype, transformer_options=pose_transformer_options)
```

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@comfy/ldm/wan/model.py` around lines 1695-1713, the pose RoPE frequencies in rope_encode ignore temporal rope options passed in transformer_options, causing temporal misalignment; modify the construction of pose_transformer_options inside rope_encode so it merges any existing temporal entries (e.g., "scale_t", "shift_t" under "rope_options") from transformer_options into the new pose_transformer_options before overriding spatial keys ("scale_y", "scale_x", "shift_y", "shift_x"), ensuring pose_freqs are computed with the same temporal rope settings as main_freqs; update references to transformer_options and pose_transformer_options in rope_encode accordingly.

comfy/model_base.py (1)

1476-1498: `super().extra_conds()` redundantly computes `reference_latent` before SCAIL overwrites it.

`WAN21.extra_conds` (the super call at line 1477) processes `reference_latents[-1]` and stores a 2D frame slice in `out['reference_latent']`. Lines 1479-1484 then immediately overwrite it with the 3D tensor including the mask. The work done by the parent is discarded every call.

This is not a bug (the overwrite is correct), but extracting only the needed parts from `super()` (cross-attn, clip_fea) and handling `reference_latent` inline would eliminate the wasted allocation.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@comfy/model_base.py` around lines 1476-1498, super().extra_conds currently receives the full kwargs so the parent (WAN21.extra_conds) computes and stores out['reference_latent'], which you then immediately overwrite; to avoid that wasted allocation, call super().extra_conds with a copy of kwargs that has 'reference_latents' removed (e.g. kwargs_no_ref = dict(kwargs); kwargs_no_ref.pop('reference_latents', None)) so the parent returns only the other conds (cross-attn, clip_fea), then handle reference_latents inline in WAN21.extra_conds by processing reference_latents via self.process_latent_in(...) and creating out['reference_latent'] = comfy.conds.CONDRegular(...) as you already do.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/model_base.py`:
- Around line 1510-1514: The reported shape in extra_conds_shapes over-sums all
tensors in reference_latents while extra_conds only uses the last one; change
the computation to use only ref_latents[-1] (e.g., compute
math.prod(ref_latents[-1].size()) // 16) and keep the same list format, and also
guard for empty or None reference_latents the same way extra_conds does so you
don't raise on missing entries; update the logic in extra_conds_shapes to mirror
extra_conds's usage of reference_latents.
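  The two `extra_conds_shapes` findings (count only `ref_latents[-1]`, and include the 4 mask channels) fold into one estimate. A plain-Python sketch using shape tuples in place of tensor `.size()` results (helper name hypothetical):

  ```python
  import math

  def reference_latent_shape_estimate(reference_latents):
      # Mirror extra_conds: only the last entry is used, and the 4-channel
      # mask concat raises the channel count from 16 to 20. Returns the
      # [channels, flattened] estimate used for VRAM sizing, or None when
      # no reference latents were supplied.
      if not reference_latents:
          return None
      last_shape = reference_latents[-1]  # e.g. (1, 16, 1, 60, 104)
      return [16 + 4, math.prod(last_shape) // 16]
  ```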
---
Nitpick comments:
In `@comfy/ldm/wan/model.py`:
- Around line 1695-1713: The pose RoPE frequencies in rope_encode ignore
temporal rope options passed in transformer_options, causing temporal
misalignment; modify the construction of pose_transformer_options inside
rope_encode so it merges any existing temporal entries (e.g., "scale_t",
"shift_t" under "rope_options") from transformer_options into the new
pose_transformer_options before overriding spatial keys ("scale_y", "scale_x",
"shift_y", "shift_x"), ensuring pose_freqs are computed with the same temporal
rope settings as main_freqs; update references to transformer_options and
pose_transformer_options in rope_encode accordingly.
In `@comfy/model_base.py`:
- Around line 1476-1498: super().extra_conds currently receives the full kwargs
so the parent (WAN21.extra_conds) computes and stores out['reference_latent'],
which you then immediately overwrite; to avoid that wasted allocation, call
super().extra_conds with a copy of kwargs that has 'reference_latents' removed
(e.g. kwargs_no_ref = dict(kwargs); kwargs_no_ref.pop('reference_latents',
None)) so the parent returns only the other conds (cross-attn, clip_fea), then
handle reference_latents inline in WAN21.extra_conds by processing
reference_latents via self.process_latent_in(...) and creating
out['reference_latent'] = comfy.conds.CONDRegular(...) as you already do.
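The kwargs-copy step from this prompt, as a standalone sketch (helper name hypothetical); the caller's dict is left unmodified:

```python
def strip_reference_latents(kwargs):
    # Copy kwargs and drop 'reference_latents' so the parent extra_conds
    # never builds an out['reference_latent'] that the subclass would
    # immediately overwrite with the mask-concatenated 3D tensor.
    kwargs_no_ref = dict(kwargs)
    kwargs_no_ref.pop("reference_latents", None)
    return kwargs_no_ref
```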
Adds model support for SCAIL: https://github.com/zai-org/SCAIL
Compatible weights:
https://huggingface.co/Kijai/WanVideo_comfy/blob/main/SCAIL/Wan21-14B-SCAIL-preview_comfy_bf16.safetensors
https://huggingface.co/Kijai/WanVideo_comfy_fp8_scaled/blob/main/SCAIL/Wan21-14B-SCAIL-preview_fp8_scaled_mixed.safetensors
Preprocessor nodes:
https://github.com/kijai/ComfyUI-SCAIL-Pose
Example scail_native_context_test.json:
scail_00002.mp4
Example pose input:
scail_pose_00002.mp4