
Native LongCat-Image implementation #12597

Open
Talmaj wants to merge 13 commits into Comfy-Org:master from Talmaj:LongCat-Image

Conversation


@Talmaj Talmaj commented Feb 23, 2026

LongCat-Image ComfyUI Port

Adds native support for
LongCat-Image,
a Flux-based text-to-image model by Meituan, to ComfyUI.

Architecture

LongCat-Image is a Flux variant with:

  • Transformer: MM-DiT + Single-DiT (19 double blocks, 38 single blocks)
  • Text encoder: Qwen2.5-VL-7B with character-level encoding for quoted text
  • VAE: AutoencoderKL with 2x2 latent packing
  • 3D MRoPE: Multimodal Rotary Position Embeddings with shifts
    (t=1.0, y=512.0, x=512.0)
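
As a rough illustration of those MRoPE shifts (`mrope_positions` is a hypothetical helper for this sketch, not ComfyUI's actual API), the per-token (t, y, x) position ids can be built like this:

```python
import torch

# Hypothetical sketch: build (t, y, x) position ids for 3D MRoPE, offsetting
# each axis by the shifts quoted above (t=1.0, y=512.0, x=512.0).
def mrope_positions(h, w, shifts=(1.0, 512.0, 512.0)):
    t = torch.full((h * w,), shifts[0])                    # constant time axis
    ys = torch.arange(h).repeat_interleave(w) + shifts[1]  # row index per token
    xs = torch.arange(w).repeat(h) + shifts[2]             # column index per token
    return torch.stack([t, ys, xs], dim=-1)                # (h*w, 3)

pos = mrope_positions(2, 3)
```

Each latent token gets one id per axis; the rotary frequencies are then applied per axis.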

Key implementation details

Pre-converted weights

The original LongCat-Image weights use HuggingFace Diffusers key names.
ComfyUI requires pre-converted weights in its native Flux format. Standalone
download_original.sh and convert_original_to_comfy.py scripts (hosted alongside the weights in the Comfy-Org HF repo)
perform the one-time conversion:

  • Key renaming (e.g. x_embedder → img_in, context_embedder → txt_in,
    transformer_blocks → double_blocks,
    single_transformer_blocks → single_blocks)
  • Q/K/V fusion into single QKV tensors
  • Scale/shift half-swap on norm_out.linear weights — HuggingFace's
    AdaLayerNormContinuous stores [scale | shift] while ComfyUI's LastLayer
    expects [shift | scale]

Pre-converting avoids runtime torch.cat allocations, enabling ComfyUI's
zero-copy-from-disk memory mapping where tensors are referenced directly from
the safetensors file without loading into RAM.
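
The three transforms can be sketched as follows (a minimal single-block illustration with hypothetical helper names and an illustrative subset of the rename table; the real key maps live in convert_original_to_comfy.py):

```python
import torch

def convert_block(sd):
    """Sketch of the one-time Diffusers -> ComfyUI conversion for one block."""
    out = {}
    # 1. Key renaming (illustrative subset of the full rename table)
    renames = {
        "x_embedder.weight": "img_in.weight",
        "context_embedder.weight": "txt_in.weight",
    }
    for k, v in sd.items():
        out[renames.get(k, k)] = v
    # 2. Q/K/V fusion into a single QKV tensor
    q = out.pop("transformer_blocks.0.attn.to_q.weight", None)
    if q is not None:
        k = out.pop("transformer_blocks.0.attn.to_k.weight")
        v = out.pop("transformer_blocks.0.attn.to_v.weight")
        out["double_blocks.0.img_attn.qkv.weight"] = torch.cat([q, k, v], dim=0)
    # 3. Half-swap: HF stores [scale | shift], ComfyUI's LastLayer wants [shift | scale]
    w = out.pop("norm_out.linear.weight", None)
    if w is not None:
        scale, shift = w.chunk(2, dim=0)
        out["final_layer.adaLN_modulation.1.weight"] = torch.cat([shift, scale], dim=0)
    return out
```

Because the fusion and swap happen offline, the saved safetensors file already has the shapes ComfyUI expects, which is what enables the zero-copy load described above.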

Model detection

Pre-converted weights go through the standard Flux detection path. LongCat-Image
is distinguished from other Flux variants by a heuristic at the end of Flux
detection: context_in_dim == 3584 (from txt_in.weight shape) and
vec_in_dim is None (no vector_in layer). This sets txt_ids_dims = [1, 2],
matching the LongCatImage config. The detection algorithm in
model_config_from_unet_config selects the most specific match (highest
unet_config key count) rather than first match, so LongCatImage (5 config
keys) always wins over FluxSchnell (2 config keys) regardless of list order.
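
A minimal sketch of that most-specific-match rule (simplified: the real matches() also validates required state-dict keys, and the dict contents here are illustrative):

```python
# Each candidate pins a set of unet_config keys; the most specific matching
# candidate wins, so ordering in the model list does not matter.
def pick_config(detected, candidates):
    best, best_specificity = None, -1
    for cand in candidates:
        if all(detected.get(k) == v for k, v in cand["unet_config"].items()):
            if len(cand["unet_config"]) > best_specificity:
                best, best_specificity = cand, len(cand["unet_config"])
    return best

flux_schnell = {"name": "FluxSchnell",
                "unet_config": {"image_model": "flux", "guidance_embed": False}}
longcat = {"name": "LongCatImage",
           "unet_config": {"image_model": "flux", "guidance_embed": False,
                           "context_in_dim": 3584, "vec_in_dim": None,
                           "txt_ids_dims": [1, 2]}}
detected = {"image_model": "flux", "guidance_embed": False,
            "context_in_dim": 3584, "vec_in_dim": None, "txt_ids_dims": [1, 2]}
```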

Tokenizer

LongCatImageBaseTokenizer applies the Qwen2.5 chat template, handles
character-level tokenization for quoted text via split_quotation, and pads to
a fixed max_length=512 to match the expected input format.
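
The quoted-text handling might look roughly like this (an assumed reimplementation for illustration; the actual split_quotation in comfy/text_encoders/longcat_image.py may differ in supported quote styles and edge cases):

```python
import re

def split_quotation(text):
    # Outside quotes: keep whole chunks for normal BPE tokenization.
    # Inside double quotes: emit one chunk per character, so each glyph the
    # model must render gets its own token.
    parts = []
    for i, seg in enumerate(re.split(r'"([^"]*)"', text)):
        if i % 2 == 0:
            if seg:
                parts.append(seg)
        else:
            parts.extend(seg)
    return parts

split_quotation('a sign saying "Hi!"')  # → ['a sign saying ', 'H', 'i', '!']
```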

CFG renormalization

The CFGRenormLongCatImage node applies per-patch CFG renormalization via
sampler_post_cfg_function. It reshapes to Flux's packed patch format, computes
per-patch L2 norms, clamps the scale factor, and reshapes back.
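
In spirit (a standalone sketch on (B, C, H, W) latents with hypothetical names; the actual node runs inside sampler_post_cfg_function and its clamping details may differ):

```python
import torch

def cfg_renorm(cfg_out, cond_out, ps=2, max_scale=1.0):
    b, c, h, w = cfg_out.shape
    assert h % ps == 0 and w % ps == 0, "spatial dims must be divisible by patch size"

    def pack(x):  # (B, C, H, W) -> (B, num_patches, C*ps*ps), Flux's packed layout
        x = x.reshape(b, c, h // ps, ps, w // ps, ps)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(b, -1, c * ps * ps)

    cfg_p, cond_p = pack(cfg_out), pack(cond_out)
    # Rescale each patch so the CFG output's L2 norm never exceeds the
    # conditional prediction's norm (scale factor clamped at max_scale).
    scale = torch.linalg.vector_norm(cond_p, dim=-1, keepdim=True) \
        / torch.linalg.vector_norm(cfg_p, dim=-1, keepdim=True).clamp(min=1e-8)
    cfg_p = cfg_p * scale.clamp(max=max_scale)
    # Unpack back to (B, C, H, W)
    x = cfg_p.reshape(b, h // ps, w // ps, c, ps, ps)
    return x.permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
```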

No guidance embedding

Unlike standard Flux, LongCat-Image does not use a guidance conditioning tensor.
LongCatImage.extra_conds removes the guidance key.

Known differences from HuggingFace

  • Pad token embeddings: HuggingFace runs the text encoder in bfloat16,
    which rounds pad token embeddings to identical vectors. ComfyUI runs in
    float32, preserving small differences from causal attention and RoPE — each
    pad position gets a slightly different vector. This does not affect output
    quality since the attention mask zeros out pad tokens during the diffusion
    transformer.
  • Sigma schedule: HuggingFace uses FlowMatchEulerDiscreteScheduler with
    dynamic shifting (use_dynamic_shifting=True), computing a mu parameter
    via linear interpolation based on image sequence length. ComfyUI's
    ModelSamplingFlux uses a static shift=1.15 with flux_time_shift,
    producing a slightly different sigma schedule for the same number of steps.
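
The two schedules can be compared with a small sketch (the interpolation constants are the ones diffusers' Flux pipeline is commonly configured with, and the static path treats shift=1.15 as the mu exponent; both are assumptions for illustration, not readings of the actual code):

```python
import math

def time_shift(mu, t, sigma=1.0):
    # Exponential time shift used by flow-matching schedulers
    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)

def dynamic_mu(image_seq_len, base_len=256, max_len=4096,
               base_shift=0.5, max_shift=1.15):
    # HF-style: mu interpolated linearly on the image sequence length
    m = (max_shift - base_shift) / (max_len - base_len)
    return image_seq_len * m + (base_shift - m * base_len)

steps = 4
ts = [1.0 - i / steps for i in range(steps)]             # 1.0, 0.75, 0.5, 0.25
dynamic = [time_shift(dynamic_mu(1024), t) for t in ts]  # 1024 tokens ~ a 512x512 image
static = [time_shift(1.15, t) for t in ts]               # static shift treated as mu
# Both schedules start at sigma=1.0 but diverge at intermediate steps.
```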

Files

File                                             Purpose
comfy/supported_models.py                        LongCatImage config and detection matching
comfy/model_base.py                              LongCatImage model class with MRoPE shifts
comfy/model_detection.py                         Flux detection path with LongCat-Image heuristic
comfy/text_encoders/longcat_image.py             Tokenizer and text encoder
comfy_extras/nodes_longcat_image.py              CLIPTextEncodeLongCatImage, CFGRenormLongCatImage nodes
user_templates/longcat_image_t2i.json            User template
blueprints/Text to Image (LongCat-Image).json    Blueprint
tests-unit/comfy_test/model_detection_test.py    Model detection unit tests


coderabbitai bot commented Feb 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Adds LongCat-Image support across the codebase: a new Text-to-Image blueprint JSON; new LongCatImage model class and Flux adjustments in comfy/model_base.py; UNet detection update in comfy/model_detection.py; CLIPType enum and LONGCAT_IMAGE text-encoder loading in comfy/sd.py; new supported model entry in comfy/supported_models.py; a LongCatImage tokenizer/TE implementation in comfy/text_encoders/longcat_image.py; two Comfy nodes and an extension in comfy_extras/nodes_longcat_image.py; CLIPLoader option and extras registration; and unit tests for model detection and conversion.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 14.29%, below the
    required 80.00% threshold. Resolution: write docstrings for the functions
    missing them.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: the title 'Native LongCat-Image implementation'
    directly and concisely describes the main change, adding native ComfyUI
    support for the LongCat-Image model.
  • Description check ✅ Passed: the description is comprehensive and directly
    related to the changeset, covering architecture, implementation details,
    known differences, and file purposes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (1)
tests-unit/comfy_test/model_detection_test.py (1)

73-73: Unused variable original_models.

original_models is assigned but never referenced. Likely leftover from a manual save/restore approach that was replaced by patch.object.

🧹 Remove unused variable
         sd = _make_longcat_diffusers_sd()
         unet_config = detect_unet_config(sd, "")
-        original_models = comfy.supported_models.models
 
         longcat_cls = comfy.supported_models.LongCatImage
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/comfy_test/model_detection_test.py` at line 73, Remove the unused
local variable original_models assigned from comfy.supported_models.models in
the test; since patch.object is handling temporary replacement/restore, delete
the assignment to original_models to eliminate the dead code and keep the test
clean (look for the assignment to original_models and the reference to
comfy.supported_models.models in model_detection_test.py).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy_extras/nodes_longcat_image.py`:
- Around line 61-82: The code assumes H and W are divisible by ps=2 before
reshaping (see noise.reshape(...), cond_packed reshape and renormed), which will
raise on odd spatial sizes; add a defensive check right after B, C, H, W =
denoised.shape to verify H % ps == 0 and W % ps == 0 and raise a clear
ValueError including the offending H/W and ps (or alternatively apply symmetric
padding to x/denoised/cond_denoised to make them divisible by ps before the
pack/unpack operations), then proceed with the existing noise/cond packing,
scaling and renorming logic unchanged.

In `@comfy/text_encoders/longcat_image.py`:
- Around line 70-95: In tokenize_with_weights: avoid letting
base_tok.tokenize_with_weights produce 512-length padding before you add the
LongCat template; call base_tok.tokenize_with_weights with padding disabled
(e.g., disable_padding=True or equivalent kwarg) so prompt_pairs is produced
without pre-padding, then build prefix_pairs, prompt_pairs and suffix_pairs into
combined, and only after combining perform truncation/padding to model length
(use your tokenizer's pad/truncate utility or call the shared super method once
on the final combined token list) so prefix_ids/suffix_ids are not separated by
mid-prompt padding; refer to tokenize_with_weights,
base_tok.tokenize_with_weights, prefix_ids, suffix_ids, prompt_pairs and
combined to locate where to change.
- Around line 102-144: The slice logic in encode_token_weights can use
template_end == -1 (no <|im_start|> found) which makes out = out[:, -1:] (last
token); change encode_token_weights to treat a missing marker by setting
template_end = 0 before slicing (or otherwise avoid negative slice) and only
apply the "+3 newline adjustment" when a real marker was detected (i.e., only
run the tok_pairs[template_end + 1]/[+2] checks if template_end was set from the
loop). Update references in encode_token_weights (template_end, tok_pairs, out,
extra, suffix_start) so the slicing/out = out[:, template_end:] and subsequent
attention_mask adjustments are guarded by the marker presence to avoid
accidentally keeping only the last token.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dba2766 and bfd302f.

📒 Files selected for processing (9)
  • blueprints/Text to Image (LongCat-Image).json
  • comfy/model_base.py
  • comfy/model_detection.py
  • comfy/sd.py
  • comfy/supported_models.py
  • comfy/text_encoders/longcat_image.py
  • comfy_extras/nodes_longcat_image.py
  • nodes.py
  • tests-unit/comfy_test/model_detection_test.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (3)
tests-unit/comfy_test/model_detection_test.py (2)

84-89: Consider isinstance over type(...).__name__ for class identity assertions.

String-based class-name checks will silently pass if the class is renamed or imported under an alias.

♻️ Proposed fix
-                assert type(result).__name__ == "LongCatImage", (
-                    f"Expected LongCatImage with order {label}, got {type(result).__name__}"
-                )
+                assert isinstance(result, comfy.supported_models.LongCatImage), (
+                    f"Expected LongCatImage with order {label}, got {type(result)}"
+                )

And at line 101:

-        assert type(model_config).__name__ == "LongCatImage"
+        assert isinstance(model_config, comfy.supported_models.LongCatImage)

Also applies to: 99-101

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/comfy_test/model_detection_test.py` around lines 84 - 89, Replace
fragile string-based class checks with real type checks: instead of asserting
type(result).__name__ == "LongCatImage" use an isinstance assertion against the
actual class (e.g., assert isinstance(result, LongCatImage)). Update both
occurrences (the assertion around model_config_from_unet_config and the similar
check at lines ~99-101) and ensure LongCatImage is imported or referenced from
the correct module so the isinstance call resolves.

103-113: Test only verifies key presence/absence, skipping the two non-trivial transforms.

The PR description calls out Q/K/V fusion and a scale/shift half-swap in process_unet_state_dict as the critical parts of the conversion. Neither is exercised here — verifying that, say, transformer_blocks.0.attn.to_q.weight + to_k.weight + to_v.weight are fused into a single double_blocks.0.img_attn.qkv.weight with the right shape, and that the norm_out scale/shift halves are swapped, would meaningfully increase confidence in the conversion correctness.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/comfy_test/model_detection_test.py` around lines 103 - 113, The
test test_longcat_process_unet_state_dict_converts_keys only checks
presence/absence of keys but does not validate the two non-trivial transforms in
process_unet_state_dict: Q/K/V fusion and the norm_out scale/shift half-swap.
Update the test to build source weights for
transformer_blocks.0.attn.to_q/to_k/to_v and the norm_out affine, run converted
= model_config.process_unet_state_dict(...), then assert the fused tensor exists
at double_blocks.0.img_attn.qkv.weight with the expected concatenated shape and
contents (verify slices match original to_q/k/v), and assert norm_out parameters
have their scale/shift halves swapped compared to the input; reference
test_longcat_process_unet_state_dict_converts_keys, process_unet_state_dict,
transformer_blocks.*, attn.to_q/to_k/to_v, double_blocks.0.img_attn.qkv.weight,
and norm_out in your assertions.
comfy_extras/nodes_longcat_image.py (1)

82-83: Replace deprecated torch.norm with torch.linalg.vector_norm.

torch.norm is deprecated and may be removed in a future PyTorch release; its documentation and behavior may be incorrect, and it is no longer actively maintained. The recommended replacement for vector norms is torch.linalg.vector_norm().

♻️ Proposed fix
-            noise_norm = torch.norm(noise_packed, dim=-1, keepdim=True)
-            cond_norm = torch.norm(cond_packed, dim=-1, keepdim=True)
+            noise_norm = torch.linalg.vector_norm(noise_packed, dim=-1, keepdim=True)
+            cond_norm = torch.linalg.vector_norm(cond_packed, dim=-1, keepdim=True)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy_extras/nodes_longcat_image.py` around lines 82 - 83, Replace deprecated
torch.norm calls computing per-vector norms for noise_packed and cond_packed
with torch.linalg.vector_norm; specifically update the expressions that assign
noise_norm and cond_norm (currently using torch.norm(noise_packed, dim=-1,
keepdim=True) and torch.norm(cond_packed, dim=-1, keepdim=True)) to use
torch.linalg.vector_norm(noise_packed, dim=-1, keepdim=True) and
torch.linalg.vector_norm(cond_packed, dim=-1, keepdim=True) respectively so
behavior and signature remain the same but use the supported API.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/model_detection.py`:
- Around line 285-313: The current three-key detection in model_detection.py
(keys '{}x_embedder.weight', '{}transformer_blocks.0.attn.to_q.weight',
'{}single_transformer_blocks.0.attn.to_q.weight' using key_prefix) misidentifies
vanilla Flux diffusers as LongCat-Image; tighten the condition by requiring a
LongCat-specific key (use the existing ctx_key =
'{}context_embedder.weight'.format(key_prefix)) to be present as an additional
positive constraint before building dit_config, so only state dicts that include
context_embedder.weight are considered LongCat-Image (leave count_blocks usage
and subsequent field population unchanged).


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bfd302f and 81a3792.

📒 Files selected for processing (9)
  • blueprints/Text to Image (LongCat-Image).json
  • comfy/model_base.py
  • comfy/model_detection.py
  • comfy/sd.py
  • comfy/supported_models.py
  • comfy/text_encoders/longcat_image.py
  • comfy_extras/nodes_longcat_image.py
  • nodes.py
  • tests-unit/comfy_test/model_detection_test.py
✅ Files skipped from review due to trivial changes (1)
  • blueprints/Text to Image (LongCat-Image).json
🚧 Files skipped from review as they are similar to previous changes (1)
  • nodes.py

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@comfy/model_detection.py`:
- Around line 827-838: The selection currently adds
len(model_config.required_keys) into specificity even when no state_dict was
provided, biasing picks; modify model_config_from_unet_config so that
required_keys are only counted when a state_dict is present/used (i.e., only add
len(model_config.required_keys) to specificity if state_dict is not None and
thus those keys could be validated), preserving the previous first-match
ordering when state_dict is None; refer to model_config_from_unet_config,
best_specificity, model_config.required_keys and the matches(...) call to
implement this conditional weighting.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 81a3792 and 449282d.

📒 Files selected for processing (2)
  • comfy/model_detection.py
  • comfy/supported_models.py

Comment on lines 827 to 838
def model_config_from_unet_config(unet_config, state_dict=None):
    best = None
    best_specificity = -1
    for model_config in comfy.supported_models.models:
        if model_config.matches(unet_config, state_dict):
            specificity = len(model_config.unet_config) + len(model_config.required_keys)
            if specificity > best_specificity:
                best = model_config
                best_specificity = specificity

    if best is not None:
        return best(unet_config)

⚠️ Potential issue | 🟡 Minor

Avoid weighting required_keys when state_dict is None.
Line 832: specificity now includes required_keys even when they weren’t validated (e.g., model_config_from_diffusers_unet passes no state_dict). This can bias selection toward configs with larger required_keys sets and change behavior vs. the previous “first match” ordering.

Proposed fix
-            specificity = len(model_config.unet_config) + len(model_config.required_keys)
+            specificity = len(model_config.unet_config) + (len(model_config.required_keys) if state_dict is not None else 0)

            prefix = "double_blocks.{}.".format(idx)

            if rest.startswith("norm1.linear."):
                out_sd[prefix + "img_mod.lin." + rest.split(".")[-1]] = v
Member

Something we can do is repackage/convert the model weights into a format that is easier to load in ComfyUI. Instead of having all this logic here, it can be a conversion.py file in the HF repo where we host our converted weights.

Author

Is there some standard for repackaging that I can follow?

Member

The standard is to make something that doesn't need to be converted by comfyui internally to work.


return dit_config

if '{}x_embedder.weight'.format(key_prefix) in state_dict_keys and '{}transformer_blocks.0.attn.to_q.weight'.format(key_prefix) in state_dict_keys and '{}single_transformer_blocks.0.attn.to_q.weight'.format(key_prefix) in state_dict_keys and '{}context_embedder.weight'.format(key_prefix) in state_dict_keys and '{}time_embed.timestep_embedder.linear_1.weight'.format(key_prefix) in state_dict_keys: #LongCat-Image (diffusers format, Flux variant)
Member

Repackaging the model file ourselves will also make this logic easier.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
blueprints/Text to Image (LongCat-Image).json (1)

1-1: Expose a negative prompt input in the subgraph.

The second CLIPTextEncodeLongCatImage node isn’t wired to any input, so users can’t set a negative prompt from the blueprint. Consider adding a subgraph input and link for negative text (or remove the node if it’s intentionally fixed).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `blueprints/Text to Image (LongCat-Image).json` at line 1: The subgraph lacks
a user-facing negative prompt because the second CLIPTextEncodeLongCatImage node
(node id 5, type CLIPTextEncodeLongCatImage) has its "text" input unlinked; add
a subgraph input (e.g., inp-neg-text) for the negative prompt and create a link
from that subgraph input to node 5's "text" input so users can supply a negative
prompt, or if the node is intentionally unused, remove node id 5 and its links
(including the link to KSampler) to avoid confusion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests-unit/comfy_test/model_detection_test.py`:
- Around line 8-36: The fixture _make_longcat_comfyui_sd allocates very large
tensors (H=3072) causing multi-GB memory usage; change H to a small value (e.g.,
32) while keeping C_CTX at 3584 so txt_in.weight.shape[1] still reflects the
real context dim, and keep other shape formulas (C_IN, C_CTX, and all uses like
"img_in.weight", "txt_in.weight", "time_in.*", "final_layer.*", and blocks in
"double_blocks.*" and "single_blocks.*") unchanged so the detection logic that
reads tensor shapes and key presence continues to work but without large memory
allocations.


ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 449282d and 310924a.

📒 Files selected for processing (5)
  • blueprints/Text to Image (LongCat-Image).json
  • comfy/model_detection.py
  • comfy/supported_models.py
  • comfy/text_encoders/longcat_image.py
  • tests-unit/comfy_test/model_detection_test.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • comfy/model_detection.py
  • comfy/text_encoders/longcat_image.py

Comment on lines +8 to +36
def _make_longcat_comfyui_sd():
    """Minimal ComfyUI-format state dict for pre-converted LongCat-Image weights."""
    sd = {}
    H = 3072
    C_IN = 16
    C_CTX = 3584

    sd["img_in.weight"] = torch.empty(H, C_IN * 4)
    sd["img_in.bias"] = torch.empty(H)
    sd["txt_in.weight"] = torch.empty(H, C_CTX)
    sd["txt_in.bias"] = torch.empty(H)

    sd["time_in.in_layer.weight"] = torch.empty(H, 256)
    sd["time_in.in_layer.bias"] = torch.empty(H)
    sd["time_in.out_layer.weight"] = torch.empty(H, H)
    sd["time_in.out_layer.bias"] = torch.empty(H)

    sd["final_layer.adaLN_modulation.1.weight"] = torch.empty(2 * H, H)
    sd["final_layer.adaLN_modulation.1.bias"] = torch.empty(2 * H)
    sd["final_layer.linear.weight"] = torch.empty(C_IN * 4, H)
    sd["final_layer.linear.bias"] = torch.empty(C_IN * 4)

    for i in range(19):
        sd[f"double_blocks.{i}.img_attn.norm.key_norm.weight"] = torch.empty(128)
        sd[f"double_blocks.{i}.img_attn.qkv.weight"] = torch.empty(3 * H, H)
        sd[f"double_blocks.{i}.img_mod.lin.weight"] = torch.empty(H, H)
    for i in range(38):
        sd[f"single_blocks.{i}.modulation.lin.weight"] = torch.empty(H, H)


⚠️ Potential issue | 🟠 Major

CodeRabbit's verification scripts measured the fixture's memory footprint: with H=3072 the state dict allocates several GB of tensors. Shrinking H to a small value (e.g. 32) while keeping C_CTX=3584, so txt_in.weight still reflects the real context dim, preserves the shape-based detection logic without the large allocations.
elements_small["img_in.weight"] = H_proposed * (C_IN * 4)
elements_small["img_in.bias"] = H_proposed
elements_small["txt_in.weight"] = H_proposed * C_CTX
elements_small["txt_in.bias"] = H_proposed
elements_small["time_in.in_layer.weight"] = H_proposed * 256
elements_small["time_in.in_layer.bias"] = H_proposed
elements_small["time_in.out_layer.weight"] = H_proposed * H_proposed
elements_small["time_in.out_layer.bias"] = H_proposed
elements_small["final_layer.adaLN_modulation.1.weight"] = (2 * H_proposed) * H_proposed
elements_small["final_layer.adaLN_modulation.1.bias"] = 2 * H_proposed
elements_small["final_layer.linear.weight"] = (C_IN * 4) * H_proposed
elements_small["final_layer.linear.bias"] = C_IN * 4

for i in range(19):
    elements_small[f"double_blocks.{i}.img_attn.norm.key_norm.weight"] = 128
    elements_small[f"double_blocks.{i}.img_attn.qkv.weight"] = 3 * H_proposed * H_proposed
    elements_small[f"double_blocks.{i}.img_mod.lin.weight"] = H_proposed * H_proposed

for i in range(38):
    elements_small[f"single_blocks.{i}.modulation.lin.weight"] = H_proposed * H_proposed

total_elements_small = sum(elements_small.values())
total_bytes_small = total_elements_small * 4
total_gb_small = total_bytes_small / (1024**3)

print(f"Longcat with H=32:")
print(f"  Total elements: {total_elements_small:,}")
print(f"  Total memory: {total_gb_small:.4f} GB ({total_bytes_small / (1024**2):.2f} MB)")
print()
print(f"Reduction factor: {total_gb / total_gb_small:.0f}x smaller")
EOF

Repository: Comfy-Org/ComfyUI

Length of output: 252


🏁 Script executed:

# Check how detect_unet_config and model_config_from_unet_config use the state dict
rg -A 10 "def detect_unet_config" --type py

Repository: Comfy-Org/ComfyUI

Length of output: 1030


🏁 Script executed:

# Search for how these detection functions inspect the state dict
rg "double_blocks|single_blocks|img_attn" comfy/ --type py -A 2 | head -100

Repository: Comfy-Org/ComfyUI

Length of output: 7348


🏁 Script executed:

# Check the git diff to see if these test functions are newly added
git diff HEAD~1 tests-unit/comfy_test/model_detection_test.py 2>/dev/null | head -150 || echo "Git diff not available, checking if file is new"

Repository: Comfy-Org/ComfyUI

Length of output: 43


🏁 Script executed:

# Check when this test file was created
git log --follow --format="%H %ad %s" -- tests-unit/comfy_test/model_detection_test.py 2>/dev/null | head -5 || echo "No git history"

Repository: Comfy-Org/ComfyUI

Length of output: 180


🏁 Script executed:

# Let's check the actual detection logic more carefully - see if shapes/values are read
rg "state_dict\[" comfy/model_detection.py -A 1 -B 1 | head -150

Repository: Comfy-Org/ComfyUI

Length of output: 7974


🏁 Script executed:

# Verify the detection logic doesn't hardcode H values
rg "3072|4096" comfy/model_detection.py

Repository: Comfy-Org/ComfyUI

Length of output: 438


🏁 Script executed:

# Check if there are any shape-based assertions or value comparisons for LongCat/Flux detection
rg -A 5 "LongCat|FluxSchnell" comfy/supported_models.py | head -50

Repository: Comfy-Org/ComfyUI

Length of output: 1759


🏁 Script executed:

# Verify the detection logic reads shapes from state_dict tensors
rg "in_channels|context_in_dim|hidden_size" comfy/model_detection.py | grep "state_dict\[" | head -20

Repository: Comfy-Org/ComfyUI

Length of output: 1562


🏁 Script executed:

# Check if detection has any hardcoded dimension checks for LongCat specifically
rg "3584|LongCat" comfy/model_detection.py -B 3 -A 3

Repository: Comfy-Org/ComfyUI

Length of output: 495


Reduce test tensor sizes to avoid multi-GB allocations.

These fixtures allocate several gigabytes of tensors (H=3072 with 19+38 blocks ~4.16 GB total), which can cause CI failures. The detection logic only inspects tensor shapes and key presence, not values—reducing H to 32 while keeping C_CTX at 3584 preserves correctness since context_in_dim is read from txt_in.weight.shape[1].

🧩 Suggested change
 def _make_longcat_comfyui_sd():
     """Minimal ComfyUI-format state dict for pre-converted LongCat-Image weights."""
     sd = {}
-    H = 3072
+    H = 32  # keep tests lightweight; only shapes and keys are used
     C_IN = 16
     C_CTX = 3584

 def _make_flux_schnell_comfyui_sd():
     """Minimal ComfyUI-format state dict for standard Flux Schnell."""
     sd = {}
-    H = 3072
+    H = 32  # keep tests lightweight; only shapes and keys are used
     C_IN = 16
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/comfy_test/model_detection_test.py` around lines 8 - 36, The
fixture _make_longcat_comfyui_sd allocates very large tensors (H=3072) causing
multi-GB memory usage; change H to a small value (e.g., 32) while keeping C_CTX
at 3584 so txt_in.weight.shape[1] still reflects the real context dim, and keep
other shape formulas (C_IN, C_CTX, and all uses like "img_in.weight",
"txt_in.weight", "time_in.*", "final_layer.*", and blocks in "double_blocks.*"
and "single_blocks.*") unchanged so the detection logic that reads tensor shapes
and key presence continues to work but without large memory allocations.
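The rationale above can be sketched in a few lines. This is a hypothetical shape-only detector, not ComfyUI's actual `detect_unet_config`; it only illustrates that `context_in_dim` is recovered from `txt_in.weight.shape[1]`, so shrinking the hidden size H in test fixtures cannot change the detection result:

```python
# Sketch: a toy detector that, like the real one, reads only tensor shapes.
# State dicts are modeled as {key: shape_tuple}; no tensors are allocated.
def infer_context_in_dim(shapes):
    # context_in_dim comes from txt_in.weight.shape[1]; H (shape[0]) is unused
    return shapes["txt_in.weight"][1]

full = {"txt_in.weight": (3072, 3584)}  # realistic fixture: ~42 MB as float32
tiny = {"txt_in.weight": (32, 3584)}    # lightweight fixture: ~448 KB

assert infer_context_in_dim(full) == infer_context_in_dim(tiny) == 3584
```

The same reasoning applies to every other key the fixtures build: only shapes and key presence are consulted, so H is a free parameter.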

@Talmaj
Author

Talmaj commented Feb 25, 2026

I've removed the change in model_config_from_unet_config and put the LongCat-Image model config in front of FluxSchnell. It now auto-selects correctly, and it shouldn't influence the auto-selection of other models:

LongCatImage's 5-key config only matches when context_in_dim == 3584, vec_in_dim is None, and
txt_ids_dims == [1, 2] all hold, so it never falsely matches other Flux variants.
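The first-match ordering described above can be sketched roughly as follows. This is a simplified toy registry, not ComfyUI's real config-matching code, and it checks only 3 of the keys mentioned; it just shows why placing the more specific LongCat-Image entry before FluxSchnell is safe:

```python
# Toy first-match config registry. A config matches when every key it
# requires has the expected value in the detected unet_config. Note the
# sketch conflates "missing key" with an explicit None via dict.get().
def matches(config, unet_config):
    return all(unet_config.get(k) == v for k, v in config["keys"].items())

LONGCAT = {"name": "LongCatImage",
           "keys": {"context_in_dim": 3584, "vec_in_dim": None,
                    "txt_ids_dims": [1, 2]}}
SCHNELL = {"name": "FluxSchnell", "keys": {"context_in_dim": 4096}}

def select(unet_config, registry=(LONGCAT, SCHNELL)):
    for cfg in registry:          # first match wins, so order matters
        if matches(cfg, unet_config):
            return cfg["name"]
    return None

assert select({"context_in_dim": 3584, "vec_in_dim": None,
               "txt_ids_dims": [1, 2]}) == "LongCatImage"
assert select({"context_in_dim": 4096}) == "FluxSchnell"
```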


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (3)
tests-unit/comfy_test/model_detection_test.py (2)

7-36: ⚠️ Potential issue | 🟠 Major

Reduce H to avoid multi-GB allocations in CI (still unaddressed).

With H = 3072, _make_longcat_comfyui_sd allocates well over 4 GB of float32 tensors (19 double blocks alone account for 19 × (3×3072²) ≈ 537 M elements). Detection logic only inspects .shape and key presence—H = 32 works identically because context_in_dim is inferred from txt_in.weight.shape[1] = C_CTX = 3584, which is unchanged.

🔧 Proposed fix
 def _make_longcat_comfyui_sd():
     """Minimal ComfyUI-format state dict for pre-converted LongCat-Image weights."""
     sd = {}
-    H = 3072
+    H = 32  # keep tests lightweight; only shapes and key presence are used
     C_IN = 16
     C_CTX = 3584
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/comfy_test/model_detection_test.py` around lines 7 - 36, The test
helper _make_longcat_comfyui_sd creates very large tensors (H = 3072) causing
multi-GB allocations; change H to a small value (e.g., 32) and update any tensor
shapes that use H (all occurrences in sd keys like "img_in.weight",
"img_in.bias", "txt_in.weight", "txt_in.bias", "time_in.*", "final_layer.*", and
the loops creating "double_blocks.{i}..." and "single_blocks.{i}...") so the
detection logic still sees correct dimensionality but with tiny allocations;
leave C_CTX and loop counts unchanged so context_in_dim inference via
txt_in.weight.shape[1] remains the same.

39-59: ⚠️ Potential issue | 🟠 Major

Same H = 3072 allocation issue in _make_flux_schnell_comfyui_sd.

Same fix applies; context_in_dim is read from txt_in.weight.shape[1] = 4096, independent of H.

🔧 Proposed fix
 def _make_flux_schnell_comfyui_sd():
     """Minimal ComfyUI-format state dict for standard Flux Schnell."""
     sd = {}
-    H = 3072
+    H = 32  # keep tests lightweight; only shapes and key presence are used
     C_IN = 16
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests-unit/comfy_test/model_detection_test.py` around lines 39 - 59, The test
helper _make_flux_schnell_comfyui_sd hardcodes txt_in.weight with shape (H,
4096) which conflates H with context dimension; introduce a separate variable
(e.g., CONTEXT_IN = 4096) and allocate txt_in.weight as torch.empty(H,
CONTEXT_IN) (and use CONTEXT_IN wherever the code should reflect the
context/input embedding width), leaving H = 3072 for channel/hidden sizes—this
ensures context_in_dim is read correctly from txt_in.weight.shape[1] and avoids
mixing H and context dimensions.
comfy/text_encoders/longcat_image.py (1)

137-143: ⚠️ Potential issue | 🟠 Major

The template_end == -1 guard fires too late; the +3 check can accidentally fire on index 0/1 (still unaddressed).

When no <|im_start|> (151644) token is found, template_end stays -1 after the loop. The block at Lines 137–140 then evaluates out.shape[1] > 2 (almost always True) and accidentally inspects tok_pairs[0] and tok_pairs[1] (because -1 + 1 = 0 and -1 + 2 = 1). If those tokens happen to be 872 and 198, template_end becomes 2 and the guard at Line 142 is bypassed, causing out[:, 2:] to silently discard the first two tokens.

The fix is to only run the +3 newline adjustment when template_end was actually set by the loop:

🛠️ Proposed fix
-        if out.shape[1] > (template_end + 3):
-            if tok_pairs[template_end + 1][0] == 872:
-                if tok_pairs[template_end + 2][0] == 198:
-                    template_end += 3
-
-        if template_end == -1:
-            template_end = 0
+        if template_end == -1:
+            template_end = 0
+        elif out.shape[1] > (template_end + 3):
+            if tok_pairs[template_end + 1][0] == 872:
+                if tok_pairs[template_end + 2][0] == 198:
+                    template_end += 3
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@comfy/text_encoders/longcat_image.py` around lines 137 - 143, The post-loop
"+3" adjustment currently runs even when template_end is still -1 and can index
tok_pairs[0/1]; change the logic so the checks that inspect
tok_pairs[template_end + 1] and tok_pairs[template_end + 2] only run when
template_end != -1 (i.e., the loop actually found the <|im_start|> marker).
Concretely, wrap the entire if-block that tests out.shape and tok_pairs[...]
with a guard like "if template_end != -1 and out.shape[1] > (template_end +
3):", leaving the existing fallback that sets template_end = 0 after that. This
ensures tok_pairs and template_end adjustments only occur when template_end was
set by the earlier search.
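The sentinel pitfall above is easy to reproduce in isolation. This is an illustrative standalone repro (hand-made token pairs, `len(tok_pairs)` standing in for `out.shape[1]`), not the real tokenizer output:

```python
# Repro of the -1 sentinel bug: with template_end = -1, the "+1"/"+2"
# lookups silently read the first two tokens (indices 0 and 1).
tok_pairs = [(872, None), (198, None), (12345, None), (67, None)]
template_end = -1  # <|im_start|> (151644) was never found

# Buggy order: the +3 newline adjustment runs before the -1 fallback
if len(tok_pairs) > template_end + 3:                       # 4 > 2 -> True
    if tok_pairs[template_end + 1][0] == 872:               # tok_pairs[0]
        if tok_pairs[template_end + 2][0] == 198:           # tok_pairs[1]
            template_end += 3                               # now 2
assert template_end == 2  # guard bypassed; first two tokens would be lost

# Fixed order: resolve the sentinel first, then adjust
template_end = -1
if template_end == -1:
    template_end = 0
elif len(tok_pairs) > template_end + 3 \
        and tok_pairs[template_end + 1][0] == 872 \
        and tok_pairs[template_end + 2][0] == 198:
    template_end += 3
assert template_end == 0  # no tokens discarded
```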
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@comfy/text_encoders/longcat_image.py`:
- Around line 137-143: The post-loop "+3" adjustment currently runs even when
template_end is still -1 and can index tok_pairs[0/1]; change the logic so the
checks that inspect tok_pairs[template_end + 1] and tok_pairs[template_end + 2]
only run when template_end != -1 (i.e., the loop actually found the <|im_start|>
marker). Concretely, wrap the entire if-block that tests out.shape and
tok_pairs[...] with a guard like "if template_end != -1 and out.shape[1] >
(template_end + 3):", leaving the existing fallback that sets template_end = 0
after that. This ensures tok_pairs and template_end adjustments only occur when
template_end was set by the earlier search.

In `@tests-unit/comfy_test/model_detection_test.py`:
- Around line 7-36: The test helper _make_longcat_comfyui_sd creates very large
tensors (H = 3072) causing multi-GB allocations; change H to a small value
(e.g., 32) and update any tensor shapes that use H (all occurrences in sd keys
like "img_in.weight", "img_in.bias", "txt_in.weight", "txt_in.bias",
"time_in.*", "final_layer.*", and the loops creating "double_blocks.{i}..." and
"single_blocks.{i}...") so the detection logic still sees correct dimensionality
but with tiny allocations; leave C_CTX and loop counts unchanged so
context_in_dim inference via txt_in.weight.shape[1] remains the same.
- Around line 39-59: The test helper _make_flux_schnell_comfyui_sd hardcodes
txt_in.weight with shape (H, 4096) which conflates H with context dimension;
introduce a separate variable (e.g., CONTEXT_IN = 4096) and allocate
txt_in.weight as torch.empty(H, CONTEXT_IN) (and use CONTEXT_IN wherever the
code should reflect the context/input embedding width), leaving H = 3072 for
channel/hidden sizes—this ensures context_in_dim is read correctly from
txt_in.weight.shape[1] and avoids mixing H and context dimensions.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 310924a and 4b6fe40.

📒 Files selected for processing (6)
  • blueprints/Text to Image (LongCat-Image).json
  • comfy/model_detection.py
  • comfy/supported_models.py
  • comfy/text_encoders/longcat_image.py
  • comfy_extras/nodes_longcat_image.py
  • tests-unit/comfy_test/model_detection_test.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • comfy/model_detection.py
  • blueprints/Text to Image (LongCat-Image).json

