Native LongCat-Image implementation #12597
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in your CodeRabbit settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough
Adds LongCat-Image support across the codebase: a new Text-to-Image blueprint JSON; a new LongCatImage model class and Flux adjustments in comfy/model_base.py; a UNet detection update in comfy/model_detection.py; a CLIPType enum entry and LONGCAT_IMAGE text-encoder loading in comfy/sd.py; a new supported-model entry in comfy/supported_models.py; a LongCatImage tokenizer/TE implementation in comfy/text_encoders/longcat_image.py; two Comfy nodes and an extension in comfy_extras/nodes_longcat_image.py; a CLIPLoader option and extras registration; and unit tests for model detection and conversion.
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 3
🧹 Nitpick comments (1)
tests-unit/comfy_test/model_detection_test.py (1)
73-73: Unused variable `original_models`.
`original_models` is assigned but never referenced. Likely leftover from a manual save/restore approach that was replaced by `patch.object`.
🧹 Remove unused variable
```diff
 sd = _make_longcat_diffusers_sd()
 unet_config = detect_unet_config(sd, "")
-original_models = comfy.supported_models.models
 longcat_cls = comfy.supported_models.LongCatImage
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/comfy_test/model_detection_test.py` at line 73, Remove the unused local variable original_models assigned from comfy.supported_models.models in the test; since patch.object is handling temporary replacement/restore, delete the assignment to original_models to eliminate the dead code and keep the test clean (look for the assignment to original_models and the reference to comfy.supported_models.models in model_detection_test.py).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy_extras/nodes_longcat_image.py`:
- Around line 61-82: The code assumes H and W are divisible by ps=2 before
reshaping (see noise.reshape(...), cond_packed reshape and renormed), which will
raise on odd spatial sizes; add a defensive check right after B, C, H, W =
denoised.shape to verify H % ps == 0 and W % ps == 0 and raise a clear
ValueError including the offending H/W and ps (or alternatively apply symmetric
padding to x/denoised/cond_denoised to make them divisible by ps before the
pack/unpack operations), then proceed with the existing noise/cond packing,
scaling and renorming logic unchanged.
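A minimal sketch of the suggested guard, reusing the names from the comment above (`denoised`, `ps`); illustrative only, not the actual patch:

```python
B, C, H, W = denoised.shape
# Packing into ps x ps patches requires both spatial dims to divide evenly.
if H % ps != 0 or W % ps != 0:
    raise ValueError(
        f"Latent spatial size {H}x{W} must be divisible by patch size {ps}"
    )
```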
In `@comfy/text_encoders/longcat_image.py`:
- Around line 70-95: In tokenize_with_weights: avoid letting
base_tok.tokenize_with_weights produce 512-length padding before you add the
LongCat template; call base_tok.tokenize_with_weights with padding disabled
(e.g., disable_padding=True or equivalent kwarg) so prompt_pairs is produced
without pre-padding, then build prefix_pairs, prompt_pairs and suffix_pairs into
combined, and only after combining perform truncation/padding to model length
(use your tokenizer's pad/truncate utility or call the shared super method once
on the final combined token list) so prefix_ids/suffix_ids are not separated by
mid-prompt padding; refer to tokenize_with_weights,
base_tok.tokenize_with_weights, prefix_ids, suffix_ids, prompt_pairs and
combined to locate where to change.
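A rough sketch of that flow; the kwarg and helper names are taken from the prompt above and are assumptions, not the real tokenizer API:

```python
# Tokenize the user prompt without pre-padding (kwarg name assumed).
prompt_pairs = base_tok.tokenize_with_weights(text, disable_padding=True)

# Splice the LongCat template around it, then pad/truncate exactly once.
combined = prefix_pairs + prompt_pairs + suffix_pairs
max_length = 512
combined = combined[:max_length]
combined += [(pad_token_id, 1.0)] * (max_length - len(combined))
```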
- Around line 102-144: The slice logic in encode_token_weights can use
template_end == -1 (no <|im_start|> found) which makes out = out[:, -1:] (last
token); change encode_token_weights to treat a missing marker by setting
template_end = 0 before slicing (or otherwise avoid negative slice) and only
apply the "+3 newline adjustment" when a real marker was detected (i.e., only
run the tok_pairs[template_end + 1]/[+2] checks if template_end was set from the
loop). Update references in encode_token_weights (template_end, tok_pairs, out,
extra, suffix_start) so the slicing/out = out[:, template_end:] and subsequent
attention_mask adjustments are guarded by the marker presence to avoid
accidentally keeping only the last token.
---
Nitpick comments:
In `@tests-unit/comfy_test/model_detection_test.py`:
- Line 73: Remove the unused local variable original_models assigned from
comfy.supported_models.models in the test; since patch.object is handling
temporary replacement/restore, delete the assignment to original_models to
eliminate the dead code and keep the test clean (look for the assignment to
original_models and the reference to comfy.supported_models.models in
model_detection_test.py).
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- blueprints/Text to Image (LongCat-Image).json
- comfy/model_base.py
- comfy/model_detection.py
- comfy/sd.py
- comfy/supported_models.py
- comfy/text_encoders/longcat_image.py
- comfy_extras/nodes_longcat_image.py
- nodes.py
- tests-unit/comfy_test/model_detection_test.py
bfd302f to 81a3792 (Compare)
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests-unit/comfy_test/model_detection_test.py (2)
84-89: Consider `isinstance` over `type(...).__name__` for class identity assertions.
String-based class-name checks can silently match an unrelated class with the same name and offer no refactoring safety if the class is renamed.
♻️ Proposed fix
```diff
-    assert type(result).__name__ == "LongCatImage", (
-        f"Expected LongCatImage with order {label}, got {type(result).__name__}"
-    )
+    assert isinstance(result, comfy.supported_models.LongCatImage), (
+        f"Expected LongCatImage with order {label}, got {type(result)}"
+    )
```
And at line 101:
```diff
-    assert type(model_config).__name__ == "LongCatImage"
+    assert isinstance(model_config, comfy.supported_models.LongCatImage)
```
Also applies to: 99-101
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/comfy_test/model_detection_test.py` around lines 84 - 89, Replace fragile string-based class checks with real type checks: instead of asserting type(result).__name__ == "LongCatImage" use an isinstance assertion against the actual class (e.g., assert isinstance(result, LongCatImage)). Update both occurrences (the assertion around model_config_from_unet_config and the similar check at lines ~99-101) and ensure LongCatImage is imported or referenced from the correct module so the isinstance call resolves.
103-113: Test only verifies key presence/absence, skipping the two non-trivial transforms.
The PR description calls out Q/K/V fusion and a scale/shift half-swap in `process_unet_state_dict` as the critical parts of the conversion. Neither is exercised here; verifying that, say, `transformer_blocks.0.attn.to_q.weight` + `to_k.weight` + `to_v.weight` are fused into a single `double_blocks.0.img_attn.qkv.weight` with the right shape, and that the `norm_out` scale/shift halves are swapped, would meaningfully increase confidence in the conversion correctness.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/comfy_test/model_detection_test.py` around lines 103 - 113, The test test_longcat_process_unet_state_dict_converts_keys only checks presence/absence of keys but does not validate the two non-trivial transforms in process_unet_state_dict: Q/K/V fusion and the norm_out scale/shift half-swap. Update the test to build source weights for transformer_blocks.0.attn.to_q/to_k/to_v and the norm_out affine, run converted = model_config.process_unet_state_dict(...), then assert the fused tensor exists at double_blocks.0.img_attn.qkv.weight with the expected concatenated shape and contents (verify slices match original to_q/k/v), and assert norm_out parameters have their scale/shift halves swapped compared to the input; reference test_longcat_process_unet_state_dict_converts_keys, process_unet_state_dict, transformer_blocks.*, attn.to_q/to_k/to_v, double_blocks.0.img_attn.qkv.weight, and norm_out in your assertions.
comfy_extras/nodes_longcat_image.py (1)
82-83: Replace deprecated `torch.norm` with `torch.linalg.vector_norm`.
`torch.norm` is deprecated and may be removed in a future PyTorch release; its documentation and behavior may be incorrect, and it is no longer actively maintained. The recommended replacement for vector norms is `torch.linalg.vector_norm()`.
♻️ Proposed fix
```diff
-    noise_norm = torch.norm(noise_packed, dim=-1, keepdim=True)
-    cond_norm = torch.norm(cond_packed, dim=-1, keepdim=True)
+    noise_norm = torch.linalg.vector_norm(noise_packed, dim=-1, keepdim=True)
+    cond_norm = torch.linalg.vector_norm(cond_packed, dim=-1, keepdim=True)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy_extras/nodes_longcat_image.py` around lines 82 - 83, Replace deprecated torch.norm calls computing per-vector norms for noise_packed and cond_packed with torch.linalg.vector_norm; specifically update the expressions that assign noise_norm and cond_norm (currently using torch.norm(noise_packed, dim=-1, keepdim=True) and torch.norm(cond_packed, dim=-1, keepdim=True)) to use torch.linalg.vector_norm(noise_packed, dim=-1, keepdim=True) and torch.linalg.vector_norm(cond_packed, dim=-1, keepdim=True) respectively so behavior and signature remain the same but use the supported API.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/model_detection.py`:
- Around line 285-313: The current three-key detection in model_detection.py
(keys '{}x_embedder.weight', '{}transformer_blocks.0.attn.to_q.weight',
'{}single_transformer_blocks.0.attn.to_q.weight' using key_prefix) misidentifies
vanilla Flux diffusers as LongCat-Image; tighten the condition by requiring a
LongCat-specific key (use the existing ctx_key =
'{}context_embedder.weight'.format(key_prefix)) to be present as an additional
positive constraint before building dit_config, so only state dicts that include
context_embedder.weight are considered LongCat-Image (leave count_blocks usage
and subsequent field population unchanged).
---
Nitpick comments:
In `@comfy_extras/nodes_longcat_image.py`:
- Around line 82-83: Replace deprecated torch.norm calls computing per-vector
norms for noise_packed and cond_packed with torch.linalg.vector_norm;
specifically update the expressions that assign noise_norm and cond_norm
(currently using torch.norm(noise_packed, dim=-1, keepdim=True) and
torch.norm(cond_packed, dim=-1, keepdim=True)) to use
torch.linalg.vector_norm(noise_packed, dim=-1, keepdim=True) and
torch.linalg.vector_norm(cond_packed, dim=-1, keepdim=True) respectively so
behavior and signature remain the same but use the supported API.
In `@tests-unit/comfy_test/model_detection_test.py`:
- Around line 84-89: Replace fragile string-based class checks with real type
checks: instead of asserting type(result).__name__ == "LongCatImage" use an
isinstance assertion against the actual class (e.g., assert isinstance(result,
LongCatImage)). Update both occurrences (the assertion around
model_config_from_unet_config and the similar check at lines ~99-101) and ensure
LongCatImage is imported or referenced from the correct module so the isinstance
call resolves.
- Around line 103-113: The test
test_longcat_process_unet_state_dict_converts_keys only checks presence/absence
of keys but does not validate the two non-trivial transforms in
process_unet_state_dict: Q/K/V fusion and the norm_out scale/shift half-swap.
Update the test to build source weights for
transformer_blocks.0.attn.to_q/to_k/to_v and the norm_out affine, run converted
= model_config.process_unet_state_dict(...), then assert the fused tensor exists
at double_blocks.0.img_attn.qkv.weight with the expected concatenated shape and
contents (verify slices match original to_q/k/v), and assert norm_out parameters
have their scale/shift halves swapped compared to the input; reference
test_longcat_process_unet_state_dict_converts_keys, process_unet_state_dict,
transformer_blocks.*, attn.to_q/to_k/to_v, double_blocks.0.img_attn.qkv.weight,
and norm_out in your assertions.
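A sketch of what those stronger assertions could look like, assuming a `[q; k; v]` concatenation order and the output key names referenced above (the tensor values and the `final_layer.adaLN_modulation.1.weight` target key are illustrative assumptions, not the actual test):

```python
import torch

def test_qkv_fusion_and_norm_swap(model_config):
    q = torch.arange(12.0).reshape(4, 3)
    k = q + 100
    v = q + 200
    scale_shift = torch.arange(8.0).reshape(8, 1)  # first half scale, second half shift
    sd = {
        "transformer_blocks.0.attn.to_q.weight": q,
        "transformer_blocks.0.attn.to_k.weight": k,
        "transformer_blocks.0.attn.to_v.weight": v,
        "norm_out.linear.weight": scale_shift,
    }
    converted = model_config.process_unet_state_dict(sd)

    qkv = converted["double_blocks.0.img_attn.qkv.weight"]
    assert qkv.shape == (12, 3)
    assert torch.equal(qkv[0:4], q)
    assert torch.equal(qkv[4:8], k)
    assert torch.equal(qkv[8:12], v)

    # [scale | shift] halves should come back as [shift | scale].
    swapped = converted["final_layer.adaLN_modulation.1.weight"]
    assert torch.equal(swapped, torch.cat([scale_shift[4:], scale_shift[:4]], dim=0))
```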
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
- blueprints/Text to Image (LongCat-Image).json
- comfy/model_base.py
- comfy/model_detection.py
- comfy/sd.py
- comfy/supported_models.py
- comfy/text_encoders/longcat_image.py
- comfy_extras/nodes_longcat_image.py
- nodes.py
- tests-unit/comfy_test/model_detection_test.py
✅ Files skipped from review due to trivial changes (1)
- blueprints/Text to Image (LongCat-Image).json
🚧 Files skipped from review as they are similar to previous changes (1)
- nodes.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@comfy/model_detection.py`:
- Around line 827-838: The selection currently adds
len(model_config.required_keys) into specificity even when no state_dict was
provided, biasing picks; modify model_config_from_unet_config so that
required_keys are only counted when a state_dict is present/used (i.e., only add
len(model_config.required_keys) to specificity if state_dict is not None and
thus those keys could be validated), preserving the previous first-match
ordering when state_dict is None; refer to model_config_from_unet_config,
best_specificity, model_config.required_keys and the matches(...) call to
implement this conditional weighting.
comfy/model_detection.py (Outdated)

```diff
 def model_config_from_unet_config(unet_config, state_dict=None):
+    best = None
+    best_specificity = -1
     for model_config in comfy.supported_models.models:
         if model_config.matches(unet_config, state_dict):
-            return model_config(unet_config)
+            specificity = len(model_config.unet_config) + len(model_config.required_keys)
+            if specificity > best_specificity:
+                best = model_config
+                best_specificity = specificity
+
+    if best is not None:
+        return best(unet_config)
```
Avoid weighting required_keys when state_dict is None.
Line 832: specificity now includes required_keys even when they weren’t validated (e.g., model_config_from_diffusers_unet passes no state_dict). This can bias selection toward configs with larger required_keys sets and change behavior vs. the previous “first match” ordering.
Proposed fix
```diff
-        specificity = len(model_config.unet_config) + len(model_config.required_keys)
+        specificity = len(model_config.unet_config) + (len(model_config.required_keys) if state_dict is not None else 0)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@comfy/model_detection.py` around lines 827 - 838, The selection currently
adds len(model_config.required_keys) into specificity even when no state_dict
was provided, biasing picks; modify model_config_from_unet_config so that
required_keys are only counted when a state_dict is present/used (i.e., only add
len(model_config.required_keys) to specificity if state_dict is not None and
thus those keys could be validated), preserving the previous first-match
ordering when state_dict is None; refer to model_config_from_unet_config,
best_specificity, model_config.required_keys and the matches(...) call to
implement this conditional weighting.
comfy/supported_models.py (Outdated)

```python
prefix = "double_blocks.{}.".format(idx)

if rest.startswith("norm1.linear."):
    out_sd[prefix + "img_mod.lin." + rest.split(".")[-1]] = v
```
Something we can do is repackage/convert the model weights to a format that is easier to load in ComfyUI instead of having all this logic here; it can be a conversion.py file in the HF repo where we host our converted weights.
Is there some standard for repackaging that I can follow?
The standard is to make something that doesn't need to be converted by comfyui internally to work.
comfy/model_detection.py (Outdated)

```python
        return dit_config

    if '{}x_embedder.weight'.format(key_prefix) in state_dict_keys and '{}transformer_blocks.0.attn.to_q.weight'.format(key_prefix) in state_dict_keys and '{}single_transformer_blocks.0.attn.to_q.weight'.format(key_prefix) in state_dict_keys and '{}context_embedder.weight'.format(key_prefix) in state_dict_keys and '{}time_embed.timestep_embedder.linear_1.weight'.format(key_prefix) in state_dict_keys: #LongCat-Image (diffusers format, Flux variant)
```
Repackaging the model file ourselves will also make this logic easier.
Actionable comments posted: 1
🧹 Nitpick comments (1)
blueprints/Text to Image (LongCat-Image).json (1)
1-1: Expose a negative prompt input in the subgraph.
The second `CLIPTextEncodeLongCatImage` node isn't wired to any input, so users can't set a negative prompt from the blueprint. Consider adding a subgraph input and link for negative text (or remove the node if it's intentionally fixed).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `blueprints/Text to Image (LongCat-Image).json` at line 1, The subgraph lacks a user-facing negative prompt because the second CLIPTextEncodeLongCatImage node (node id 5, type CLIPTextEncodeLongCatImage) has its "text" input unlinked; add a subgraph input (e.g., inp-neg-text) for the negative prompt and create a link from that subgraph input to node 5's "text" input so users can supply a negative prompt, or if the node is intentionally unused, remove node id 5 and its links (including the link to KSampler) to avoid confusion.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests-unit/comfy_test/model_detection_test.py`:
- Around line 8-36: The fixture _make_longcat_comfyui_sd allocates very large
tensors (H=3072) causing multi-GB memory usage; change H to a small value (e.g.,
32) while keeping C_CTX at 3584 so txt_in.weight.shape[1] still reflects the
real context dim, and keep other shape formulas (C_IN, C_CTX, and all uses like
"img_in.weight", "txt_in.weight", "time_in.*", "final_layer.*", and blocks in
"double_blocks.*" and "single_blocks.*") unchanged so the detection logic that
reads tensor shapes and key presence continues to work but without large memory
allocations.
---
Nitpick comments:
In `blueprints/Text to Image (LongCat-Image).json`:
- Line 1: The subgraph lacks a user-facing negative prompt because the second
CLIPTextEncodeLongCatImage node (node id 5, type CLIPTextEncodeLongCatImage) has
its "text" input unlinked; add a subgraph input (e.g., inp-neg-text) for the
negative prompt and create a link from that subgraph input to node 5's "text"
input so users can supply a negative prompt, or if the node is intentionally
unused, remove node id 5 and its links (including the link to KSampler) to avoid
confusion.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- blueprints/Text to Image (LongCat-Image).json
- comfy/model_detection.py
- comfy/supported_models.py
- comfy/text_encoders/longcat_image.py
- tests-unit/comfy_test/model_detection_test.py
🚧 Files skipped from review as they are similar to previous changes (2)
- comfy/model_detection.py
- comfy/text_encoders/longcat_image.py
```python
def _make_longcat_comfyui_sd():
    """Minimal ComfyUI-format state dict for pre-converted LongCat-Image weights."""
    sd = {}
    H = 3072
    C_IN = 16
    C_CTX = 3584

    sd["img_in.weight"] = torch.empty(H, C_IN * 4)
    sd["img_in.bias"] = torch.empty(H)
    sd["txt_in.weight"] = torch.empty(H, C_CTX)
    sd["txt_in.bias"] = torch.empty(H)

    sd["time_in.in_layer.weight"] = torch.empty(H, 256)
    sd["time_in.in_layer.bias"] = torch.empty(H)
    sd["time_in.out_layer.weight"] = torch.empty(H, H)
    sd["time_in.out_layer.bias"] = torch.empty(H)

    sd["final_layer.adaLN_modulation.1.weight"] = torch.empty(2 * H, H)
    sd["final_layer.adaLN_modulation.1.bias"] = torch.empty(2 * H)
    sd["final_layer.linear.weight"] = torch.empty(C_IN * 4, H)
    sd["final_layer.linear.bias"] = torch.empty(C_IN * 4)

    for i in range(19):
        sd[f"double_blocks.{i}.img_attn.norm.key_norm.weight"] = torch.empty(128)
        sd[f"double_blocks.{i}.img_attn.qkv.weight"] = torch.empty(3 * H, H)
        sd[f"double_blocks.{i}.img_mod.lin.weight"] = torch.empty(H, H)
    for i in range(38):
        sd[f"single_blocks.{i}.modulation.lin.weight"] = torch.empty(H, H)
```
🧩 Analysis chain
🏁 Script executed:
```shell
# First, let's check if the test file exists and view the exact code
cat -n tests-unit/comfy_test/model_detection_test.py | head -80
```

Repository: Comfy-Org/ComfyUI
Length of output: 3589
🏁 Script executed:
```shell
# Check git status to see if this is a new file or modified
git log --oneline -- tests-unit/comfy_test/model_detection_test.py 2>/dev/null | head -5 || echo "File may be new or not tracked"
```

Repository: Comfy-Org/ComfyUI
Length of output: 116
🏁 Script executed:
```shell
# Calculate memory footprint of the tensors being allocated
python3 << 'EOF'
import torch
# Current allocation (H=3072)
H_current = 3072
C_IN = 16
C_CTX = 3584
# Calculate sizes for _make_longcat_comfyui_sd
tensors_longcat = {
"img_in.weight": (H_current, C_IN * 4),
"img_in.bias": (H_current,),
"txt_in.weight": (H_current, C_CTX),
"txt_in.bias": (H_current,),
"time_in.in_layer.weight": (H_current, 256),
"time_in.in_layer.bias": (H_current,),
"time_in.out_layer.weight": (H_current, H_current),
"time_in.out_layer.bias": (H_current,),
"final_layer.adaLN_modulation.1.weight": (2 * H_current, H_current),
"final_layer.adaLN_modulation.1.bias": (2 * H_current,),
"final_layer.linear.weight": (C_IN * 4, H_current),
"final_layer.linear.bias": (C_IN * 4,),
}
# Add looped tensors (19 blocks)
for i in range(19):
tensors_longcat[f"double_blocks.{i}.img_attn.norm.key_norm.weight"] = (128,)
tensors_longcat[f"double_blocks.{i}.img_attn.qkv.weight"] = (3 * H_current, H_current)
tensors_longcat[f"double_blocks.{i}.img_mod.lin.weight"] = (H_current, H_current)
# Add 38 single blocks
for i in range(38):
tensors_longcat[f"single_blocks.{i}.modulation.lin.weight"] = (H_current, H_current)
# Calculate total memory (float32 = 4 bytes per element)
total_elements = sum(1 for shape in tensors_longcat.values() for _ in [1] if all(isinstance(x, int) for x in shape))
total_elements = sum(torch.Size(shape).numel() for shape in tensors_longcat.values())
total_bytes = total_elements * 4 # float32
total_gb = total_bytes / (1024**3)
print(f"H=3072 allocations:")
print(f" Total elements: {total_elements:,}")
print(f" Total memory: {total_gb:.2f} GB")
print()
# Proposed allocation (H=32)
H_proposed = 32
tensors_longcat_small = {
"img_in.weight": (H_proposed, C_IN * 4),
"img_in.bias": (H_proposed,),
"txt_in.weight": (H_proposed, C_CTX),
"txt_in.bias": (H_proposed,),
"time_in.in_layer.weight": (H_proposed, 256),
"time_in.in_layer.bias": (H_proposed,),
"time_in.out_layer.weight": (H_proposed, H_proposed),
"time_in.out_layer.bias": (H_proposed,),
"final_layer.adaLN_modulation.1.weight": (2 * H_proposed, H_proposed),
"final_layer.adaLN_modulation.1.bias": (2 * H_proposed,),
"final_layer.linear.weight": (C_IN * 4, H_proposed),
"final_layer.linear.bias": (C_IN * 4,),
}
for i in range(19):
tensors_longcat_small[f"double_blocks.{i}.img_attn.norm.key_norm.weight"] = (128,)
tensors_longcat_small[f"double_blocks.{i}.img_attn.qkv.weight"] = (3 * H_proposed, H_proposed)
tensors_longcat_small[f"double_blocks.{i}.img_mod.lin.weight"] = (H_proposed, H_proposed)
for i in range(38):
tensors_longcat_small[f"single_blocks.{i}.modulation.lin.weight"] = (H_proposed, H_proposed)
total_elements_small = sum(torch.Size(shape).numel() for shape in tensors_longcat_small.values())
total_bytes_small = total_elements_small * 4
total_gb_small = total_bytes_small / (1024**3)
print(f"H=32 allocations:")
print(f" Total elements: {total_elements_small:,}")
print(f" Total memory: {total_gb_small:.4f} GB ({total_bytes_small / (1024**2):.2f} MB)")
print()
print(f"Reduction: {total_gb / total_gb_small:.1f}x smaller")
EOF
```

Repository: Comfy-Org/ComfyUI
Length of output: 178
🏁 Script executed:
```shell
# Calculate memory footprint manually (without torch module)
python3 << 'EOF'
# Current allocation (H=3072)
H_current = 3072
C_IN = 16
C_CTX = 3584
# Calculate elements for _make_longcat_comfyui_sd
elements = {}
elements["img_in.weight"] = H_current * (C_IN * 4) # 3072 * 64
elements["img_in.bias"] = H_current
elements["txt_in.weight"] = H_current * C_CTX # 3072 * 3584
elements["txt_in.bias"] = H_current
elements["time_in.in_layer.weight"] = H_current * 256
elements["time_in.in_layer.bias"] = H_current
elements["time_in.out_layer.weight"] = H_current * H_current # 3072 * 3072
elements["time_in.out_layer.bias"] = H_current
elements["final_layer.adaLN_modulation.1.weight"] = (2 * H_current) * H_current # 6144 * 3072
elements["final_layer.adaLN_modulation.1.bias"] = 2 * H_current
elements["final_layer.linear.weight"] = (C_IN * 4) * H_current
elements["final_layer.linear.bias"] = C_IN * 4
# Add loops
for i in range(19):
elements[f"double_blocks.{i}.img_attn.norm.key_norm.weight"] = 128
elements[f"double_blocks.{i}.img_attn.qkv.weight"] = 3 * H_current * H_current
elements[f"double_blocks.{i}.img_mod.lin.weight"] = H_current * H_current
for i in range(38):
elements[f"single_blocks.{i}.modulation.lin.weight"] = H_current * H_current
total_elements = sum(elements.values())
total_bytes = total_elements * 4 # float32
total_gb = total_bytes / (1024**3)
print(f"Longcat with H=3072:")
print(f" Total elements: {total_elements:,}")
print(f" Total memory: {total_gb:.2f} GB")
print()
# Proposed allocation (H=32)
H_proposed = 32
elements_small = {}
elements_small["img_in.weight"] = H_proposed * (C_IN * 4)
elements_small["img_in.bias"] = H_proposed
elements_small["txt_in.weight"] = H_proposed * C_CTX
elements_small["txt_in.bias"] = H_proposed
elements_small["time_in.in_layer.weight"] = H_proposed * 256
elements_small["time_in.in_layer.bias"] = H_proposed
elements_small["time_in.out_layer.weight"] = H_proposed * H_proposed
elements_small["time_in.out_layer.bias"] = H_proposed
elements_small["final_layer.adaLN_modulation.1.weight"] = (2 * H_proposed) * H_proposed
elements_small["final_layer.adaLN_modulation.1.bias"] = 2 * H_proposed
elements_small["final_layer.linear.weight"] = (C_IN * 4) * H_proposed
elements_small["final_layer.linear.bias"] = C_IN * 4
for i in range(19):
elements_small[f"double_blocks.{i}.img_attn.norm.key_norm.weight"] = 128
elements_small[f"double_blocks.{i}.img_attn.qkv.weight"] = 3 * H_proposed * H_proposed
elements_small[f"double_blocks.{i}.img_mod.lin.weight"] = H_proposed * H_proposed
for i in range(38):
elements_small[f"single_blocks.{i}.modulation.lin.weight"] = H_proposed * H_proposed
total_elements_small = sum(elements_small.values())
total_bytes_small = total_elements_small * 4
total_gb_small = total_bytes_small / (1024**3)
print(f"Longcat with H=32:")
print(f" Total elements: {total_elements_small:,}")
print(f" Total memory: {total_gb_small:.4f} GB ({total_bytes_small / (1024**2):.2f} MB)")
print()
print(f"Reduction factor: {total_gb / total_gb_small:.0f}x smaller")
EOF
```

Repository: Comfy-Org/ComfyUI
Length of output: 252
🏁 Script executed:
```shell
# Check how detect_unet_config and model_config_from_unet_config use the state dict
rg -A 10 "def detect_unet_config" --type py
```

Repository: Comfy-Org/ComfyUI
Length of output: 1030
🏁 Script executed:
```shell
# Search for how these detection functions inspect the state dict
rg "double_blocks|single_blocks|img_attn" comfy/ --type py -A 2 | head -100
```

Repository: Comfy-Org/ComfyUI
Length of output: 7348
🏁 Script executed:
```shell
# Check the git diff to see if these test functions are newly added
git diff HEAD~1 tests-unit/comfy_test/model_detection_test.py 2>/dev/null | head -150 || echo "Git diff not available, checking if file is new"
```

Repository: Comfy-Org/ComfyUI
Length of output: 43
🏁 Script executed:
```shell
# Check when this test file was created
git log --follow --format="%H %ad %s" -- tests-unit/comfy_test/model_detection_test.py 2>/dev/null | head -5 || echo "No git history"
```

Repository: Comfy-Org/ComfyUI
Length of output: 180
🏁 Script executed:
```shell
# Let's check the actual detection logic more carefully - see if shapes/values are read
rg "state_dict\[" comfy/model_detection.py -A 1 -B 1 | head -150
```

Repository: Comfy-Org/ComfyUI
Length of output: 7974
🏁 Script executed:
```shell
# Verify the detection logic doesn't hardcode H values
rg "3072|4096" comfy/model_detection.py
```

Repository: Comfy-Org/ComfyUI
Length of output: 438
🏁 Script executed:
```shell
# Check if there are any shape-based assertions or value comparisons for LongCat/Flux detection
rg -A 5 "LongCat|FluxSchnell" comfy/supported_models.py | head -50
```

Repository: Comfy-Org/ComfyUI
Length of output: 1759
🏁 Script executed:
```shell
# Verify the detection logic reads shapes from state_dict tensors
rg "in_channels|context_in_dim|hidden_size" comfy/model_detection.py | grep "state_dict\[" | head -20
```

Repository: Comfy-Org/ComfyUI
Length of output: 1562
🏁 Script executed:
```shell
# Check if detection has any hardcoded dimension checks for LongCat specifically
rg "3584|LongCat" comfy/model_detection.py -B 3 -A 3
```

Repository: Comfy-Org/ComfyUI
Length of output: 495
Reduce test tensor sizes to avoid multi-GB allocations.
These fixtures allocate several gigabytes of tensors (H=3072 with 19+38 blocks ~4.16 GB total), which can cause CI failures. The detection logic only inspects tensor shapes and key presence, not values—reducing H to 32 while keeping C_CTX at 3584 preserves correctness since context_in_dim is read from txt_in.weight.shape[1].
🧩 Suggested change

```diff
 def _make_longcat_comfyui_sd():
     """Minimal ComfyUI-format state dict for pre-converted LongCat-Image weights."""
     sd = {}
-    H = 3072
+    H = 32  # keep tests lightweight; only shapes and keys are used
     C_IN = 16
     C_CTX = 3584
```

```diff
 def _make_flux_schnell_comfyui_sd():
     """Minimal ComfyUI-format state dict for standard Flux Schnell."""
     sd = {}
-    H = 3072
+    H = 32  # keep tests lightweight; only shapes and keys are used
     C_IN = 16
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests-unit/comfy_test/model_detection_test.py` around lines 8 - 36, The
fixture _make_longcat_comfyui_sd allocates very large tensors (H=3072) causing
multi-GB memory usage; change H to a small value (e.g., 32) while keeping C_CTX
at 3584 so txt_in.weight.shape[1] still reflects the real context dim, and keep
other shape formulas (C_IN, C_CTX, and all uses like "img_in.weight",
"txt_in.weight", "time_in.*", "final_layer.*", and blocks in "double_blocks.*"
and "single_blocks.*") unchanged so the detection logic that reads tensor shapes
and key presence continues to work but without large memory allocations.
310924a to 4b6fe40 (Compare)
I've removed the change in
♻️ Duplicate comments (3)
tests-unit/comfy_test/model_detection_test.py (2)
7-36: ⚠️ Potential issue | 🟠 Major
Reduce `H` to avoid multi-GB allocations in CI (still unaddressed).
With `H = 3072`, `_make_longcat_comfyui_sd` allocates well over 4 GB of float32 tensors (the 19 double blocks alone account for 19 × (3×3072²) ≈ 537M elements). Detection logic only inspects `.shape` and key presence; `H = 32` works identically because `context_in_dim` is inferred from `txt_in.weight.shape[1] = C_CTX = 3584`, which is unchanged.
🔧 Proposed fix
```diff
 def _make_longcat_comfyui_sd():
     """Minimal ComfyUI-format state dict for pre-converted LongCat-Image weights."""
     sd = {}
-    H = 3072
+    H = 32  # keep tests lightweight; only shapes and key presence are used
     C_IN = 16
     C_CTX = 3584
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/comfy_test/model_detection_test.py` around lines 7 - 36, The test helper _make_longcat_comfyui_sd creates very large tensors (H = 3072) causing multi-GB allocations; change H to a small value (e.g., 32) and update any tensor shapes that use H (all occurrences in sd keys like "img_in.weight", "img_in.bias", "txt_in.weight", "txt_in.bias", "time_in.*", "final_layer.*", and the loops creating "double_blocks.{i}..." and "single_blocks.{i}...") so the detection logic still sees correct dimensionality but with tiny allocations; leave C_CTX and loop counts unchanged so context_in_dim inference via txt_in.weight.shape[1] remains the same.
39-59: ⚠️ Potential issue | 🟠 Major
Same `H = 3072` allocation issue in `_make_flux_schnell_comfyui_sd`.
Same fix applies; `context_in_dim` is read from `txt_in.weight.shape[1] = 4096`, independent of `H`.
🔧 Proposed fix
```diff
 def _make_flux_schnell_comfyui_sd():
     """Minimal ComfyUI-format state dict for standard Flux Schnell."""
     sd = {}
-    H = 3072
+    H = 32  # keep tests lightweight; only shapes and key presence are used
     C_IN = 16
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests-unit/comfy_test/model_detection_test.py` around lines 39 - 59, The test helper _make_flux_schnell_comfyui_sd hardcodes txt_in.weight with shape (H, 4096) which conflates H with context dimension; introduce a separate variable (e.g., CONTEXT_IN = 4096) and allocate txt_in.weight as torch.empty(H, CONTEXT_IN) (and use CONTEXT_IN wherever the code should reflect the context/input embedding width), leaving H = 3072 for channel/hidden sizes; this ensures context_in_dim is read correctly from txt_in.weight.shape[1] and avoids mixing H and context dimensions.
comfy/text_encoders/longcat_image.py (1)
137-143: ⚠️ Potential issue | 🟠 Major
The `template_end == -1` guard fires too late; the `+3` check can accidentally fire on index 0/1 (still unaddressed).
When no `<|im_start|>` (151644) token is found, `template_end` stays `-1` after the loop. The block at Lines 137–140 then evaluates `out.shape[1] > 2` (almost always `True`) and accidentally inspects `tok_pairs[0]` and `tok_pairs[1]` (because `-1 + 1 = 0` and `-1 + 2 = 1`). If those tokens happen to be 872 and 198, `template_end` becomes `2` and the guard at Line 142 is bypassed, causing `out[:, 2:]` to silently discard the first two tokens.
+3newline adjustment whentemplate_endwas actually set by the loop:🛠️ Proposed fix
- if out.shape[1] > (template_end + 3): - if tok_pairs[template_end + 1][0] == 872: - if tok_pairs[template_end + 2][0] == 198: - template_end += 3 - - if template_end == -1: - template_end = 0 + if template_end == -1: + template_end = 0 + elif out.shape[1] > (template_end + 3): + if tok_pairs[template_end + 1][0] == 872: + if tok_pairs[template_end + 2][0] == 198: + template_end += 3🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@comfy/text_encoders/longcat_image.py` around lines 137 - 143, The post-loop "+3" adjustment currently runs even when template_end is still -1 and can index tok_pairs[0/1]; change the logic so the checks that inspect tok_pairs[template_end + 1] and tok_pairs[template_end + 2] only run when template_end != -1 (i.e., the loop actually found the <|im_start|> marker). Concretely, wrap the entire if-block that tests out.shape and tok_pairs[...] with a guard like "if template_end != -1 and out.shape[1] > (template_end + 3):", leaving the existing fallback that sets template_end = 0 after that. This ensures tok_pairs and template_end adjustments only occur when template_end was set by the earlier search.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@comfy/text_encoders/longcat_image.py`:
- Around line 137-143: The post-loop "+3" adjustment currently runs even when
template_end is still -1 and can index tok_pairs[0/1]; change the logic so the
checks that inspect tok_pairs[template_end + 1] and tok_pairs[template_end + 2]
only run when template_end != -1 (i.e., the loop actually found the <|im_start|>
marker). Concretely, wrap the entire if-block that tests out.shape and
tok_pairs[...] with a guard like "if template_end != -1 and out.shape[1] >
(template_end + 3):", leaving the existing fallback that sets template_end = 0
after that. This ensures tok_pairs and template_end adjustments only occur when
template_end was set by the earlier search.
In `@tests-unit/comfy_test/model_detection_test.py`:
- Around line 7-36: The test helper _make_longcat_comfyui_sd creates very large
tensors (H = 3072) causing multi-GB allocations; change H to a small value
(e.g., 32) and update any tensor shapes that use H (all occurrences in sd keys
like "img_in.weight", "img_in.bias", "txt_in.weight", "txt_in.bias",
"time_in.*", "final_layer.*", and the loops creating "double_blocks.{i}..." and
"single_blocks.{i}...") so the detection logic still sees correct dimensionality
but with tiny allocations; leave C_CTX and loop counts unchanged so
context_in_dim inference via txt_in.weight.shape[1] remains the same.
- Around line 39-59: The test helper _make_flux_schnell_comfyui_sd hardcodes
txt_in.weight with shape (H, 4096) which conflates H with context dimension;
introduce a separate variable (e.g., CONTEXT_IN = 4096) and allocate
txt_in.weight as torch.empty(H, CONTEXT_IN) (and use CONTEXT_IN wherever the
code should reflect the context/input embedding width), leaving H = 3072 for
channel/hidden sizes—this ensures context_in_dim is read correctly from
txt_in.weight.shape[1] and avoids mixing H and context dimensions.
ℹ️ Review info
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- blueprints/Text to Image (LongCat-Image).json
- comfy/model_detection.py
- comfy/supported_models.py
- comfy/text_encoders/longcat_image.py
- comfy_extras/nodes_longcat_image.py
- tests-unit/comfy_test/model_detection_test.py
🚧 Files skipped from review as they are similar to previous changes (2)
- comfy/model_detection.py
- blueprints/Text to Image (LongCat-Image).json
LongCat-Image ComfyUI Port
Adds native support for LongCat-Image, a Flux-based text-to-image model by Meituan, to ComfyUI.
Architecture
LongCat-Image is a Flux variant with MRoPE positional shifts of (t=1.0, y=512.0, x=512.0), a 3584-dim text context with no `vector_in` layer, and no guidance embedding.
Key implementation details
Pre-converted weights
The original LongCat-Image weights use HuggingFace Diffusers key names. ComfyUI requires pre-converted weights in its native Flux format. Standalone `download_original.sh` and `convert_original_to_comfy.py` scripts (hosted alongside the weights in the Comfy-Org HF repo) perform the one-time conversion:
- Key renames: `x_embedder` → `img_in`, `context_embedder` → `txt_in`, `transformer_blocks` → `double_blocks`, `single_transformer_blocks` → `single_blocks`
- Fusing the separate Q/K/V projections into single `qkv` weights
- Swapping the halves of the `norm_out.linear` weights: HuggingFace's `AdaLayerNormContinuous` stores `[scale | shift]` while ComfyUI's `LastLayer` expects `[shift | scale]`

Pre-converting avoids runtime `torch.cat` allocations, enabling ComfyUI's zero-copy-from-disk memory mapping where tensors are referenced directly from the safetensors file without loading into RAM.
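A minimal sketch of this kind of Diffusers-to-ComfyUI repack, assuming the key mapping and the `[scale | shift]` layout described above (the function and its details are illustrative, not the actual conversion script; biases and the text/single-block attention weights would follow the same pattern):

```python
import torch
from safetensors.torch import load_file, save_file

def convert_longcat_to_comfy(src_path: str, dst_path: str) -> None:
    src = load_file(src_path)
    out = {}
    for k, v in src.items():
        # Rename single_transformer_blocks before transformer_blocks,
        # since the latter is a substring of the former.
        k = k.replace("single_transformer_blocks.", "single_blocks.")
        k = k.replace("transformer_blocks.", "double_blocks.")
        k = k.replace("x_embedder.", "img_in.")
        k = k.replace("context_embedder.", "txt_in.")
        out[k] = v

    # Fuse separate Q/K/V projections into the single qkv tensor Flux expects.
    for i in range(19):  # double-block count taken from the test fixtures above
        base = f"double_blocks.{i}."
        qkv = [out.pop(base + f"attn.to_{n}.weight") for n in ("q", "k", "v")]
        out[base + "img_attn.qkv.weight"] = torch.cat(qkv, dim=0)

    # AdaLayerNormContinuous stores [scale | shift]; LastLayer wants [shift | scale].
    w = out.pop("norm_out.linear.weight")
    scale, shift = w.chunk(2, dim=0)
    out["final_layer.adaLN_modulation.1.weight"] = torch.cat([shift, scale], dim=0)

    save_file(out, dst_path)
```

Because every remaining tensor is then stored exactly as the runtime wants it, loading reduces to memory-mapping the file; no `torch.cat` copies are needed at load time.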
Model detection
Pre-converted weights go through the standard Flux detection path. LongCat-Image is distinguished from other Flux variants by a heuristic at the end of Flux detection: `context_in_dim == 3584` (from the `txt_in.weight` shape) and `vec_in_dim is None` (no `vector_in` layer). This sets `txt_ids_dims = [1, 2]`, matching the `LongCatImage` config. The detection algorithm in `model_config_from_unet_config` selects the most specific match (highest `unet_config` key count) rather than first match, so `LongCatImage` (5 config keys) always wins over `FluxSchnell` (2 config keys) regardless of list order.
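An illustrative sketch of that heuristic, assuming a `dit_config` dict already filled by the generic Flux detection (shape of the logic only, not the exact source):

```python
# After generic Flux detection has populated dit_config from tensor shapes:
if dit_config.get("context_in_dim") == 3584 and dit_config.get("vec_in_dim") is None:
    # Qwen-sized text context and no vector_in layer -> LongCat-Image.
    dit_config["txt_ids_dims"] = [1, 2]  # matched against the LongCatImage config
```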
Tokenizer
`LongCatImageBaseTokenizer` applies the Qwen2.5 chat template, handles character-level tokenization for quoted text via `split_quotation`, and pads to a fixed `max_length=512` to match the expected input format.
CFG renormalization
The `CFGRenormLongCatImage` node applies per-patch CFG renormalization via `sampler_post_cfg_function`. It reshapes to Flux's packed patch format, computes per-patch L2 norms, clamps the scale factor, and reshapes back.
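A self-contained sketch of the pack, norm, clamp, unpack mechanics just described (a simplified illustration operating on a generic tensor pair; the real node hooks `sampler_post_cfg_function` and its argument names may differ):

```python
import torch

def cfg_renorm_per_patch(denoised, cond_denoised, ps=2, clamp_max=1.0):
    # Pack NCHW latents into Flux-style (B, num_patches, C*ps*ps) rows.
    B, C, H, W = denoised.shape

    def pack(x):
        x = x.reshape(B, C, H // ps, ps, W // ps, ps)
        return x.permute(0, 2, 4, 1, 3, 5).reshape(B, -1, C * ps * ps)

    d, c = pack(denoised), pack(cond_denoised)

    # Per-patch L2 norms; the clamp lets renormalization shrink a patch
    # back toward the conditional magnitude but never amplify it.
    d_norm = torch.linalg.vector_norm(d, dim=-1, keepdim=True)
    c_norm = torch.linalg.vector_norm(c, dim=-1, keepdim=True)
    d = d * (c_norm / d_norm.clamp(min=1e-8)).clamp(max=clamp_max)

    # Unpack back to NCHW.
    d = d.reshape(B, H // ps, W // ps, C, ps, ps).permute(0, 3, 1, 4, 2, 5)
    return d.reshape(B, C, H, W)
```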
No guidance embedding
Unlike standard Flux, LongCat-Image does not use a guidance conditioning tensor. `LongCatImage.extra_conds` removes the `guidance` key.
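Schematically, that can be as simple as dropping the key after delegating to the Flux base class (an illustration of the described behavior, not the actual implementation):

```python
from comfy.model_base import Flux

class LongCatImage(Flux):
    def extra_conds(self, **kwargs):
        out = super().extra_conds(**kwargs)
        out.pop("guidance", None)  # LongCat-Image has no guidance embedding
        return out
```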
Known differences from HuggingFace
- Text-encoder precision: the HuggingFace reference pipeline runs the text encoder in reduced precision, which rounds pad token embeddings to identical vectors. ComfyUI runs in float32, preserving small differences from causal attention and RoPE, so each pad position gets a slightly different vector. This does not affect output quality since the attention mask zeros out pad tokens during the diffusion transformer.
- Sigma schedule: the HuggingFace pipeline uses `FlowMatchEulerDiscreteScheduler` with dynamic shifting (`use_dynamic_shifting=True`), computing a `mu` parameter via linear interpolation based on image sequence length. ComfyUI's `ModelSamplingFlux` uses a static `shift=1.15` with `flux_time_shift`, producing a slightly different sigma schedule for the same number of steps.
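To make the schedule difference concrete, here is a small sketch of the two shifting strategies, assuming the exponential time-shift form used by Flux-style flow matching and a linear `mu` interpolation like HuggingFace's; the constants are typical defaults, not values verified against either codebase:

```python
import math

def time_shift(mu: float, t: float, sigma: float = 1.0) -> float:
    # Exponential time shift: warps a uniform timestep t in (0, 1] into a sigma.
    return math.exp(mu) / (math.exp(mu) + (1 / t - 1) ** sigma)

def dynamic_mu(image_seq_len: int, base_len: int = 256, max_len: int = 4096,
               base_shift: float = 0.5, max_shift: float = 1.15) -> float:
    # Linear interpolation of mu by packed image sequence length.
    m = (max_shift - base_shift) / (max_len - base_len)
    return base_shift + m * (image_seq_len - base_len)

for i in range(4):
    t = 1.0 - i / 4
    static = time_shift(1.15, t)               # static shift, ComfyUI-style
    dynamic = time_shift(dynamic_mu(1024), t)  # mu depends on image size, HF-style
    print(f"t={t:.2f}  static={static:.3f}  dynamic={dynamic:.3f}")
```

For smaller images the dynamic `mu` drops below the static value, so the two schedules spend their steps at different noise levels even when the step count matches.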
Files
- comfy/supported_models.py: `LongCatImage` config and detection matching
- comfy/model_base.py: `LongCatImage` model class with MRoPE shifts
- comfy/model_detection.py
- comfy/text_encoders/longcat_image.py
- comfy_extras/nodes_longcat_image.py: `CLIPTextEncodeLongCatImage`, `CFGRenormLongCatImage` nodes
- user_templates/longcat_image_t2i.json
- blueprints/Text to Image (LongCat-Image).json
- tests-unit/comfy_test/model_detection_test.py