Flux.2 with Lora error: "mul_cuda" not implemented for 'Float8_e4m3fn' #10910

@RodriMora

Custom Node Testing

Expected Behavior

The image should generate with the LoRA applied.

Actual Behavior

When I add a "Lora loader only" node after the Load Diffusion Model node, I get this error:
"mul_cuda" not implemented for 'Float8_e4m3fn'

Steps to Reproduce

The example workflow at https://comfyanonymous.github.io/ComfyUI_examples/flux2/ works as-is.

Adding a "Lora Model Loader only" node after the Load Diffusion Model node in that workflow gives the error.

Debug Logs

Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
Using MixedPrecisionOps for text encoder: 210 quantized layers
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded completely; 30385.05 MB usable, 17180.59 MB loaded, full load: True
Found quantization metadata (version 1.0)
Detected mixed precision quantization: 128 layers quantized
Using mixed precision operations: 128 quantized layers
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Warning: TAESD previews enabled, but could not find models/vae_approx/None
Requested to load Flux2
QuantizedTensor: Unhandled operation aten.add_.Tensor, falling back to dequantization. kwargs={}
ERROR lora diffusion_model.single_blocks.9.linear1.weight Promotion for Float8 Types is not supported, attempted to promote BFloat16 and Float8_e4m3fn
QuantizedTensor: Unhandled operation aten.slice.Tensor, falling back to dequantization. kwargs={}
!!! Exception during processing !!! "mul_cuda" not implemented for 'Float8_e4m3fn'
Traceback (most recent call last):
  File "/home/ubuntuai/ComfyUI/execution.py", line 510, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "/home/ubuntuai/ComfyUI/execution.py", line 324, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "/home/ubuntuai/ComfyUI/execution.py", line 298, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/home/ubuntuai/ComfyUI/execution.py", line 286, in process_inputs
    result = f(**inputs)
  File "/home/ubuntuai/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 835, in sample
    samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
  File "/home/ubuntuai/ComfyUI/comfy/samplers.py", line 1035, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
  File "/home/ubuntuai/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "/home/ubuntuai/ComfyUI/comfy/samplers.py", line 984, in outer_sample
    self.inner_model, self.conds, self.loaded_models = comfy.sampler_helpers.prepare_sampling(self.model_patcher, noise.shape, self.conds, self.model_options)
  File "/home/ubuntuai/ComfyUI/comfy/sampler_helpers.py", line 130, in prepare_sampling
    return executor.execute(model, noise_shape, conds, model_options=model_options)
  File "/home/ubuntuai/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "/home/ubuntuai/ComfyUI/comfy/sampler_helpers.py", line 138, in _prepare_sampling
    comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required + inference_memory, minimum_memory_required=minimum_memory_required + inference_memory)
  File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 701, in load_models_gpu
    loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
  File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 506, in model_load
    self.model_use_more_vram(use_more_vram, force_patch_weights=force_patch_weights)
  File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 536, in model_use_more_vram
    return self.model.partially_load(self.device, extra_memory, force_patch_weights=force_patch_weights)
  File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 944, in partially_load
    raise e
  File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 941, in partially_load
    self.load(device_to, lowvram_model_memory=current_used + extra_memory, force_patch_weights=force_patch_weights, full_load=full_load)
  File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 754, in load
    self.patch_weight_to_device(key, device_to=device_to)
  File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 630, in patch_weight_to_device
    out_weight = comfy.float.stochastic_rounding(out_weight, weight.dtype, seed=string_to_seed(key))
  File "/home/ubuntuai/ComfyUI/comfy/float.py", line 64, in stochastic_rounding
    output[i:i+slice_size].copy_(manual_stochastic_round_to_float8(value[i:i+slice_size], dtype, generator=generator))
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 216, in __torch_dispatch__
    return cls._dequant_and_fallback(func, args, kwargs)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 227, in _dequant_and_fallback
    new_args = dequant_arg(args)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 224, in dequant_arg
    return type(arg)(dequant_arg(a) for a in arg)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 224, in <genexpr>
    return type(arg)(dequant_arg(a) for a in arg)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 222, in dequant_arg
    return arg.dequantize()
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 196, in dequantize
    return LAYOUTS[self._layout_type].dequantize(self._qdata, **self._layout_params)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 421, in dequantize
    return plain_tensor * scale
RuntimeError: "mul_cuda" not implemented for 'Float8_e4m3fn'

Other

The LoRA was made with ai-toolkit. The samples generated by ai-toolkit work fine.

Tested on an RTX 5090 and an RTX PRO 6000 with driver version 575.57.08 and CUDA version 12.9.
Torch version in the Python venv: torch 2.8.0.dev20250415+cu128

Metadata

    Labels

    Potential Bug: User is reporting a bug. This should be tested.
