Flux.2 with Lora error: "mul_cuda" not implemented for 'Float8_e4m3fn'

### Custom Node Testing

- [ ] I have tried disabling custom nodes and the issue persists (see [how to disable custom nodes](https://docs.comfy.org/troubleshooting/custom-node-issues#step-1%3A-test-with-all-custom-nodes-disabled) if you need help)

### Expected Behavior

The image should be generated with the LoRA applied.

### Actual Behavior

When I add a "Lora loader only" node after the Load Diffusion Model node, I get this error:
"mul_cuda" not implemented for 'Float8_e4m3fn'

### Steps to Reproduce

The unmodified example workflow runs without errors: https://comfyanonymous.github.io/ComfyUI_examples/flux2/

Adding the model-only LoRA loader node after the Load Diffusion Model node in that workflow triggers the error.
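For anyone reproducing this through the API instead of the UI, the change can be sketched as a small edit to an API-format workflow. This is only an illustration: it assumes the loader is a `UNETLoader` node and the LoRA node is `LoraLoaderModelOnly`, the file names are placeholders, and `insert_lora_after` is a hypothetical helper, not part of ComfyUI.

```python
import json


def insert_lora_after(workflow: dict, loader_id: str, lora_name: str,
                      strength: float = 1.0) -> str:
    """Insert a model-only LoRA node after `loader_id` and rewire its consumers.

    Assumes node ids are numeric strings and that output 0 of the loader
    is the MODEL output.
    """
    lora_id = str(max(int(key) for key in workflow) + 1)

    # Point every input that used the loader's MODEL output at the LoRA node.
    for node in workflow.values():
        for name, value in node["inputs"].items():
            if value == [loader_id, 0]:
                node["inputs"][name] = [lora_id, 0]

    # The LoRA node itself still reads the model from the original loader.
    workflow[lora_id] = {
        "class_type": "LoraLoaderModelOnly",
        "inputs": {
            "model": [loader_id, 0],
            "lora_name": lora_name,
            "strength_model": strength,
        },
    }
    return lora_id


# Minimal fragment; other nodes and inputs of the real workflow are omitted.
workflow = {
    "1": {"class_type": "UNETLoader",
          "inputs": {"unet_name": "placeholder_flux2_model.safetensors",
                     "weight_dtype": "default"}},
    "2": {"class_type": "BasicGuider", "inputs": {"model": ["1", 0]}},
}

new_id = insert_lora_after(workflow, "1", "placeholder_lora.safetensors")
print(json.dumps(workflow, indent=2))
```

With the LoRA node in the chain, the weights are patched when the model is loaded, which is where the traceback below originates.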

### Debug Logs

```powershell
Using pytorch attention in VAE
Using pytorch attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
[MultiGPU Core Patching] text_encoder_device_patched returning device: cuda:0 (current_text_encoder_device=cuda:0)
Using MixedPrecisionOps for text encoder: 210 quantized layers
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load Flux2TEModel_
loaded completely; 30385.05 MB usable, 17180.59 MB loaded, full load: True
Found quantization metadata (version 1.0)
Detected mixed precision quantization: 128 layers quantized
Using mixed precision operations: 128 quantized layers
model weight dtype torch.float8_e4m3fn, manual cast: torch.bfloat16
model_type FLUX
Warning: TAESD previews enabled, but could not find models/vae_approx/None
Requested to load Flux2
QuantizedTensor: Unhandled operation aten.add_.Tensor, falling back to dequantization. kwargs={}
ERROR lora diffusion_model.single_blocks.9.linear1.weight Promotion for Float8 Types is not supported, attempted to promote BFloat16 and Float8_e4m3fn
QuantizedTensor: Unhandled operation aten.slice.Tensor, falling back to dequantization. kwargs={}
!!! Exception during processing !!! "mul_cuda" not implemented for 'Float8_e4m3fn'
Traceback (most recent call last):
  File "/home/ubuntuai/ComfyUI/execution.py", line 510, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "/home/ubuntuai/ComfyUI/execution.py", line 324, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, hidden_inputs=hidden_inputs)
  File "/home/ubuntuai/ComfyUI/execution.py", line 298, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "/home/ubuntuai/ComfyUI/execution.py", line 286, in process_inputs
    result = f(**inputs)
  File "/home/ubuntuai/ComfyUI/comfy_extras/nodes_custom_sampler.py", line 835, in sample
    samples = guider.sample(noise.generate_noise(latent), latent_image, sampler, sigmas, denoise_mask=noise_mask, callback=callback, disable_pbar=disable_pbar, seed=noise.seed)
  File "/home/ubuntuai/ComfyUI/comfy/samplers.py", line 1035, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
  File "/home/ubuntuai/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "/home/ubuntuai/ComfyUI/comfy/samplers.py", line 984, in outer_sample
    self.inner_model, self.conds, self.loaded_models = comfy.sampler_helpers.prepare_sampling(self.model_patcher, noise.shape, self.conds, self.model_options)
  File "/home/ubuntuai/ComfyUI/comfy/sampler_helpers.py", line 130, in prepare_sampling
    return executor.execute(model, noise_shape, conds, model_options=model_options)
  File "/home/ubuntuai/ComfyUI/comfy/patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
  File "/home/ubuntuai/ComfyUI/comfy/sampler_helpers.py", line 138, in _prepare_sampling
    comfy.model_management.load_models_gpu([model] + models, memory_required=memory_required + inference_memory, minimum_memory_required=minimum_memory_required + inference_memory)
  File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 701, in load_models_gpu
    loaded_model.model_load(lowvram_model_memory, force_patch_weights=force_patch_weights)
  File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 506, in model_load
    self.model_use_more_vram(use_more_vram, force_patch_weights=force_patch_weights)
  File "/home/ubuntuai/ComfyUI/comfy/model_management.py", line 536, in model_use_more_vram
    return self.model.partially_load(self.device, extra_memory, force_patch_weights=force_patch_weights)
  File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 944, in partially_load
    raise e
  File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 941, in partially_load
    self.load(device_to, lowvram_model_memory=current_used + extra_memory, force_patch_weights=force_patch_weights, full_load=full_load)
  File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 754, in load
    self.patch_weight_to_device(key, device_to=device_to)
  File "/home/ubuntuai/ComfyUI/comfy/model_patcher.py", line 630, in patch_weight_to_device
    out_weight = comfy.float.stochastic_rounding(out_weight, weight.dtype, seed=string_to_seed(key))
  File "/home/ubuntuai/ComfyUI/comfy/float.py", line 64, in stochastic_rounding
    output[i:i+slice_size].copy_(manual_stochastic_round_to_float8(value[i:i+slice_size], dtype, generator=generator))
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 216, in __torch_dispatch__
    return cls._dequant_and_fallback(func, args, kwargs)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 227, in _dequant_and_fallback
    new_args = dequant_arg(args)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 224, in dequant_arg
    return type(arg)(dequant_arg(a) for a in arg)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 224, in <genexpr>
    return type(arg)(dequant_arg(a) for a in arg)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 222, in dequant_arg
    return arg.dequantize()
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 196, in dequantize
    return LAYOUTS[self._layout_type].dequantize(self._qdata, **self._layout_params)
  File "/home/ubuntuai/ComfyUI/comfy/quant_ops.py", line 421, in dequantize
    return plain_tensor * scale
RuntimeError: "mul_cuda" not implemented for 'Float8_e4m3fn'
```
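The final frame of the traceback, `return plain_tensor * scale`, multiplies the stored FP8 payload by its scale, and this PyTorch build reports no CUDA multiply kernel for that dtype. As a rough illustration of what that dequantization is meant to compute, here is a pure-Python sketch. It assumes the standard E4M3FN layout (1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7, no infinities) and a single per-tensor scale; `decode_e4m3fn` and `dequantize` are hypothetical names, not ComfyUI functions.

```python
def decode_e4m3fn(byte: int) -> float:
    """Decode one E4M3FN byte into a Python float (assumed standard layout)."""
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0x0F
    mantissa = byte & 0x07

    # E4M3FN has no infinities; only the all-ones pattern encodes NaN.
    if exponent == 0x0F and mantissa == 0x07:
        return float("nan")
    # Subnormal numbers: no implicit leading 1, fixed exponent of -6.
    if exponent == 0:
        return sign * (mantissa / 8.0) * 2.0 ** -6
    # Normal numbers: implicit leading 1, exponent bias of 7.
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - 7)


def dequantize(payload: bytes, scale: float) -> list:
    """Widen each FP8 value first, then apply the scale in full precision."""
    return [decode_e4m3fn(b) * scale for b in payload]


# 0x38 encodes 1.0, 0x40 encodes 2.0, 0xB8 encodes -1.0.
print(dequantize(bytes([0x38, 0x40, 0xB8]), scale=0.5))
```

In PyTorch terms this would correspond to casting the FP8 tensor to a wider dtype such as bfloat16 before multiplying. Whether that is the right fix for this code path is for the maintainers to confirm.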

### Other

The LoRA was trained with ai-toolkit. Samples generated by ai-toolkit itself work fine.

Tested on an RTX 5090 and an RTX PRO 6000 with driver version 575.57.08 and CUDA version 12.9.
Torch version in the Python venv: torch 2.8.0.dev20250415+cu128
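
In case it helps others compare setups, the details above can be collected with a short script run inside the same venv. This is a sketch under assumptions: `collect_env` is a hypothetical helper, and the `fp8_mul` probe only checks whether an elementwise multiply on a `float8_e4m3fn` tensor raises in the installed build.

```python
import platform


def collect_env() -> dict:
    """Gather the version details that seem relevant to this FP8 error."""
    info = {"python": platform.python_version(), "torch": None}
    try:
        import torch
    except ImportError:
        return info  # torch is not installed in this interpreter

    info["torch"] = torch.__version__
    info["cuda_build"] = torch.version.cuda  # CUDA version torch was built against
    info["cuda_available"] = torch.cuda.is_available()
    if info["cuda_available"]:
        info["gpu"] = torch.cuda.get_device_name(0)
        info["capability"] = torch.cuda.get_device_capability(0)

    # Probe whether multiplying FP8 tensors directly works in this build.
    fp8 = getattr(torch, "float8_e4m3fn", None)
    if fp8 is not None:
        device = "cuda" if info["cuda_available"] else "cpu"
        try:
            x = torch.ones(4, device=device).to(fp8)
            _ = x * x
            info["fp8_mul"] = True
        except RuntimeError:
            info["fp8_mul"] = False
    return info


if __name__ == "__main__":
    for key, value in collect_env().items():
        print(f"{key}: {value}")
```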

Flux.2 with Lora error: "mul_cuda" not implemented for 'Float8_e4m3fn' #10910

Custom Node Testing

Expected Behavior

Actual Behavior

Steps to Reproduce

Debug Logs

Other

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Flux.2 with Lora error: "mul_cuda" not implemented for 'Float8_e4m3fn' #10910

Description

Custom Node Testing

Expected Behavior

Actual Behavior

Steps to Reproduce

Debug Logs

Other

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions