@dxqb dxqb commented Oct 16, 2025

Even if the `qweight_type` is one of the `UNQUANTIZED_TYPES`, `qweight` still has to be "dequantized" because it is stored as an 8-bit tensor. Without this step, the following matmul fails with a shape mismatch.

Side notes:

  • Why isn't DIFFUSERS_GGUF_CUDA_KERNELS on by default? It's significantly faster and only used when installed.
  • https://huggingface.co/Isotr0py/ggml/tree/main/build has no build for torch 2.8 (or the upcoming 2.9). Who can we contact to make such a build?
  • Thank you for this GGUF implementation! With torch 2.8 and torch.compile it is fast even without specialized kernels. torch 2.7 fails to compile the native dequantization code for some reason, so I couldn't directly compare torch.compile with the custom kernel.

Who can review?

@DN6 @Isotr0py
@Isotr0py (Contributor)

> https://huggingface.co/Isotr0py/ggml/tree/main/build has no build for torch 2.8 (or the upcoming 2.9). Who can we contact to make such a build?

In fact, there is a pre-release build for torch 2.8 (https://huggingface.co/Isotr0py/ggml/tree/shmem-mmq/build), but I found a regression in kernel size and performance in these kernels.

Anyway, I have found the root cause of the regression and am fixing it, and I will make a release with torch 2.8 and 2.9 support tonight.

Comment on lines +82 to +83
weight = dequantize_gguf_tensor(qweight)
return x @ weight.T
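
To illustrate why the decode step matters for the matmul, here is a minimal, dependency-free sketch. It uses plain Python lists and `struct` rather than torch, and the helper names (`floats_to_bf16_bytes`, `bf16_bytes_to_floats`) are hypothetical, not part of diffusers or gguf. It assumes little-endian bf16 storage and only shows that a bf16 weight kept as raw bytes has a last dimension twice as long as its logical shape.

```python
import struct

def floats_to_bf16_bytes(values):
    """Encode floats as bf16 by truncating the float32 bit pattern (no rounding)."""
    out = bytearray()
    for v in values:
        (bits,) = struct.unpack("<I", struct.pack("<f", v))
        out += struct.pack("<H", bits >> 16)
    return bytes(out)

def bf16_bytes_to_floats(raw):
    """Decode little-endian bf16: each value is the top 16 bits of a float32."""
    values = []
    for i in range(0, len(raw), 2):
        (bits,) = struct.unpack_from("<H", raw, i)
        values.append(struct.unpack("<f", struct.pack("<I", bits << 16))[0])
    return values

# Logical weight of shape (out_features=2, in_features=3).
# All values are exactly representable in bf16.
weight = [[1.0, 2.0, 3.0], [0.5, -1.0, 4.0]]

# 8-bit storage: every row becomes 6 bytes instead of 3 elements.
stored = [floats_to_bf16_bytes(row) for row in weight]
stored_shape = (len(stored), len(stored[0]))

# Decoding restores the logical (2, 3) shape that x @ weight.T expects.
decoded = [bf16_bytes_to_floats(row) for row in stored]

x = [1.0, 1.0, 1.0]
y = [sum(a * b for a, b in zip(x, row)) for row in decoded]
```

With a 3-element `x`, multiplying against the 6-byte rows directly would be the shape mismatch described in the PR; after decoding, the row lengths match again.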

@Isotr0py Isotr0py Oct 17, 2025

It seems `dequantize_gguf_tensor` is missing an implementation for FP16 and FP32 qweight:

dequantize_functions = {
    gguf.GGMLQuantizationType.IQ4_NL: dequantize_blocks_IQ4_NL,
    gguf.GGMLQuantizationType.IQ4_XS: dequantize_blocks_IQ4_XS,
    gguf.GGMLQuantizationType.BF16: dequantize_blocks_BF16,
    gguf.GGMLQuantizationType.Q8_0: dequantize_blocks_Q8_0,
    gguf.GGMLQuantizationType.Q5_1: dequantize_blocks_Q5_1,
    gguf.GGMLQuantizationType.Q5_0: dequantize_blocks_Q5_0,
    gguf.GGMLQuantizationType.Q4_1: dequantize_blocks_Q4_1,
    gguf.GGMLQuantizationType.Q4_0: dequantize_blocks_Q4_0,
    gguf.GGMLQuantizationType.Q6_K: dequantize_blocks_Q6_K,
    gguf.GGMLQuantizationType.Q5_K: dequantize_blocks_Q5_K,
    gguf.GGMLQuantizationType.Q4_K: dequantize_blocks_Q4_K,
    gguf.GGMLQuantizationType.Q3_K: dequantize_blocks_Q3_K,
    gguf.GGMLQuantizationType.Q2_K: dequantize_blocks_Q2_K,
}

@dxqb (Author)

It was bf16 in my use case.

fp16 and fp32 would fail in any case, whether the native path or the dequant kernels are used.
This PR therefore currently only fixes the bf16 case for kernel dequant; the native path already works for bf16.
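
For illustration only, the sketch below shows one way a type-to-function lookup table could carry entries for the two types said to be missing. This is not what the PR does and not the diffusers implementation: `QType`, `decode_f32`, `decode_f16`, `DECODERS` and `decode` are hypothetical names, `QType` stands in for `gguf.GGMLQuantizationType`, and the sketch works on raw little-endian bytes with the standard library instead of torch tensors.

```python
import struct
from enum import Enum

class QType(Enum):
    """Hypothetical stand-in for gguf.GGMLQuantizationType."""
    F32 = "F32"
    F16 = "F16"
    Q4_0 = "Q4_0"  # deliberately left out of the table below

def decode_f32(raw):
    # Reinterpret every 4 bytes as one little-endian float32.
    return list(struct.unpack(f"<{len(raw) // 4}f", raw))

def decode_f16(raw):
    # Reinterpret every 2 bytes as one little-endian IEEE half float.
    return list(struct.unpack(f"<{len(raw) // 2}e", raw))

# Lookup table in the same spirit as dequantize_functions,
# limited to the two unquantized types discussed above.
DECODERS = {
    QType.F32: decode_f32,
    QType.F16: decode_f16,
}

def decode(raw, qtype):
    """Dispatch on the type tag; fail loudly for types without an entry."""
    try:
        fn = DECODERS[qtype]
    except KeyError:
        raise NotImplementedError(f"no decoder registered for {qtype.name}")
    return fn(raw)

half_values = decode(struct.pack("<2e", 1.5, -2.0), QType.F16)
single_values = decode(struct.pack("<2f", 0.25, 8.0), QType.F32)
```

Raising `NotImplementedError` for an unregistered type is a design choice of this sketch: it surfaces a missing entry at the lookup instead of as a later shape error.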
