Conversation

@kitaekatt
Contributor

Summary

Fixes garbage output from Gemma2 GGUF models by extracting the attn_logit_softcapping parameter from GGUF metadata and patching it onto the HuggingFace config.

Root Cause

GGUF models store softcap in metadata with arch-specific keys (e.g., gemma2.attn_logit_softcapping), but this wasn't being extracted and applied to the HF config. Without softcap, the V1 FlashAttention backend uses softcap=0 (disabled), causing numerical instability and garbage output.

The attention backends ALREADY support softcap correctly - this fix is in the GGUF config mapping layer, not the attention kernels.
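
For context, logit softcapping squashes attention logits with a tanh before the softmax; Gemma2 was trained with this enabled, so running without it lets the logits grow unbounded. A minimal sketch of the formula on plain tensors (illustrative only; the real work happens inside the attention kernels):

import torch

def apply_softcap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Gemma2-style logit softcapping: squashes values into (-cap, cap).
    # With the cap missing (treated as 0, i.e. disabled), the softmax sees
    # uncapped logits and the model degenerates into garbage output.
    return cap * torch.tanh(logits / cap)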

Changes

  • Add extract_softcap_from_gguf() to read softcap from GGUF metadata
  • Update maybe_patch_hf_config_from_gguf() to apply softcap values
  • Support both attn_logit_softcapping and final_logit_softcapping
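
A rough sketch of what the two helpers do, assuming the gguf package's GGUFReader and its ReaderField layout (parts/data indices). The function names mirror the PR description, but the body is illustrative, not the exact diff:

import logging

from gguf import GGUFReader

logger = logging.getLogger(__name__)

# GGUF metadata suffix -> HF config attribute. Keys are arch-prefixed in the
# file, e.g. "gemma2.attn_logit_softcapping".
_SOFTCAP_KEYS = {
    "attn_logit_softcapping": "attn_logit_softcapping",
    "final_logit_softcapping": "final_logit_softcapping",
}

def extract_softcap_from_gguf(model_path: str) -> dict:
    """Return softcap values found in GGUF metadata, keyed by HF config attribute."""
    try:
        reader = GGUFReader(model_path)
        found = {}
        for key, field in reader.fields.items():
            for suffix, hf_attr in _SOFTCAP_KEYS.items():
                if key.endswith("." + suffix):
                    # Scalar fields keep the value in the last data part.
                    found[hf_attr] = float(field.parts[field.data[-1]][0])
        return found
    except Exception as e:
        logger.warning("Error extracting softcap from GGUF '%s': %s", model_path, e)
        return {}

def maybe_patch_hf_config_from_gguf(hf_config, model_path: str) -> None:
    """Copy any softcap values from GGUF metadata onto the HF config."""
    for hf_attr, value in extract_softcap_from_gguf(model_path).items():
        setattr(hf_config, hf_attr, value)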

Testing

Tested with: google/gemma-2-2b-it:Q4_K_M - produces coherent output instead of garbage

Related

This is part of a series of GGUF fixes for Blackwell (SM 120+) compatibility.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request correctly extracts attn_logit_softcapping and final_logit_softcapping from GGUF metadata to fix an issue with Gemma2 models producing garbage output. The changes are well-implemented. I have one suggestion to improve error reporting by changing a debug log to a warning, which will make it easier for users to diagnose issues with corrupted or misconfigured GGUF files.

Comment on lines 258 to 260
    except Exception as e:
        logger.debug("Error extracting softcap from GGUF: %s", e)
        return {}

Severity: high

The broad except Exception combined with logger.debug can hide important errors, such as a corrupted GGUF file or incorrect metadata types. This could lead to silent failures where the model produces incorrect output without a clear warning, which is the problem this PR aims to solve. It's better to log such errors at a warning level to alert the user and include the model path for easier debugging.

Suggested change
-    except Exception as e:
-        logger.debug("Error extracting softcap from GGUF: %s", e)
-        return {}
+    except Exception as e:
+        logger.warning("Error extracting softcap from GGUF '%s': %s", model, e)
+        return {}

@kitaekatt
Contributor Author

Technical Debt Note

This PR implements a vLLM-side workaround for incomplete Transformers GGUF config extraction.

Root Cause

The HuggingFace Transformers library's GGUF_CONFIG_MAPPING doesn't include attn_logit_softcapping for Gemma2/Gemma3 architectures. When loading GGUF models, this critical parameter isn't extracted from GGUF metadata, causing models to use incorrect default values.

Upstream Fix

I've submitted a fix to Transformers: huggingface/transformers#42881

The upstream fix adds the missing mapping:

"attention.logit_softcapping": "attn_logit_softcapping"

Cleanup Plan

Once huggingface/transformers#42881 is merged and released:

  1. This vLLM workaround can be removed
  2. The parameter will be properly extracted via Transformers' standard GGUF loading
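
Once that lands, the value should come out of Transformers' normal GGUF config path without any vLLM patching, roughly like this (repo id and file name below are illustrative placeholders):

from transformers import AutoConfig

# Load a config straight from a GGUF file via Transformers' GGUF support.
config = AutoConfig.from_pretrained(
    "google/gemma-2-2b-it-GGUF",            # illustrative repo id
    gguf_file="gemma-2-2b-it-Q4_K_M.gguf",  # illustrative file name
)
print(config.attn_logit_softcapping)  # should print the value stored in GGUF metadata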

Timeline

  • Transformers PR submitted: 2025-12-15
  • Expected merge: TBD (pending review)

@kitaekatt
Contributor Author

Update: Upstream Transformers fix submitted: huggingface/transformers#42881

Once merged, this vLLM workaround can be removed as Transformers will properly extract attn_logit_softcapping from GGUF metadata.

@kitaekatt
Contributor Author

Addressed @gemini-code-assist feedback in 3297f67:

  • Changed logger.debug to logger.warning
  • Added model path to log message for easier debugging

Thanks for the suggestion!

@gemini-code-assist
Contributor

That's great to hear, @kitaekatt! I appreciate you addressing the feedback and incorporating the logger.warning and the model path for improved debugging. Thank you!

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Address gemini-code-assist feedback: only catch file-not-found errors,
not all OSError types (like permission denied).

Check for FileNotFoundError or HuggingFace Hub's 'does not appear to
have a file named' message before suppressing the exception.

Signed-off-by: Christina Brundage <christina@katt.dev>
Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit bf3e3b9)
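
A sketch of that narrower handling, assuming a helper that resolves a GGUF file from a Hub repo (the function and argument names are illustrative, not the PR's exact code):

from huggingface_hub import hf_hub_download

def maybe_download_gguf(repo_id: str, filename: str):
    """Return a local GGUF path, or None if the repo simply has no such file."""
    try:
        return hf_hub_download(repo_id=repo_id, filename=filename)
    except Exception as e:
        # Suppress only the "file is missing" cases; anything else
        # (permission denied, network failures, ...) should surface.
        if isinstance(e, FileNotFoundError) or \
                "does not appear to have a file named" in str(e):
            return None
        raise
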
The GGUF loader yields quantization metadata parameters (qweight_type)
for all quantized tensors, including embeddings. However,
VocabParallelEmbedding doesn't have these parameters, causing a
KeyError when loading GGUF Gemma2 models.

This adds a safety check to skip parameters not present in the model,
matching the pattern already used in llama.py (lines 502-503).

Fixes KeyError: 'embed_tokens.qweight_type' during engine core init.

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit df8cfb1)
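
The guard looks roughly like the standard vLLM load_weights loop below (illustrative sketch of a model-class method; only the "if name not in params_dict: continue" check is what this commit adds, and the import path follows current vLLM):

from vllm.model_executor.model_loader.weight_utils import default_weight_loader

def load_weights(self, weights):
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        # GGUF emits qweight_type entries even for layers that have no such
        # parameter (e.g. embed_tokens); skip anything the model doesn't own.
        if name not in params_dict:
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)
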
The Gemma2 model was missing the quant_config parameter in the
VocabParallelEmbedding initialization, causing GGUF quantized
embeddings to be misinterpreted as float values.

Without quant_config, GGUF models use UnquantizedEmbeddingMethod
which calls F.embedding() directly on quantized bytes, resulting
in garbage output during inference.

This is the same bug that was fixed for DeepSeek in commit aa375dc
("Missing quant_config in deepseek embedding layer (vllm-project#12836)").

The fix adds:
- quant_config parameter to enable GGUFEmbeddingMethod selection
- prefix parameter for proper weight mapping

Fixes Gemma2 GGUF models (gemma-2-2b-it-GGUF, etc.) producing garbage
output like: " GHFW側から ThinkmariKeywords!")...

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 5837f8e)
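
Before/after sketch of the embedding construction inside the Gemma2 model's __init__ (illustrative; keyword names follow VocabParallelEmbedding's signature, and config, quant_config, and prefix are the usual surrounding variables in vLLM model code):

# Before: no quant_config, so GGUF-quantized embedding bytes were looked up
# as if they were float weights.
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
)

# After: quant_config selects the GGUF embedding method (dequantize on lookup),
# and prefix keeps the weight names mappable onto the checkpoint.
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
    quant_config=quant_config,
    prefix=f"{prefix}.embed_tokens",
)
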
Address gemini-code-assist feedback: use logger.warning instead of
logger.debug to alert users of potential GGUF metadata extraction issues.
Include model path in log message for easier debugging.

Signed-off-by: Christina Brundage <christina@katt.dev>
Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit f4de809)
@kitaekatt marked this pull request as draft December 15, 2025 20:37
…bility

Signed-off-by: Christina Hammer <christina.hammer3@gmail.com>
Signed-off-by: Christina Hammer <christina.hammer3@gmail.com>
Signed-off-by: Christina Hammer <christina.hammer3@gmail.com>
@kitaekatt force-pushed the fix/gemma2-gguf-softcap branch from f4de809 to 5613ad7 on December 15, 2025 20:47
Replace .endswith(".gguf") checks with check_gguf_file(), which properly detects GGUF files via their magic bytes (the "GGUF" header). This fixes GGUF detection for HuggingFace blob paths, which don't have a .gguf extension.

Affected functions:
- detect_gguf_multimodal()
- extract_softcap_from_gguf()

Signed-off-by: Christina <christina.hall@example.com>
Signed-off-by: Christina <truffle@gmail.com>
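
A minimal sketch of magic-byte detection in the spirit of check_gguf_file() (an illustrative re-implementation, not the exact vLLM helper):

from pathlib import Path

def is_gguf_file(path) -> bool:
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # Every GGUF file starts with the ASCII magic "GGUF", regardless of
        # what the file name or extension looks like (e.g. Hub blob paths).
        return f.read(4) == b"GGUF"
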
@kitaekatt force-pushed the fix/gemma2-gguf-softcap branch from 5613ad7 to 5ab69f5 on December 15, 2025 21:23
@hmellor
Member

hmellor commented Dec 16, 2025

I'm going to close this in favour of your one-liner in Transformers. This is quite a large change in vLLM, and the solution in Transformers is elegant.

@hmellor closed this Dec 16, 2025