Conversation

@kitaekatt
Contributor

Summary

Fixes garbage output from Gemma2 GGUF models by extracting the attn_logit_softcapping parameter from GGUF metadata and patching it onto the HuggingFace config.

Root Cause

GGUF models store softcap in metadata with arch-specific keys (e.g., gemma2.attn_logit_softcapping), but this wasn't being extracted and applied to the HF config. Without softcap, the V1 FlashAttention backend uses softcap=0 (disabled), causing numerical instability and garbage output.

The attention backends ALREADY support softcap correctly - this fix is in the GGUF config mapping layer, not the attention kernels.
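
For context, logit softcapping squashes attention logits with a tanh before the softmax; Gemma2 was trained with this enabled, so running without it lets the logits grow unbounded. A minimal sketch of the formula on plain tensors (illustrative only; the real work happens inside the attention kernels):

import torch

def apply_softcap(logits: torch.Tensor, cap: float) -> torch.Tensor:
    # Gemma2-style logit softcapping: squashes values into (-cap, cap).
    # With the cap missing (treated as 0, i.e. disabled), the softmax sees
    # uncapped logits and the model degenerates into garbage output.
    return cap * torch.tanh(logits / cap)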

Changes

  • Add extract_softcap_from_gguf() to read softcap from GGUF metadata
  • Update maybe_patch_hf_config_from_gguf() to apply softcap values
  • Support both attn_logit_softcapping and final_logit_softcapping
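
A rough sketch of what the two helpers do, assuming the gguf package's GGUFReader and its ReaderField layout (parts/data indices). The function names mirror the PR description, but the body is illustrative, not the exact diff:

import logging

from gguf import GGUFReader

logger = logging.getLogger(__name__)

# GGUF metadata suffix -> HF config attribute. Keys are arch-prefixed in the
# file, e.g. "gemma2.attn_logit_softcapping".
_SOFTCAP_KEYS = {
    "attn_logit_softcapping": "attn_logit_softcapping",
    "final_logit_softcapping": "final_logit_softcapping",
}

def extract_softcap_from_gguf(model_path: str) -> dict:
    """Return softcap values found in GGUF metadata, keyed by HF config attribute."""
    try:
        reader = GGUFReader(model_path)
        found = {}
        for key, field in reader.fields.items():
            for suffix, hf_attr in _SOFTCAP_KEYS.items():
                if key.endswith("." + suffix):
                    # Scalar fields keep the value in the last data part.
                    found[hf_attr] = float(field.parts[field.data[-1]][0])
        return found
    except Exception as e:
        logger.warning("Error extracting softcap from GGUF '%s': %s", model_path, e)
        return {}

def maybe_patch_hf_config_from_gguf(hf_config, model_path: str) -> None:
    """Copy any softcap values from GGUF metadata onto the HF config."""
    for hf_attr, value in extract_softcap_from_gguf(model_path).items():
        setattr(hf_config, hf_attr, value)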

Testing

Tested with: google/gemma-2-2b-it:Q4_K_M - produces coherent output instead of garbage

Related

This is part of a series of GGUF fixes for Blackwell (SM 120+) compatibility.

@gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request correctly extracts attn_logit_softcapping and final_logit_softcapping from GGUF metadata to fix an issue with Gemma2 models producing garbage output. The changes are well-implemented. I have one suggestion to improve error reporting by changing a debug log to a warning, which will make it easier for users to diagnose issues with corrupted or misconfigured GGUF files.

Comment on lines 258 to 260
    except Exception as e:
        logger.debug("Error extracting softcap from GGUF: %s", e)
        return {}

Severity: high

The broad except Exception combined with logger.debug can hide important errors, such as a corrupted GGUF file or incorrect metadata types. This could lead to silent failures where the model produces incorrect output without a clear warning, which is the problem this PR aims to solve. It's better to log such errors at a warning level to alert the user and include the model path for easier debugging.

Suggested change
-    except Exception as e:
-        logger.debug("Error extracting softcap from GGUF: %s", e)
-        return {}
+    except Exception as e:
+        logger.warning("Error extracting softcap from GGUF '%s': %s", model, e)
+        return {}

@kitaekatt
Contributor Author

Technical Debt Note

This PR implements a vLLM-side workaround for incomplete Transformers GGUF config extraction.

Root Cause

The HuggingFace Transformers library's GGUF_CONFIG_MAPPING doesn't include attn_logit_softcapping for Gemma2/Gemma3 architectures. When loading GGUF models, this critical parameter isn't extracted from GGUF metadata, causing models to use incorrect default values.

Upstream Fix

I've submitted a fix to Transformers: huggingface/transformers#42881

The upstream fix adds the missing mapping:

"attention.logit_softcapping": "attn_logit_softcapping"

Cleanup Plan

Once huggingface/transformers#42881 is merged and released:

  1. This vLLM workaround can be removed
  2. The parameter will be properly extracted via Transformers' standard GGUF loading
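
Once that lands, the value should come out of Transformers' normal GGUF config path without any vLLM patching, roughly like this (repo id and file name below are illustrative placeholders):

from transformers import AutoConfig

# Load a config straight from a GGUF file via Transformers' GGUF support.
config = AutoConfig.from_pretrained(
    "google/gemma-2-2b-it-GGUF",            # illustrative repo id
    gguf_file="gemma-2-2b-it-Q4_K_M.gguf",  # illustrative file name
)
print(config.attn_logit_softcapping)  # should print the value stored in GGUF metadata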

Timeline

  • Transformers PR submitted: 2025-12-15
  • Expected merge: TBD (pending review)

@kitaekatt
Contributor Author

Update: Upstream Transformers fix submitted: huggingface/transformers#42881

Once merged, this vLLM workaround can be removed as Transformers will properly extract attn_logit_softcapping from GGUF metadata.

@kitaekatt
Contributor Author

Addressed @gemini-code-assist feedback in 3297f67:

  • Changed logger.debug to logger.warning
  • Added model path to log message for easier debugging

Thanks for the suggestion!

@gemini-code-assist
Contributor

That's great to hear, @kitaekatt! I appreciate you addressing the feedback and incorporating the logger.warning and the model path for improved debugging. Thank you!

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.

Address gemini-code-assist feedback: only catch file-not-found errors,
not all OSError types (like permission denied).

Check for FileNotFoundError or HuggingFace Hub's 'does not appear to
have a file named' message before suppressing the exception.

Signed-off-by: Christina Brundage <christina@katt.dev>
Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit bf3e3b9)
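
A sketch of that narrower handling, assuming a helper that resolves a GGUF file from a Hub repo (the function and argument names are illustrative, not the PR's exact code):

from huggingface_hub import hf_hub_download

def maybe_download_gguf(repo_id: str, filename: str):
    """Return a local GGUF path, or None if the repo simply has no such file."""
    try:
        return hf_hub_download(repo_id=repo_id, filename=filename)
    except Exception as e:
        # Suppress only the "file is missing" cases; anything else
        # (permission denied, network failures, ...) should surface.
        if isinstance(e, FileNotFoundError) or \
                "does not appear to have a file named" in str(e):
            return None
        raise
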
The GGUF loader yields quantization metadata parameters (qweight_type)
for all quantized tensors, including embeddings. However,
VocabParallelEmbedding doesn't have these parameters, causing a
KeyError when loading GGUF Gemma2 models.

This adds a safety check to skip parameters not present in the model,
matching the pattern already used in llama.py (lines 502-503).

Fixes KeyError: 'embed_tokens.qweight_type' during engine core init.

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit df8cfb1)
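
The guard looks roughly like the standard vLLM load_weights loop below (illustrative sketch of a model-class method; only the "if name not in params_dict: continue" check is what this commit adds, and the import path follows current vLLM):

from vllm.model_executor.model_loader.weight_utils import default_weight_loader

def load_weights(self, weights):
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        # GGUF emits qweight_type entries even for layers that have no such
        # parameter (e.g. embed_tokens); skip anything the model doesn't own.
        if name not in params_dict:
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)
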
The Gemma2 model was missing the quant_config parameter in the
VocabParallelEmbedding initialization, causing GGUF quantized
embeddings to be misinterpreted as float values.

Without quant_config, GGUF models use UnquantizedEmbeddingMethod
which calls F.embedding() directly on quantized bytes, resulting
in garbage output during inference.

This is the same bug that was fixed for DeepSeek in commit aa375dc
("Missing quant_config in deepseek embedding layer (vllm-project#12836)").

The fix adds:
- quant_config parameter to enable GGUFEmbeddingMethod selection
- prefix parameter for proper weight mapping

Fixes Gemma2 GGUF models (gemma-2-2b-it-GGUF, etc.) producing garbage
output like: " GHFW側から ThinkmariKeywords!")...

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 5837f8e)
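
Before/after sketch of the embedding construction inside the Gemma2 model's __init__ (illustrative; keyword names follow VocabParallelEmbedding's signature, and config, quant_config, and prefix are the usual surrounding variables in vLLM model code):

# Before: no quant_config, so GGUF-quantized embedding bytes were looked up
# as if they were float weights.
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
)

# After: quant_config selects the GGUF embedding method (dequantize on lookup),
# and prefix keeps the weight names mappable onto the checkpoint.
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
    quant_config=quant_config,
    prefix=f"{prefix}.embed_tokens",
)
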
Address gemini-code-assist feedback: use logger.warning instead of
logger.debug to alert users of potential GGUF metadata extraction issues.
Include model path in log message for easier debugging.

Signed-off-by: Christina Brundage <christina@katt.dev>
Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit f4de809)
@kitaekatt marked this pull request as draft December 15, 2025 20:37
…bility

Signed-off-by: Christina Hammer <christina.hammer3@gmail.com>
Signed-off-by: Christina Hammer <christina.hammer3@gmail.com>
Signed-off-by: Christina Hammer <christina.hammer3@gmail.com>
@kitaekatt force-pushed the fix/gemma2-gguf-softcap branch from f4de809 to 5613ad7 on December 15, 2025 20:47
Replace .endswith(".gguf") checks with check_gguf_file(), which properly detects GGUF files via their magic bytes (the "GGUF" header). This fixes GGUF detection for HuggingFace blob paths, which don't have a .gguf extension.

Affected functions:
- detect_gguf_multimodal()
- extract_softcap_from_gguf()

Signed-off-by: Christina <christina.hall@example.com>
Signed-off-by: Christina <truffle@gmail.com>
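
A minimal sketch of magic-byte detection in the spirit of check_gguf_file() (an illustrative re-implementation, not the exact vLLM helper):

from pathlib import Path

def is_gguf_file(path) -> bool:
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # Every GGUF file starts with the ASCII magic "GGUF", regardless of
        # what the file name or extension looks like (e.g. Hub blob paths).
        return f.read(4) == b"GGUF"
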
@kitaekatt force-pushed the fix/gemma2-gguf-softcap branch from 5613ad7 to 5ab69f5 on December 15, 2025 21:23
@hmellor
Member

hmellor commented Dec 16, 2025

I'm going to close this in favour of your one-liner in Transformers. This is quite a large change in vLLM, and the solution in Transformers is elegant.

@hmellor closed this Dec 16, 2025