fix(gguf): Extract attn_logit_softcapping from GGUF metadata #30427
Conversation
Code Review
This pull request correctly extracts attn_logit_softcapping and final_logit_softcapping from GGUF metadata to fix an issue with Gemma2 models producing garbage output. The changes are well-implemented. I have one suggestion to improve error reporting by changing a debug log to a warning, which will make it easier for users to diagnose issues with corrupted or misconfigured GGUF files.
    except Exception as e:
        logger.debug("Error extracting softcap from GGUF: %s", e)
        return {}
The broad except Exception combined with logger.debug can hide important errors, such as a corrupted GGUF file or incorrect metadata types. This could lead to silent failures where the model produces incorrect output without a clear warning, which is the problem this PR aims to solve. It's better to log such errors at a warning level to alert the user and include the model path for easier debugging.
Original:

    except Exception as e:
        logger.debug("Error extracting softcap from GGUF: %s", e)
        return {}

Suggested change:

    except Exception as e:
        logger.warning("Error extracting softcap from GGUF '%s': %s", model, e)
        return {}
Technical Debt Note
This PR implements a vLLM-side workaround for incomplete Transformers GGUF config extraction.
Root Cause
The HuggingFace Transformers library's GGUF config extraction is missing the mapping for the softcap metadata key, so attn_logit_softcapping is never applied to the reconstructed HF config.
Upstream Fix
I've submitted a fix to Transformers: huggingface/transformers#42881
The upstream fix adds the missing mapping: "attention.logit_softcapping": "attn_logit_softcapping"
Cleanup Plan
Once huggingface/transformers#42881 is merged and released:
Timeline
Update: Upstream Transformers fix submitted: huggingface/transformers#42881. Once merged, this vLLM workaround can be removed, as Transformers will properly extract attn_logit_softcapping from GGUF metadata.
Addressed @gemini-code-assist feedback in 3297f67: changed logger.debug to logger.warning and included the model path in the log message.
Thanks for the suggestion!
That's great to hear, @kitaekatt! I appreciate you addressing the feedback and incorporating the logger.warning change.
Force-pushed from 3297f67 to f4de809
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Address gemini-code-assist feedback: only catch file-not-found errors, not all OSError types (like permission denied). Check for FileNotFoundError or HuggingFace Hub's 'does not appear to have a file named' message before suppressing the exception.

Signed-off-by: Christina Brundage <christina@katt.dev>
Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit bf3e3b9)
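The narrowed handling this commit describes might look roughly like the sketch below. The helper name, the placeholder file read, and the function body are illustrative assumptions, not vLLM's actual code.

```python
import logging

logger = logging.getLogger(__name__)

def _is_missing_file_error(e: Exception) -> bool:
    """Return True only for genuine 'file not found' situations."""
    if isinstance(e, FileNotFoundError):
        return True
    # HuggingFace Hub reports a missing file inside an existing repo with
    # this phrase in the error message.
    return "does not appear to have a file named" in str(e)

def extract_softcap_from_gguf(model: str) -> dict:
    try:
        # Placeholder for the real GGUF metadata read.
        with open(model, "rb") as f:
            f.read(4)
    except Exception as e:
        if _is_missing_file_error(e):
            logger.warning("Error extracting softcap from GGUF '%s': %s", model, e)
            return {}
        # Permission errors, corrupted files, etc. should surface.
        raise
    return {}
```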
The GGUF loader yields quantization metadata parameters (qweight_type) for all quantized tensors, including embeddings. However, VocabParallelEmbedding doesn't have these parameters, causing a KeyError when loading GGUF Gemma2 models. This adds a safety check to skip parameters not present in the model, matching the pattern already used in llama.py (lines 502-503).

Fixes KeyError: 'embed_tokens.qweight_type' during engine core init.

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit df8cfb1)
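A minimal sketch of the safety check this commit describes, mirroring the llama.py pattern it cites; the function shape and names are illustrative, not the actual Gemma2 loader code.

```python
from typing import Iterable

import torch

def load_weights(params_dict: dict[str, torch.nn.Parameter],
                 weights: Iterable[tuple[str, torch.Tensor]]) -> None:
    for name, loaded_weight in weights:
        # GGUF quantization metadata entries (e.g. 'embed_tokens.qweight_type')
        # have no matching parameter on unquantized modules; skip them instead
        # of raising KeyError during engine core init.
        if name not in params_dict:
            continue
        param = params_dict[name]
        param.data.copy_(loaded_weight)
```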
The Gemma2 model was missing the quant_config parameter in the VocabParallelEmbedding initialization, causing GGUF quantized embeddings to be misinterpreted as float values. Without quant_config, GGUF models use UnquantizedEmbeddingMethod, which calls F.embedding() directly on quantized bytes, resulting in garbage output during inference. This is the same bug that was fixed for DeepSeek in commit aa375dc ("Missing quant_config in deepseek embedding layer (vllm-project#12836)").

The fix adds:
- quant_config parameter to enable GGUFEmbeddingMethod selection
- prefix parameter for proper weight mapping

Fixes Gemma2 GGUF models (gemma-2-2b-it-GGUF, etc.) producing garbage output like: " GHFW側から ThinkmariKeywords!")...

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 5837f8e)
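The change reads roughly as below inside the model's constructor. This is a sketch based on the commit message; config, quant_config, and prefix are assumed to come from the enclosing __init__, and the exact surrounding code in vLLM may differ.

```python
from vllm.model_executor.layers.vocab_parallel_embedding import VocabParallelEmbedding

# Before: no quant_config, so GGUF checkpoints fall back to
# UnquantizedEmbeddingMethod and F.embedding() runs on raw quantized bytes.
#   self.embed_tokens = VocabParallelEmbedding(config.vocab_size, config.hidden_size)

# After: quant_config enables GGUFEmbeddingMethod selection, and prefix gives
# the embedding weights a proper mapping path.
self.embed_tokens = VocabParallelEmbedding(
    config.vocab_size,
    config.hidden_size,
    quant_config=quant_config,
    prefix=f"{prefix}.embed_tokens",
)
```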
Address gemini-code-assist feedback: use logger.warning instead of logger.debug to alert users of potential GGUF metadata extraction issues. Include model path in log message for easier debugging.

Signed-off-by: Christina Brundage <christina@katt.dev>
Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit f4de809)
…bility Signed-off-by: Christina Hammer <christina.hammer3@gmail.com>
Signed-off-by: Christina Hammer <christina.hammer3@gmail.com>
Signed-off-by: Christina Hammer <christina.hammer3@gmail.com>
Force-pushed from f4de809 to 5613ad7
Replace .endswith(".gguf") checks with check_gguf_file() which properly
detects GGUF files via magic bytes ("GGUF" header). This fixes GGUF
detection for HuggingFace blob paths which don't have .gguf extension.
Affected functions:
- detect_gguf_multimodal()
- extract_softcap_from_gguf()
Signed-off-by: Christina <christina.hall@example.com>
Signed-off-by: Christina <truffle@gmail.com>
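For reference, magic-byte detection of a GGUF file amounts to something like the standalone sketch below; check_gguf_file() is the real vLLM helper, and this is only an illustration of the idea.

```python
from pathlib import Path

def looks_like_gguf(path: str) -> bool:
    """Detect a GGUF file by its 4-byte magic header instead of its extension."""
    p = Path(path)
    if not p.is_file():
        return False
    with p.open("rb") as f:
        # Every GGUF file begins with the ASCII magic "GGUF".
        return f.read(4) == b"GGUF"
```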
Force-pushed from 5613ad7 to 5ab69f5
I'm going to close this in favour of your one-liner in Transformers. This is quite a large change in vLLM, and the solution in Transformers is elegant.
Summary
Fixes garbage output from Gemma2 GGUF models by extracting the attn_logit_softcapping parameter from GGUF metadata and patching it onto the HuggingFace config.
Root Cause
GGUF models store softcap in metadata with arch-specific keys (e.g., gemma2.attn_logit_softcapping), but this wasn't being extracted and applied to the HF config. Without softcap, the V1 FlashAttention backend uses softcap=0 (disabled), causing numerical instability and garbage output.
The attention backends ALREADY support softcap correctly - this fix is in the GGUF config mapping layer, not the attention kernels.
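The extraction-and-patch flow might look roughly like the sketch below, using the gguf package's GGUFReader. The hard-coded "gemma2" prefix, the scalar field access via parts/data, and the function bodies are simplifying assumptions; the actual helpers in this PR may differ.

```python
from gguf import GGUFReader

def extract_softcap_from_gguf(gguf_path: str, arch: str = "gemma2") -> dict:
    """Read softcap values from arch-specific GGUF metadata keys."""
    reader = GGUFReader(gguf_path)
    out = {}
    for hf_key in ("attn_logit_softcapping", "final_logit_softcapping"):
        field = reader.fields.get(f"{arch}.{hf_key}")
        if field is not None:
            # Scalar fields expose their value through the parts/data arrays.
            out[hf_key] = float(field.parts[field.data[0]][0])
    return out

def maybe_patch_hf_config_from_gguf(hf_config, gguf_path: str) -> None:
    """Apply extracted softcap values onto the HuggingFace config object."""
    for key, value in extract_softcap_from_gguf(gguf_path).items():
        setattr(hf_config, key, value)
```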
Changes
- Added extract_softcap_from_gguf() to read softcap from GGUF metadata
- Added maybe_patch_hf_config_from_gguf() to apply softcap values
- Patches both attn_logit_softcapping and final_logit_softcapping onto the HF config
Testing
Tested with:
- google/gemma-2-2b-it:Q4_K_M produces coherent output instead of garbage
Related
This is part of a series of GGUF fixes for Blackwell (SM 120+) compatibility.