@kitaekatt
Contributor

Summary

This PR bundles several fixes for GGUF model support, particularly targeting Blackwell (RTX 5090, SM 120+) GPUs. These fixes address dtype compatibility, config extraction, tokenizer handling, and memory synchronization issues.

Key Changes

GGUF Dtype Compatibility (Blackwell):

  • Auto-select compatible dtype for GGUF models on Blackwell GPUs
  • Default GGUF to float16 while preserving bfloat16 option for compatible hardware
  • Disable bfloat16 on Blackwell (SM 120+) via device capability check

GGUF Config Extraction:

  • Extract HF config from GGUF metadata for repos without config.json
  • Extract attn_logit_softcapping from GGUF metadata
  • Use EOS token ID from GGUF metadata instead of HF tokenizer
  • Ensure Gemma2 configs have hidden_act for backward compatibility

Gemma2 GGUF Support:

  • Add quant_config to embedding layer for GGUF support
  • Skip missing parameters during GGUF weight loading
  • Skip lm_head mapping for models with tied word embeddings

Memory & IPC Fixes:

  • Add memory barriers for cross-process shared memory visibility
  • Lazy tokenizer init in StructuredOutputManager to prevent semaphore leak

Model Fixes:

  • Add rotary positional embeddings to Nemotron-H attention layers

Test plan

  • Test GGUF models (gemma-3-1b-it, phi35-mini-gguf) on Blackwell GPU
  • Verify dtype auto-detection works correctly
  • Test Gemma2 GGUF models with tied embeddings
  • Verify shared memory barriers don't cause regressions

Related PRs

🤖 Generated with Claude Code

kitaekatt and others added 17 commits December 10, 2025 15:28
…ager

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 0f27680)
…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly`
in the transformers library, which uses multiprocessing primitives.
When this happens in both the APIServer process (for request validation)
and the EngineCore subprocess (via StructuredOutputManager), the
subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer
  in EngineCore is never loaded
- For requests that do use structured output, tokenizer is loaded on-demand

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked
  semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).
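
A minimal sketch of the lazy-init pattern described above; apart from grammar_init(), the names and constructor signature here are illustrative, not vLLM's exact code:

```python
from typing import Any, Callable, Optional


class StructuredOutputManager:
    """Sketch of the lazy tokenizer initialization (assumed signature)."""

    def __init__(self, tokenizer_factory: Callable[[], Any]):
        self._tokenizer_factory = tokenizer_factory
        self._tokenizer: Optional[Any] = None  # not loaded in __init__

    @property
    def tokenizer(self) -> Any:
        if self._tokenizer is None:
            # Only structured-output requests pay the load cost; plain
            # inference requests never build merges in the EngineCore process.
            self._tokenizer = self._tokenizer_factory()
        return self._tokenizer

    def grammar_init(self, request: Any) -> None:
        tok = self.tokenizer  # first call triggers the actual load
        # ... compile the grammar for `request` using `tok` ...
```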

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit a72d1f9)
…bility

GGUF-loaded configs may only have hidden_activation from config.json,
but Gemma2MLP model code expects hidden_act attribute. This adds a
post-processing step to copy hidden_activation to hidden_act when needed.

Fixes AttributeError: 'Gemma2Config' object has no attribute 'hidden_act'
when loading Gemma2 GGUF models.
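
A minimal sketch of that post-processing step, assuming it runs right after the HF config is built (the helper name is made up):

```python
def ensure_gemma2_hidden_act(hf_config) -> None:
    """Copy hidden_activation to hidden_act when only the former is set."""
    # Gemma2MLP reads config.hidden_act, but GGUF-derived configs may only
    # carry hidden_activation.
    if not hasattr(hf_config, "hidden_act") and hasattr(hf_config, "hidden_activation"):
        hf_config.hidden_act = hf_config.hidden_activation
```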

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 04ceef5)
For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight
is initialized from embed_tokens weights rather than loaded separately.
Add lm_head.weight to sideload_params to allow GGUF loading to succeed
without requiring this parameter to be mapped.

Fixes: RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']
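
Sketched below under the assumption that sideload_params is a set of parameter names the GGUF mapper is allowed to leave unmapped (the helper name is made up):

```python
def collect_sideload_params(hf_config) -> set:
    """Parameters that need no GGUF tensor mapping."""
    sideload = set()
    if getattr(hf_config, "tie_word_embeddings", False):
        # lm_head.weight is initialized from embed_tokens, so the GGUF file
        # has no separate tensor for it and mapping must not fail on it.
        sideload.add("lm_head.weight")
    return sideload
```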

Signed-off-by: Christina <christina@example.com>
(cherry picked from commit 9512f74)
The GGUF loader yields quantization metadata parameters (qweight_type)
for all quantized tensors, including embeddings. However,
VocabParallelEmbedding doesn't have these parameters, causing a
KeyError when loading GGUF Gemma2 models.

This adds a safety check to skip parameters not present in the model,
matching the pattern already used in llama.py (lines 502-503).

Fixes KeyError: 'embed_tokens.qweight_type' during engine core init.
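
The guard looks roughly like this (a sketch of the llama.py-style pattern; the loader plumbing around it is assumed):

```python
def load_weights(model, weights, default_weight_loader):
    """Skip GGUF-only entries that have no matching model parameter."""
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in weights:
        if name not in params_dict:
            # e.g. 'embed_tokens.qweight_type': quantization metadata emitted
            # by the GGUF loader for a layer that has no such parameter.
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)
```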

(cherry picked from commit 1c144b2)
The Gemma2 model was missing the quant_config parameter in the
VocabParallelEmbedding initialization, causing GGUF quantized
embeddings to be misinterpreted as float values.

Without quant_config, GGUF models use UnquantizedEmbeddingMethod
which calls F.embedding() directly on quantized bytes, resulting
in garbage output during inference.

This is the same bug that was fixed for DeepSeek in commit aa375dc
("Missing quant_config in deepseek embedding layer (vllm-project#12836)").

The fix adds:
- quant_config parameter to enable GGUFEmbeddingMethod selection
- prefix parameter for proper weight mapping

Fixes Gemma2 GGUF models (gemma-2-2b-it-GGUF, etc.) producing garbage
output like: " GHFW側から ThinkmariKeywords!")...
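
The change amounts to passing the two extra arguments when the embedding is constructed; a sketch (the import path matches current vLLM, the surrounding wiring is assumed):

```python
from vllm.model_executor.layers.vocab_parallel_embedding import (
    VocabParallelEmbedding)


def build_embed_tokens(config, quant_config, prefix: str):
    return VocabParallelEmbedding(
        config.vocab_size,
        config.hidden_size,
        # Without quant_config, GGUF falls back to UnquantizedEmbeddingMethod
        # and F.embedding() runs on raw quantized bytes.
        quant_config=quant_config,
        # prefix is needed so GGUF weight names map onto this layer.
        prefix=f"{prefix}.embed_tokens",
    )
```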

(cherry picked from commit d41b71f)
Fixes garbage output from Gemma2 GGUF models by extracting the
attn_logit_softcapping parameter from GGUF metadata and patching
it onto the HuggingFace config.

Root cause: GGUF models store softcap in metadata with arch-specific
keys (e.g., gemma2.attn_logit_softcapping), but this wasn't being
extracted and applied to the HF config. Without softcap, the V1
FlashAttention backend uses softcap=0 (disabled), causing numerical
instability and garbage output.

Changes:
- Add extract_softcap_from_gguf() to read softcap from GGUF metadata
- Update maybe_patch_hf_config_from_gguf() to apply softcap values
- Support both attn_logit_softcapping and final_logit_softcapping

Tested with: google/gemma-2-2b-it:Q4_K_M
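
A sketch of the extraction, assuming the GGUF metadata has already been read into a plain key/value dict (the real extract_softcap_from_gguf() works off the GGUF reader):

```python
def extract_softcap_from_gguf(metadata: dict, arch: str) -> dict:
    """Return any softcap values stored under arch-specific GGUF keys."""
    patch = {}
    for hf_key in ("attn_logit_softcapping", "final_logit_softcapping"):
        gguf_key = f"{arch}.{hf_key}"  # e.g. "gemma2.attn_logit_softcapping"
        if gguf_key in metadata:
            patch[hf_key] = float(metadata[gguf_key])
    return patch
```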

(cherry picked from commit b9e724d)
GGUF dequantization kernels use half precision (fp16) internally via the
`dfloat` typedef. On Blackwell GPUs (sm_120), using bfloat16 causes garbage
output due to dtype mismatch.

Approach taken (middle ground):
- arg_utils.py: Auto-set dtype to float16 when dtype="auto" for GGUF
- gguf.py: Keep bfloat16 in supported_act_dtypes for explicit override

This defaults to safe behavior while preserving user control. Users on
hardware where bfloat16 works can still use --dtype bfloat16 explicitly.

Options considered:
1. Blanket removal of bfloat16 from GGUF - rejected (breaks working configs)
2. Blackwell-specific detection - rejected (maintenance burden, edge cases)
3. Default fp16 + allow explicit bf16 - chosen (simple, safe, preserves choice)

Tested on RTX 5090 (sm_120) with Qwen3-4B-GGUF: 583.8 tok/s
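
The arg_utils.py side of this is essentially the sketch below (this override is dropped again in a later commit in this series per review feedback):

```python
def resolve_gguf_dtype(requested_dtype: str, is_gguf: bool) -> str:
    """Default GGUF to fp16 on dtype='auto', keep explicit overrides."""
    if is_gguf and requested_dtype == "auto":
        # GGUF dequant kernels work in half precision internally (`dfloat`),
        # so fp16 is the safe default.
        return "float16"
    return requested_dtype  # an explicit --dtype bfloat16 still wins
```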

Signed-off-by: Christina <truffle@gmail.com>
…ity check

Instead of removing bfloat16 support globally, use device capability
detection to disable bfloat16 only on SM 120+ devices (Blackwell).

This preserves bfloat16 support on older architectures where tests show
it works correctly, while preventing precision issues on Blackwell.
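
Roughly, the gate in get_supported_act_dtypes() looks like this sketch; has_device_capability() is the check named later in this series (the threshold shown is the corrected value from that follow-up), and the rest of the wiring is assumed:

```python
import torch

from vllm.platforms import current_platform


def get_supported_act_dtypes() -> list:
    dtypes = [torch.half, torch.bfloat16]
    if current_platform.has_device_capability(100):  # SM 10.0+ (Blackwell)
        # bfloat16 mismatches the fp16 `dfloat` used inside the GGUF
        # dequantization kernels on Blackwell, so drop it here.
        dtypes.remove(torch.bfloat16)
    return dtypes
```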

Co-Authored-By: Isotr0py <isotr0py@users.noreply.github.com>
Signed-off-by: Christina <truffle@gmail.com>
Per review feedback: the arg_utils.py dtype override breaks Gemma2 GGUF
which doesn't support FP16. The Blackwell-specific bfloat16 restriction
in gguf.py's get_supported_act_dtypes() is sufficient - let
_resolve_auto_dtype handle dtype selection automatically.

Signed-off-by: Christina <truffle@gmail.com>
Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but dtype=auto picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select compatible dtype when model and quantization
   restrictions conflict, instead of failing with an error

This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell
by automatically falling back to float32.
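
A sketch of the fallback logic in vllm.py (function and set names are illustrative, not the exact code):

```python
import torch


def pick_compatible_dtype(model_allowed: set, quant_allowed: set) -> torch.dtype:
    """Pick a dtype both the model and the quantization method accept."""
    for candidate in (torch.bfloat16, torch.float16, torch.float32):
        if candidate in model_allowed and candidate in quant_allowed:
            return candidate
    raise ValueError("no dtype satisfies both model and quantization limits")


# Gemma3 GGUF on Blackwell: model allows {bf16, fp32}, GGUF allows {fp16, fp32}
# -> the only common dtype is torch.float32, matching the behaviour above.
```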

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change has_device_capability(120) to (100) for Blackwell SM 10.0
- Update comment and warning message to correctly reference SM 10.0
- Remove redundant torch import (already imported at file top)

Signed-off-by: Christina <truffle@gmail.com>
GGUF files store the correct EOS token ID in tokenizer.ggml.eos_token_id
metadata. However, vLLM was using the HuggingFace tokenizer's eos_token_id,
which can differ from the GGUF value.

This causes generation to not stop properly for models like Gemma 3, where:
- GGUF metadata specifies EOS token ID 106 (<end_of_turn>)
- HF tokenizer reports EOS token ID 1 (<eos>)

The model generates <end_of_turn> to signal completion, but vLLM waits for
token ID 1 which never comes, resulting in repeated EOS tokens until
max_tokens is reached.

Changes:
- Add extract_eos_token_id_from_gguf() in gguf_utils.py to read EOS from GGUF
- Patch tokenizer.eos_token_id in hf.py when loading GGUF tokenizers
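
A sketch of the two pieces, again assuming the GGUF metadata is available as a plain dict rather than via the GGUF reader:

```python
def extract_eos_token_id_from_gguf(metadata: dict):
    """Read tokenizer.ggml.eos_token_id from GGUF metadata, if present."""
    value = metadata.get("tokenizer.ggml.eos_token_id")
    return int(value) if value is not None else None


def patch_tokenizer_eos(tokenizer, metadata: dict) -> None:
    # Prefer the GGUF value (e.g. 106, <end_of_turn> for Gemma 3) over the
    # HF tokenizer default (e.g. 1, <eos>) so generation stops correctly.
    eos_id = extract_eos_token_id_from_gguf(metadata)
    if eos_id is not None and tokenizer.eos_token_id != eos_id:
        tokenizer.eos_token_id = eos_id
```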

Signed-off-by: Christina Zhu <christina.zhu@hotmail.com>
Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit d8cf5b7)
…nfig.json

This enables vLLM to load GGUF models from repositories that don't include
config.json (e.g., bartowski repos) by extracting the configuration values
directly from GGUF metadata.

Changes:
- Add GGUF_ARCH_TO_HF_MODEL_TYPE mapping for architecture name translation
- Add extract_hf_config_from_gguf() function that reads GGUF metadata and
  constructs a HuggingFace-compatible config dictionary
- Add GGUFConfigParser class that uses the extraction function
- Register "gguf" format in the config parser system
- Update auto-detection to use GGUF parser for local GGUF files without
  config.json

Extracted metadata fields:
- Architecture (model_type)
- hidden_size, intermediate_size, num_hidden_layers
- num_attention_heads, num_key_value_heads
- max_position_embeddings, rope_theta, rms_norm_eps
- sliding_window, vocab_size
- bos_token_id, eos_token_id
- attn_logit_softcapping, final_logit_softcapping (for Gemma2)
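
A sketch of the extraction over a plain metadata dict; the GGUF key names below follow the usual GGUF conventions, and the architecture mapping table is abbreviated:

```python
GGUF_ARCH_TO_HF_MODEL_TYPE = {"llama": "llama", "gemma2": "gemma2", "phi3": "phi3"}


def extract_hf_config_from_gguf(metadata: dict) -> dict:
    arch = metadata["general.architecture"]
    config = {
        "model_type": GGUF_ARCH_TO_HF_MODEL_TYPE.get(arch, arch),
        "hidden_size": metadata.get(f"{arch}.embedding_length"),
        "intermediate_size": metadata.get(f"{arch}.feed_forward_length"),
        "num_hidden_layers": metadata.get(f"{arch}.block_count"),
        "num_attention_heads": metadata.get(f"{arch}.attention.head_count"),
        "num_key_value_heads": metadata.get(f"{arch}.attention.head_count_kv"),
        "max_position_embeddings": metadata.get(f"{arch}.context_length"),
        "rope_theta": metadata.get(f"{arch}.rope.freq_base"),
        "rms_norm_eps": metadata.get(f"{arch}.attention.layer_norm_rms_epsilon"),
        "vocab_size": metadata.get(f"{arch}.vocab_size"),
        "bos_token_id": metadata.get("tokenizer.ggml.bos_token_id"),
        "eos_token_id": metadata.get("tokenizer.ggml.eos_token_id"),
    }
    # Drop fields absent from this particular GGUF file.
    return {k: v for k, v in config.items() if v is not None}
```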

Signed-off-by: Christina <truffle@gmail.com>
The shared memory ring buffer protocol in shm_broadcast.py uses plain byte
writes to signal between writer and reader processes. On multi-core systems,
these writes may stay in CPU store buffers and not be visible to other
processes running on different cores, causing indefinite spinning/freeze
under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release
(which provides full barrier semantics per POSIX.1-2008) at four critical
points:
- In acquire_write(): before reading flags and after setting written flag
- In acquire_read(): before reading flags and after setting read flag

The memory barrier ensures that:
1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes freeze observed during sustained concurrent batch inference (~500+
requests) where both writer and readers would spin indefinitely waiting
for flags that were updated but not visible across CPU cores.
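
A sketch of the barrier helper and where it sits, assuming a module-level private lock (the real placement is inside shm_broadcast.py's acquire_write()/acquire_read()):

```python
import threading

_barrier_lock = threading.Lock()


def memory_barrier() -> None:
    # Acquire/release of an uncontended lock acts as a full memory barrier
    # (POSIX.1-2008), flushing store buffers and discarding stale loads.
    _barrier_lock.acquire()
    _barrier_lock.release()


# Inside acquire_write() / acquire_read():
#   memory_barrier()        # before reading the written/read flag bytes
#   ... set the flag byte in shared memory ...
#   memory_barrier()        # after setting it, so other processes see it
```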

Signed-off-by: Christina Holland <hey@christinaholland.com>
Signed-off-by: Christina <truffle@gmail.com>
…ager

Signed-off-by: Christina <truffle@gmail.com>
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive set of fixes and improvements for GGUF model support, with a particular focus on compatibility with new Blackwell GPUs. The changes address several key areas including data type compatibility, configuration extraction from GGUF metadata, tokenizer handling, and memory synchronization for inter-process communication. The implementation of memory fences for shared memory is a critical fix for ensuring correctness in a multi-process environment. Additionally, the lazy initialization of the tokenizer effectively resolves a resource leak issue. The new GGUF config parser enhances usability by allowing models without a config.json to be loaded seamlessly. The model-specific adjustments for Gemma2 and Nemotron-H are also well-implemented. Overall, this is a high-quality pull request with significant and well-executed improvements.

mergify bot commented Dec 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Dec 12, 2025
@kitaekatt
Contributor Author

Closing this bundle PR. All fixes have their own focused PRs: #30409, #30410, #30411, #30412, #30413, #30424, #30427, #30434, #30500, #30699. PRs #30407 and #30408 are already merged.
