fix(gguf): GGUF model support fixes for Blackwell GPUs #30497
Conversation
…ager Signed-off-by: Christina <truffle@gmail.com> (cherry picked from commit 0f27680)
…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly` in the transformers library, which uses multiprocessing primitives. When this happens in both the APIServer process (for request validation) and the EngineCore subprocess (via StructuredOutputManager), the subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer in EngineCore is never loaded
- For requests that do use structured output, tokenizer is loaded on-demand

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit a72d1f9)
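The lazy pattern can be sketched roughly as follows; the class, the `cached_property` loader, and the tokenizer call are illustrative stand-ins, not vLLM's actual internals:

```python
# A minimal sketch of the lazy-initialization pattern described above.
from functools import cached_property

from transformers import AutoTokenizer


class LazyStructuredOutputManager:
    def __init__(self, model_name: str):
        self.model_name = model_name
        # Deliberately no tokenizer construction here: doing it eagerly in
        # the EngineCore subprocess would trigger build_merges_on_the_fly
        # (and its multiprocessing primitives) even when no request ever
        # uses structured output.

    @cached_property
    def tokenizer(self):
        # Loaded only on first access, i.e. the first structured-output
        # request; plain generation requests never pay this cost.
        return AutoTokenizer.from_pretrained(self.model_name)

    def grammar_init(self, request) -> None:
        tok = self.tokenizer  # first call loads, later calls reuse
        # ... build the grammar / token bitmasks from `tok` here ...
```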
…bility GGUF-loaded configs may only have hidden_activation from config.json, but Gemma2MLP model code expects hidden_act attribute. This adds a post-processing step to copy hidden_activation to hidden_act when needed. Fixes AttributeError: 'Gemma2Config' object has no attribute 'hidden_act' when loading Gemma2 GGUF models. Signed-off-by: Christina <truffle@gmail.com> (cherry picked from commit 04ceef5)
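A rough sketch of this post-processing step; the helper name is hypothetical, not the function added by this PR:

```python
def normalize_gemma2_activation(hf_config) -> None:
    # Gemma2MLP reads `hidden_act`, but a GGUF-derived config may only
    # carry `hidden_activation`; mirror the value so either spelling works.
    if not hasattr(hf_config, "hidden_act") and hasattr(hf_config, "hidden_activation"):
        hf_config.hidden_act = hf_config.hidden_activation
```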
For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight is initialized from embed_tokens weights rather than loaded separately. Add lm_head.weight to sideload_params to allow GGUF loading to succeed without requiring this parameter to be mapped. Fixes: RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight'] Signed-off-by: Christina <christina@example.com> (cherry picked from commit 9512f74)
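An illustrative sketch of the idea only: when word embeddings are tied, lm_head.weight is expected to be absent from the GGUF tensor map and should not count as a mapping failure. The `sideload_params` set and the function are local illustrations, not vLLM's actual data structures.

```python
def check_gguf_mapping(model_params: set, mapped_params: set,
                       tie_word_embeddings: bool) -> None:
    # lm_head.weight is initialized from embed_tokens when embeddings are
    # tied, so it never appears in the GGUF tensor map.
    sideload_params = {"lm_head.weight"} if tie_word_embeddings else set()
    missing = model_params - mapped_params - sideload_params
    if missing:
        raise RuntimeError(
            f"Failed to map GGUF parameters ({len(missing)}): {sorted(missing)}")
```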
The GGUF loader yields quantization metadata parameters (qweight_type) for all quantized tensors, including embeddings. However, VocabParallelEmbedding doesn't have these parameters, causing a KeyError when loading GGUF Gemma2 models. This adds a safety check to skip parameters not present in the model, matching the pattern already used in llama.py (lines 502-503). Fixes KeyError: 'embed_tokens.qweight_type' during engine core init. (cherry picked from commit 1c144b2)
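A sketch of the safety check, following the usual vLLM weight-loading shape (function and variable names are illustrative):

```python
from torch import nn


def load_gguf_weights(model: nn.Module, weights):
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in weights:
        # GGUF yields qweight_type entries even for layers (such as the
        # embedding) that have no matching parameter; skip them instead of
        # raising KeyError.
        if name not in params_dict:
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", None)
        if weight_loader is not None:
            weight_loader(param, loaded_weight)
        else:
            param.data.copy_(loaded_weight)
```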
The Gemma2 model was missing the quant_config parameter in the VocabParallelEmbedding initialization, causing GGUF quantized embeddings to be misinterpreted as float values. Without quant_config, GGUF models use UnquantizedEmbeddingMethod which calls F.embedding() directly on quantized bytes, resulting in garbage output during inference. This is the same bug that was fixed for DeepSeek in commit aa375dc ("Missing quant_config in deepseek embedding layer (vllm-project#12836)").

The fix adds:
- quant_config parameter to enable GGUFEmbeddingMethod selection
- prefix parameter for proper weight mapping

Fixes Gemma2 GGUF models (gemma-2-2b-it-GGUF, etc.) producing garbage output like: " GHFW側から ThinkmariKeywords!")...

(cherry picked from commit d41b71f)
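A sketch of the embedding construction with the two new arguments; the import path and keyword names follow vLLM, but treat the exact signature as an assumption of this sketch:

```python
from vllm.model_executor.layers.vocab_parallel_embedding import (
    VocabParallelEmbedding)


def build_embed_tokens(config, quant_config, prefix: str):
    # With quant_config set, GGUF checkpoints get GGUFEmbeddingMethod
    # instead of UnquantizedEmbeddingMethod, so the quantized embedding
    # bytes are dequantized rather than fed to F.embedding() as floats.
    return VocabParallelEmbedding(
        config.vocab_size,
        config.hidden_size,
        quant_config=quant_config,
        prefix=f"{prefix}.embed_tokens",
    )
```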
Fixes garbage output from Gemma2 GGUF models by extracting the attn_logit_softcapping parameter from GGUF metadata and patching it onto the HuggingFace config.

Root cause: GGUF models store softcap in metadata with arch-specific keys (e.g., gemma2.attn_logit_softcapping), but this wasn't being extracted and applied to the HF config. Without softcap, the V1 FlashAttention backend uses softcap=0 (disabled), causing numerical instability and garbage output.

Changes:
- Add extract_softcap_from_gguf() to read softcap from GGUF metadata
- Update maybe_patch_hf_config_from_gguf() to apply softcap values
- Support both attn_logit_softcapping and final_logit_softcapping

Tested with: google/gemma-2-2b-it:Q4_K_M

(cherry picked from commit b9e724d)
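A sketch of the softcap patching, assuming the GGUF metadata has already been read into a plain dict keyed by GGUF field names (the helper name is hypothetical, not the PR's exact function):

```python
def patch_softcap_from_gguf(hf_config, gguf_metadata: dict, arch: str) -> None:
    # GGUF stores these under arch-prefixed keys, e.g.
    # "gemma2.attn_logit_softcapping".
    for attr in ("attn_logit_softcapping", "final_logit_softcapping"):
        value = gguf_metadata.get(f"{arch}.{attr}")
        if value is not None:
            setattr(hf_config, attr, float(value))
```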
GGUF dequantization kernels use half precision (fp16) internally via the `dfloat` typedef. On Blackwell GPUs (sm_120), using bfloat16 causes garbage output due to dtype mismatch.

Approach taken (middle ground):
- arg_utils.py: Auto-set dtype to float16 when dtype="auto" for GGUF
- gguf.py: Keep bfloat16 in supported_act_dtypes for explicit override

This defaults to safe behavior while preserving user control. Users on hardware where bfloat16 works can still use --dtype bfloat16 explicitly.

Options considered:
1. Blanket removal of bfloat16 from GGUF - rejected (breaks working configs)
2. Blackwell-specific detection - rejected (maintenance burden, edge cases)
3. Default fp16 + allow explicit bf16 - chosen (simple, safe, preserves choice)

Tested on RTX 5090 (sm_120) with Qwen3-4B-GGUF: 583.8 tok/s

Signed-off-by: Christina <truffle@gmail.com>
…ity check Instead of removing bfloat16 support globally, use device capability detection to disable bfloat16 only on SM 120+ devices (Blackwell). This preserves bfloat16 support on older architectures where tests show it works correctly, while preventing precision issues on Blackwell. Co-Authored-By: Isotr0py <isotr0py@users.noreply.github.com> Signed-off-by: Christina <truffle@gmail.com>
Per review feedback: the arg_utils.py dtype override breaks Gemma2 GGUF which doesn't support FP16. The Blackwell-specific bfloat16 restriction in gguf.py's get_supported_act_dtypes() is sufficient - let _resolve_auto_dtype handle dtype selection automatically. Signed-off-by: Christina <truffle@gmail.com>
Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but dtype=auto picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select compatible dtype when model and quantization restrictions conflict, instead of failing with an error

This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell by automatically falling back to float32.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
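A sketch of the fallback idea under simplified assumptions: the model's blocked dtypes and the quantization method's supported dtypes are passed in as sets (the real change lives in vLLM's dtype resolution and is more involved):

```python
import torch


def pick_compatible_dtype(model_blocked: set, quant_supported: set) -> torch.dtype:
    # Gemma3 blocks float16 and GGUF-on-Blackwell drops bfloat16, so this
    # walks the preference order until something satisfies both sides
    # (float32 in that case) instead of erroring out.
    for dtype in (torch.bfloat16, torch.float16, torch.float32):
        if dtype not in model_blocked and dtype in quant_supported:
            return dtype
    raise ValueError("No dtype satisfies both model and quantization constraints")
```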
- Change has_device_capability(120) to (100) for Blackwell SM 10.0
- Update comment and warning message to correctly reference SM 10.0
- Remove redundant torch import (already imported at file top)

Signed-off-by: Christina <truffle@gmail.com>
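A sketch of the capability-gated dtype list after this fix-up (SM 10.0 threshold); `current_platform.has_device_capability` is the vLLM helper named in the commit, while the surrounding function name is illustrative:

```python
import torch
from vllm.platforms import current_platform


def gguf_supported_act_dtypes():
    supported = [torch.half]
    # GGUF dequantization works in fp16 internally; on Blackwell-class
    # devices bfloat16 activations were observed to produce garbage, so
    # only advertise bfloat16 below that capability.
    if not current_platform.has_device_capability(100):
        supported.append(torch.bfloat16)
    return supported
```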
GGUF files store the correct EOS token ID in tokenizer.ggml.eos_token_id metadata. However, vLLM was using the HuggingFace tokenizer's eos_token_id, which can differ from the GGUF value. This causes generation to not stop properly for models like Gemma 3, where:
- GGUF metadata specifies EOS token ID 106 (<end_of_turn>)
- HF tokenizer reports EOS token ID 1 (<eos>)

The model generates <end_of_turn> to signal completion, but vLLM waits for token ID 1 which never comes, resulting in repeated EOS tokens until max_tokens is reached.

Changes:
- Add extract_eos_token_id_from_gguf() in gguf_utils.py to read EOS from GGUF
- Patch tokenizer.eos_token_id in hf.py when loading GGUF tokenizers

Signed-off-by: Christina Zhu <christina.zhu@hotmail.com>
Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit d8cf5b7)
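A sketch of the EOS patching, again assuming the GGUF metadata is available as a plain dict (the helper name is hypothetical):

```python
def patch_eos_token_id(tokenizer, gguf_metadata: dict):
    gguf_eos = gguf_metadata.get("tokenizer.ggml.eos_token_id")
    if gguf_eos is not None and int(gguf_eos) != tokenizer.eos_token_id:
        # e.g. Gemma 3: GGUF says 106 (<end_of_turn>) while the HF
        # tokenizer reports 1 (<eos>); trust the GGUF value so generation
        # stops when the model emits its end-of-turn token.
        tokenizer.eos_token_id = int(gguf_eos)
    return tokenizer
```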
…nfig.json

This enables vLLM to load GGUF models from repositories that don't include config.json (e.g., bartowski repos) by extracting the configuration values directly from GGUF metadata.

Changes:
- Add GGUF_ARCH_TO_HF_MODEL_TYPE mapping for architecture name translation
- Add extract_hf_config_from_gguf() function that reads GGUF metadata and constructs a HuggingFace-compatible config dictionary
- Add GGUFConfigParser class that uses the extraction function
- Register "gguf" format in the config parser system
- Update auto-detection to use GGUF parser for local GGUF files without config.json

Extracted metadata fields:
- Architecture (model_type)
- hidden_size, intermediate_size, num_hidden_layers
- num_attention_heads, num_key_value_heads
- max_position_embeddings, rope_theta, rms_norm_eps
- sliding_window, vocab_size
- bos_token_id, eos_token_id
- attn_logit_softcapping, final_logit_softcapping (for Gemma2)

Signed-off-by: Christina <truffle@gmail.com>
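A condensed sketch of the extraction on an already-parsed metadata dict. The key names follow standard GGUF conventions; the mapping table is a small illustrative subset, not the PR's full GGUF_ARCH_TO_HF_MODEL_TYPE:

```python
GGUF_ARCH_TO_HF_MODEL_TYPE = {
    "llama": "llama",
    "gemma2": "gemma2",
    "gemma3": "gemma3",
}


def extract_hf_config_from_gguf_sketch(metadata: dict) -> dict:
    arch = metadata["general.architecture"]
    cfg = {
        "model_type": GGUF_ARCH_TO_HF_MODEL_TYPE.get(arch, arch),
        "hidden_size": metadata.get(f"{arch}.embedding_length"),
        "intermediate_size": metadata.get(f"{arch}.feed_forward_length"),
        "num_hidden_layers": metadata.get(f"{arch}.block_count"),
        "num_attention_heads": metadata.get(f"{arch}.attention.head_count"),
        "num_key_value_heads": metadata.get(f"{arch}.attention.head_count_kv"),
        "max_position_embeddings": metadata.get(f"{arch}.context_length"),
        "rope_theta": metadata.get(f"{arch}.rope.freq_base"),
        "rms_norm_eps": metadata.get(f"{arch}.attention.layer_norm_rms_epsilon"),
        "vocab_size": metadata.get(f"{arch}.vocab_size"),
        "bos_token_id": metadata.get("tokenizer.ggml.bos_token_id"),
        "eos_token_id": metadata.get("tokenizer.ggml.eos_token_id"),
    }
    # Drop fields the GGUF file doesn't provide so HF defaults apply.
    return {k: v for k, v in cfg.items() if v is not None}
```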
The shared memory ring buffer protocol in shm_broadcast.py uses plain byte writes to signal between writer and reader processes. On multi-core systems, these writes may stay in CPU store buffers and not be visible to other processes running on different cores, causing indefinite spinning/freeze under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release (which provides full barrier semantics per POSIX.1-2008) at four critical points:
- In acquire_write(): before reading flags and after setting written flag
- In acquire_read(): before reading flags and after setting read flag

The memory barrier ensures that:
1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes freeze observed during sustained concurrent batch inference (~500+ requests) where both writer and readers would spin indefinitely waiting for flags that were updated but not visible across CPU cores.

Signed-off-by: Christina Holland <hey@christinaholland.com>
Signed-off-by: Christina <truffle@gmail.com>
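A minimal sketch of the barrier mechanism described here: an uncontended threading.Lock acquire/release pair used purely for its memory-ordering effect around the flag bytes (names illustrative, not the shm_broadcast.py code):

```python
import threading

_barrier_lock = threading.Lock()


def memory_fence() -> None:
    # Per the commit's rationale, lock acquire/release act as a full
    # barrier: pending stores are flushed before the fence completes, and
    # loads issued afterwards are not served from stale values.
    _barrier_lock.acquire()
    _barrier_lock.release()


# Usage mirroring the four points above: call memory_fence() before
# reading the written/read flags and again right after setting them in
# acquire_write() / acquire_read().
```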
…ager Signed-off-by: Christina <truffle@gmail.com>
Code Review
This pull request introduces a comprehensive set of fixes and improvements for GGUF model support, with a particular focus on compatibility with new Blackwell GPUs. The changes address several key areas including data type compatibility, configuration extraction from GGUF metadata, tokenizer handling, and memory synchronization for inter-process communication. The implementation of memory fences for shared memory is a critical fix for ensuring correctness in a multi-process environment. Additionally, the lazy initialization of the tokenizer effectively resolves a resource leak issue. The new GGUF config parser enhances usability by allowing models without a config.json to be loaded seamlessly. The model-specific adjustments for Gemma2 and Nemotron-H are also well-implemented. Overall, this is a high-quality pull request with significant and well-executed improvements.
This pull request has merge conflicts that must be resolved before it can be merged.
Summary
This PR bundles several fixes for GGUF model support, particularly targeting Blackwell (RTX 5090, SM 120+) GPUs. These fixes address dtype compatibility, config extraction, tokenizer handling, and memory synchronization issues.
Key Changes
GGUF Dtype Compatibility (Blackwell):
- Disable bfloat16 for GGUF on Blackwell GPUs via device capability detection in `get_supported_act_dtypes()`
- Auto-select a compatible dtype (e.g. float32 for Gemma3 GGUF) when model and quantization restrictions conflict

GGUF Config Extraction:
- Extract `attn_logit_softcapping` from GGUF metadata
- Copy `hidden_activation` to `hidden_act` for backward compatibility
- Build a HuggingFace-compatible config from GGUF metadata for repos without config.json

Gemma2 GGUF Support:
- Pass `quant_config` to the embedding layer for GGUF support
- Handle `lm_head` mapping for models with tied word embeddings
- Skip GGUF metadata parameters not present in the model

Memory & IPC Fixes:
- Add memory barriers to the `shm_broadcast` shared-memory ring buffer
- Make tokenizer initialization in `StructuredOutputManager` lazy to avoid a leaked semaphore

Model Fixes:
- Patch the EOS token ID from GGUF metadata (e.g. Gemma 3's `<end_of_turn>`)
- Model-specific adjustments for Nemotron-H
Test plan
- bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M)
- google/gemma-2-2b-it GGUF (Q4_K_M)
- Qwen3-4B-GGUF on RTX 5090 (sm_120): 583.8 tok/s
Related PRs
🤖 Generated with Claude Code