@kitaekatt
Contributor

Summary

This PR bundles several fixes for GGUF model support, particularly targeting Blackwell (RTX 5090, SM 120+) GPUs. These fixes address dtype compatibility, config extraction, tokenizer handling, and memory synchronization issues.

Key Changes

GGUF Dtype Compatibility (Blackwell):

  • Auto-select compatible dtype for GGUF models on Blackwell GPUs
  • Default GGUF to float16 while preserving bfloat16 option for compatible hardware
  • Disable bfloat16 on Blackwell (SM 120+) via device capability check

GGUF Config Extraction:

  • Extract HF config from GGUF metadata for repos without config.json
  • Extract attn_logit_softcapping from GGUF metadata
  • Use EOS token ID from GGUF metadata instead of HF tokenizer
  • Ensure Gemma2 configs have hidden_act for backward compatibility

Gemma2 GGUF Support:

  • Add quant_config to embedding layer for GGUF support
  • Skip missing parameters during GGUF weight loading
  • Skip lm_head mapping for models with tied word embeddings

Memory & IPC Fixes:

  • Add memory barriers for cross-process shared memory visibility
  • Lazy tokenizer init in StructuredOutputManager to prevent semaphore leak

Model Fixes:

  • Add rotary positional embeddings to Nemotron-H attention layers

Test plan

  • Test GGUF models (gemma-3-1b-it, phi35-mini-gguf) on Blackwell GPU
  • Verify dtype auto-detection works correctly
  • Test Gemma2 GGUF models with tied embeddings
  • Verify shared memory barriers don't cause regressions

Related PRs

🤖 Generated with Claude Code

kitaekatt and others added 17 commits December 10, 2025 15:28
…ager

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 0f27680)
…ore leak

GGUF models without precomputed merges trigger `build_merges_on_the_fly`
in the transformers library, which uses multiprocessing primitives.
When this happens in both the APIServer process (for request validation)
and the EngineCore subprocess (via StructuredOutputManager), the
subprocess leaks a semaphore, causing the server to hang indefinitely.

This change makes tokenizer initialization lazy in StructuredOutputManager:
- Tokenizer is only loaded when grammar_init() is first called
- Most inference requests don't use structured output, so the tokenizer
  in EngineCore is never loaded
- For requests that do use structured output, tokenizer is loaded on-demand

The fix resolves the following symptoms:
- Server hangs after "resource_tracker: There appear to be 1 leaked
  semaphore objects to clean up at shutdown"
- Tokenizer merges being built twice (once in APIServer, once in EngineCore)
- GGUF models failing to start even though weights load successfully

Tested with bartowski/Phi-3.5-mini-instruct-GGUF (Q5_K_M).
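
A minimal sketch of the lazy-init pattern described above; apart from grammar_init(), the names and constructor signature here are illustrative, not vLLM's exact code:

```python
from typing import Any, Callable, Optional


class StructuredOutputManager:
    """Sketch of the lazy tokenizer initialization (assumed signature)."""

    def __init__(self, tokenizer_factory: Callable[[], Any]):
        self._tokenizer_factory = tokenizer_factory
        self._tokenizer: Optional[Any] = None  # not loaded in __init__

    @property
    def tokenizer(self) -> Any:
        if self._tokenizer is None:
            # Only structured-output requests pay the load cost; plain
            # inference requests never build merges in the EngineCore process.
            self._tokenizer = self._tokenizer_factory()
        return self._tokenizer

    def grammar_init(self, request: Any) -> None:
        tok = self.tokenizer  # first call triggers the actual load
        # ... compile the grammar for `request` using `tok` ...
```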

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit a72d1f9)
…bility

GGUF-loaded configs may only have hidden_activation from config.json,
but Gemma2MLP model code expects hidden_act attribute. This adds a
post-processing step to copy hidden_activation to hidden_act when needed.

Fixes AttributeError: 'Gemma2Config' object has no attribute 'hidden_act'
when loading Gemma2 GGUF models.
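
A minimal sketch of that post-processing step, assuming it runs right after the HF config is built (the helper name is made up):

```python
def ensure_gemma2_hidden_act(hf_config) -> None:
    """Copy hidden_activation to hidden_act when only the former is set."""
    # Gemma2MLP reads config.hidden_act, but GGUF-derived configs may only
    # carry hidden_activation.
    if not hasattr(hf_config, "hidden_act") and hasattr(hf_config, "hidden_activation"):
        hf_config.hidden_act = hf_config.hidden_activation
```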

Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit 04ceef5)
For models like Gemma2 that use tie_word_embeddings=True, the lm_head.weight
is initialized from embed_tokens weights rather than loaded separately.
Add lm_head.weight to sideload_params to allow GGUF loading to succeed
without requiring this parameter to be mapped.

Fixes: RuntimeError: Failed to map GGUF parameters (1): ['lm_head.weight']
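
Sketched below under the assumption that sideload_params is a set of parameter names the GGUF mapper is allowed to leave unmapped (the helper name is made up):

```python
def collect_sideload_params(hf_config) -> set:
    """Parameters that need no GGUF tensor mapping."""
    sideload = set()
    if getattr(hf_config, "tie_word_embeddings", False):
        # lm_head.weight is initialized from embed_tokens, so the GGUF file
        # has no separate tensor for it and mapping must not fail on it.
        sideload.add("lm_head.weight")
    return sideload
```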

Signed-off-by: Christina <christina@example.com>
(cherry picked from commit 9512f74)
The GGUF loader yields quantization metadata parameters (qweight_type)
for all quantized tensors, including embeddings. However,
VocabParallelEmbedding doesn't have these parameters, causing a
KeyError when loading GGUF Gemma2 models.

This adds a safety check to skip parameters not present in the model,
matching the pattern already used in llama.py (lines 502-503).

Fixes KeyError: 'embed_tokens.qweight_type' during engine core init.
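
The guard looks roughly like this (a sketch of the llama.py-style pattern; the loader plumbing around it is assumed):

```python
def load_weights(model, weights, default_weight_loader):
    """Skip GGUF-only entries that have no matching model parameter."""
    params_dict = dict(model.named_parameters())
    for name, loaded_weight in weights:
        if name not in params_dict:
            # e.g. 'embed_tokens.qweight_type': quantization metadata emitted
            # by the GGUF loader for a layer that has no such parameter.
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)
```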

(cherry picked from commit 1c144b2)
The Gemma2 model was missing the quant_config parameter in the
VocabParallelEmbedding initialization, causing GGUF quantized
embeddings to be misinterpreted as float values.

Without quant_config, GGUF models use UnquantizedEmbeddingMethod
which calls F.embedding() directly on quantized bytes, resulting
in garbage output during inference.

This is the same bug that was fixed for DeepSeek in commit aa375dc
("Missing quant_config in deepseek embedding layer (vllm-project#12836)").

The fix adds:
- quant_config parameter to enable GGUFEmbeddingMethod selection
- prefix parameter for proper weight mapping

Fixes Gemma2 GGUF models (gemma-2-2b-it-GGUF, etc.) producing garbage
output like: " GHFW側から ThinkmariKeywords!")...
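
The change amounts to passing the two extra arguments when the embedding is constructed; a sketch (the import path matches current vLLM, the surrounding wiring is assumed):

```python
from vllm.model_executor.layers.vocab_parallel_embedding import (
    VocabParallelEmbedding)


def build_embed_tokens(config, quant_config, prefix: str):
    return VocabParallelEmbedding(
        config.vocab_size,
        config.hidden_size,
        # Without quant_config, GGUF falls back to UnquantizedEmbeddingMethod
        # and F.embedding() runs on raw quantized bytes.
        quant_config=quant_config,
        # prefix is needed so GGUF weight names map onto this layer.
        prefix=f"{prefix}.embed_tokens",
    )
```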

(cherry picked from commit d41b71f)
Fixes garbage output from Gemma2 GGUF models by extracting the
attn_logit_softcapping parameter from GGUF metadata and patching
it onto the HuggingFace config.

Root cause: GGUF models store softcap in metadata with arch-specific
keys (e.g., gemma2.attn_logit_softcapping), but this wasn't being
extracted and applied to the HF config. Without softcap, the V1
FlashAttention backend uses softcap=0 (disabled), causing numerical
instability and garbage output.

Changes:
- Add extract_softcap_from_gguf() to read softcap from GGUF metadata
- Update maybe_patch_hf_config_from_gguf() to apply softcap values
- Support both attn_logit_softcapping and final_logit_softcapping

Tested with: google/gemma-2-2b-it:Q4_K_M
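
A sketch of the extraction, assuming the GGUF metadata has already been read into a plain key/value dict (the real extract_softcap_from_gguf() works off the GGUF reader):

```python
def extract_softcap_from_gguf(metadata: dict, arch: str) -> dict:
    """Return any softcap values stored under arch-specific GGUF keys."""
    patch = {}
    for hf_key in ("attn_logit_softcapping", "final_logit_softcapping"):
        gguf_key = f"{arch}.{hf_key}"  # e.g. "gemma2.attn_logit_softcapping"
        if gguf_key in metadata:
            patch[hf_key] = float(metadata[gguf_key])
    return patch
```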

(cherry picked from commit b9e724d)
GGUF dequantization kernels use half precision (fp16) internally via the
`dfloat` typedef. On Blackwell GPUs (sm_120), using bfloat16 causes garbage
output due to dtype mismatch.

Approach taken (middle ground):
- arg_utils.py: Auto-set dtype to float16 when dtype="auto" for GGUF
- gguf.py: Keep bfloat16 in supported_act_dtypes for explicit override

This defaults to safe behavior while preserving user control. Users on
hardware where bfloat16 works can still use --dtype bfloat16 explicitly.

Options considered:
1. Blanket removal of bfloat16 from GGUF - rejected (breaks working configs)
2. Blackwell-specific detection - rejected (maintenance burden, edge cases)
3. Default fp16 + allow explicit bf16 - chosen (simple, safe, preserves choice)

Tested on RTX 5090 (sm_120) with Qwen3-4B-GGUF: 583.8 tok/s
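
The arg_utils.py side of this is essentially the sketch below (this override is dropped again in a later commit in this series per review feedback):

```python
def resolve_gguf_dtype(requested_dtype: str, is_gguf: bool) -> str:
    """Default GGUF to fp16 on dtype='auto', keep explicit overrides."""
    if is_gguf and requested_dtype == "auto":
        # GGUF dequant kernels work in half precision internally (`dfloat`),
        # so fp16 is the safe default.
        return "float16"
    return requested_dtype  # an explicit --dtype bfloat16 still wins
```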

Signed-off-by: Christina <truffle@gmail.com>
…ity check

Instead of removing bfloat16 support globally, use device capability
detection to disable bfloat16 only on SM 120+ devices (Blackwell).

This preserves bfloat16 support on older architectures where tests show
it works correctly, while preventing precision issues on Blackwell.
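
Roughly, the gate in get_supported_act_dtypes() looks like this sketch; has_device_capability() is the check named later in this series (the threshold shown is the corrected value from that follow-up), and the rest of the wiring is assumed:

```python
import torch

from vllm.platforms import current_platform


def get_supported_act_dtypes() -> list:
    dtypes = [torch.half, torch.bfloat16]
    if current_platform.has_device_capability(100):  # SM 10.0+ (Blackwell)
        # bfloat16 mismatches the fp16 `dfloat` used inside the GGUF
        # dequantization kernels on Blackwell, so drop it here.
        dtypes.remove(torch.bfloat16)
    return dtypes
```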

Co-Authored-By: Isotr0py <isotr0py@users.noreply.github.com>
Signed-off-by: Christina <truffle@gmail.com>
Per review feedback: the arg_utils.py dtype override breaks Gemma2 GGUF
which doesn't support FP16. The Blackwell-specific bfloat16 restriction
in gguf.py's get_supported_act_dtypes() is sufficient - let
_resolve_auto_dtype handle dtype selection automatically.

Signed-off-by: Christina <truffle@gmail.com>
Fixes Gemma3 GGUF models failing on Blackwell GPUs with --dtype auto.

Problem:
- Gemma3 blocks float16 (numerical instability)
- GGUF on Blackwell blocks bfloat16 (precision issues)
- Only float32 works, but dtype=auto picks bfloat16 → fails

Changes:
1. gguf.py: Block bfloat16 on SM 120+ (Blackwell) devices
2. vllm.py: Auto-select compatible dtype when model and quantization
   restrictions conflict, instead of failing with an error

This allows --dtype auto to work correctly with Gemma3 GGUF on Blackwell
by automatically falling back to float32.
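
A sketch of the fallback logic in vllm.py (function and set names are illustrative, not the exact code):

```python
import torch


def pick_compatible_dtype(model_allowed: set, quant_allowed: set) -> torch.dtype:
    """Pick a dtype both the model and the quantization method accept."""
    for candidate in (torch.bfloat16, torch.float16, torch.float32):
        if candidate in model_allowed and candidate in quant_allowed:
            return candidate
    raise ValueError("no dtype satisfies both model and quantization limits")


# Gemma3 GGUF on Blackwell: model allows {bf16, fp32}, GGUF allows {fp16, fp32}
# -> the only common dtype is torch.float32, matching the behaviour above.
```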

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Change has_device_capability(120) to (100) for Blackwell SM 10.0
- Update comment and warning message to correctly reference SM 10.0
- Remove redundant torch import (already imported at file top)

Signed-off-by: Christina <truffle@gmail.com>
GGUF files store the correct EOS token ID in tokenizer.ggml.eos_token_id
metadata. However, vLLM was using the HuggingFace tokenizer's eos_token_id,
which can differ from the GGUF value.

This causes generation to not stop properly for models like Gemma 3, where:
- GGUF metadata specifies EOS token ID 106 (<end_of_turn>)
- HF tokenizer reports EOS token ID 1 (<eos>)

The model generates <end_of_turn> to signal completion, but vLLM waits for
token ID 1 which never comes, resulting in repeated EOS tokens until
max_tokens is reached.

Changes:
- Add extract_eos_token_id_from_gguf() in gguf_utils.py to read EOS from GGUF
- Patch tokenizer.eos_token_id in hf.py when loading GGUF tokenizers
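
A sketch of the two pieces, again assuming the GGUF metadata is available as a plain dict rather than via the GGUF reader:

```python
def extract_eos_token_id_from_gguf(metadata: dict):
    """Read tokenizer.ggml.eos_token_id from GGUF metadata, if present."""
    value = metadata.get("tokenizer.ggml.eos_token_id")
    return int(value) if value is not None else None


def patch_tokenizer_eos(tokenizer, metadata: dict) -> None:
    # Prefer the GGUF value (e.g. 106, <end_of_turn> for Gemma 3) over the
    # HF tokenizer default (e.g. 1, <eos>) so generation stops correctly.
    eos_id = extract_eos_token_id_from_gguf(metadata)
    if eos_id is not None and tokenizer.eos_token_id != eos_id:
        tokenizer.eos_token_id = eos_id
```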

Signed-off-by: Christina Zhu <christina.zhu@hotmail.com>
Signed-off-by: Christina <truffle@gmail.com>
(cherry picked from commit d8cf5b7)
…nfig.json

This enables vLLM to load GGUF models from repositories that don't include
config.json (e.g., bartowski repos) by extracting the configuration values
directly from GGUF metadata.

Changes:
- Add GGUF_ARCH_TO_HF_MODEL_TYPE mapping for architecture name translation
- Add extract_hf_config_from_gguf() function that reads GGUF metadata and
  constructs a HuggingFace-compatible config dictionary
- Add GGUFConfigParser class that uses the extraction function
- Register "gguf" format in the config parser system
- Update auto-detection to use GGUF parser for local GGUF files without
  config.json

Extracted metadata fields:
- Architecture (model_type)
- hidden_size, intermediate_size, num_hidden_layers
- num_attention_heads, num_key_value_heads
- max_position_embeddings, rope_theta, rms_norm_eps
- sliding_window, vocab_size
- bos_token_id, eos_token_id
- attn_logit_softcapping, final_logit_softcapping (for Gemma2)
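
A sketch of the extraction over a plain metadata dict; the GGUF key names below follow the usual GGUF conventions, and the architecture mapping table is abbreviated:

```python
GGUF_ARCH_TO_HF_MODEL_TYPE = {"llama": "llama", "gemma2": "gemma2", "phi3": "phi3"}


def extract_hf_config_from_gguf(metadata: dict) -> dict:
    arch = metadata["general.architecture"]
    config = {
        "model_type": GGUF_ARCH_TO_HF_MODEL_TYPE.get(arch, arch),
        "hidden_size": metadata.get(f"{arch}.embedding_length"),
        "intermediate_size": metadata.get(f"{arch}.feed_forward_length"),
        "num_hidden_layers": metadata.get(f"{arch}.block_count"),
        "num_attention_heads": metadata.get(f"{arch}.attention.head_count"),
        "num_key_value_heads": metadata.get(f"{arch}.attention.head_count_kv"),
        "max_position_embeddings": metadata.get(f"{arch}.context_length"),
        "rope_theta": metadata.get(f"{arch}.rope.freq_base"),
        "rms_norm_eps": metadata.get(f"{arch}.attention.layer_norm_rms_epsilon"),
        "vocab_size": metadata.get(f"{arch}.vocab_size"),
        "bos_token_id": metadata.get("tokenizer.ggml.bos_token_id"),
        "eos_token_id": metadata.get("tokenizer.ggml.eos_token_id"),
    }
    # Drop fields absent from this particular GGUF file.
    return {k: v for k, v in config.items() if v is not None}
```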

Signed-off-by: Christina <truffle@gmail.com>
The shared memory ring buffer protocol in shm_broadcast.py uses plain byte
writes to signal between writer and reader processes. On multi-core systems,
these writes may stay in CPU store buffers and not be visible to other
processes running on different cores, causing indefinite spinning/freeze
under sustained concurrent load.

This patch adds explicit memory barriers using threading.Lock acquire/release
(which provides full barrier semantics per POSIX.1-2008) at four critical
points:
- In acquire_write(): before reading flags and after setting written flag
- In acquire_read(): before reading flags and after setting read flag

The memory barrier ensures that:
1. All stores before the barrier are globally visible
2. All loads after the barrier see the latest values

Fixes freeze observed during sustained concurrent batch inference (~500+
requests) where both writer and readers would spin indefinitely waiting
for flags that were updated but not visible across CPU cores.
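
A sketch of the barrier helper and where it sits, assuming a module-level private lock (the real placement is inside shm_broadcast.py's acquire_write()/acquire_read()):

```python
import threading

_barrier_lock = threading.Lock()


def memory_barrier() -> None:
    # Acquire/release of an uncontended lock acts as a full memory barrier
    # (POSIX.1-2008), flushing store buffers and discarding stale loads.
    _barrier_lock.acquire()
    _barrier_lock.release()


# Inside acquire_write() / acquire_read():
#   memory_barrier()        # before reading the written/read flag bytes
#   ... set the flag byte in shared memory ...
#   memory_barrier()        # after setting it, so other processes see it
```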

Signed-off-by: Christina Holland <hey@christinaholland.com>
Signed-off-by: Christina <truffle@gmail.com>
…ager

Signed-off-by: Christina <truffle@gmail.com>
@gemini-code-assist bot left a comment

Code Review

This pull request introduces a comprehensive set of fixes and improvements for GGUF model support, with a particular focus on compatibility with new Blackwell GPUs. The changes address several key areas including data type compatibility, configuration extraction from GGUF metadata, tokenizer handling, and memory synchronization for inter-process communication. The implementation of memory fences for shared memory is a critical fix for ensuring correctness in a multi-process environment. Additionally, the lazy initialization of the tokenizer effectively resolves a resource leak issue. The new GGUF config parser enhances usability by allowing models without a config.json to be loaded seamlessly. The model-specific adjustments for Gemma2 and Nemotron-H are also well-implemented. Overall, this is a high-quality pull request with significant and well-executed improvements.

mergify bot commented Dec 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @kitaekatt.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

mergify bot added the needs-rebase label Dec 12, 2025
@kitaekatt
Contributor Author

Closing this bundle PR. All fixes have their own focused PRs: #30409, #30410, #30411, #30412, #30413, #30424, #30427, #30434, #30500, #30699. PRs #30407 and #30408 are already merged.
