fix(nemotron_h): Add missing rotary positional embeddings to attention layers #30413
base: main
Conversation
Code Review
This pull request correctly addresses a critical issue where rotary positional embeddings were missing in the NemotronHAttention layers, which caused inference failures. The changes to add rotary_emb initialization and apply it in the forward pass are well-implemented and follow existing patterns in vLLM. My review includes one suggestion to improve the robustness of the RoPE initialization by explicitly using the model's data type, which will prevent potential performance issues or subtle bugs related to dtype mismatches.
```python
    rotary_dim=self.head_dim,
    max_position=max_position_embeddings,
    is_neox_style=True,
    dtype=torch.get_default_dtype(),
```
To ensure the rotary embeddings are created with the correct data type that matches the rest of the model, it's better to explicitly use model_config.dtype. Using torch.get_default_dtype() can be unreliable as it might not reflect the model's actual dtype (e.g., bfloat16 or float16), potentially leading to performance degradation from runtime casting or even subtle correctness issues. A fallback to torch.get_default_dtype() is reasonable for cases where model_config might be None, such as in some testing scenarios.
```diff
-    dtype=torch.get_default_dtype(),
+    dtype=model_config.dtype if model_config else torch.get_default_dtype(),
```
Good catch! Applied this suggestion in 15ad5bd. Using model_config.dtype ensures the rotary embeddings match the model's actual dtype while maintaining the fallback for testing scenarios.
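For reference, a minimal standalone sketch of the initialization with this suggestion applied; the numeric values and the `base` argument are placeholders rather than the actual NemotronH configuration:

```python
# Minimal sketch of the dtype-aware RoPE initialization; numeric values and
# `base=` are illustrative assumptions, not the real NemotronH config.
import torch
from vllm.model_executor.layers.rotary_embedding import get_rope

head_dim = 128                   # hypothetical head size
max_position_embeddings = 4096   # hypothetical context length
rope_theta = 10000               # hypothetical RoPE base
model_config = None              # e.g. unavailable in some testing scenarios

rotary_emb = get_rope(
    head_dim,
    rotary_dim=head_dim,
    max_position=max_position_embeddings,
    base=rope_theta,
    is_neox_style=True,
    # Prefer the model's configured dtype; fall back to the process default
    # when model_config is None.
    dtype=model_config.dtype if model_config else torch.get_default_dtype(),
)
```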
fix(nemotron_h): Add missing rotary positional embeddings to attention layers

The NemotronHAttention class was missing rotary positional embeddings (RoPE), causing token generation to fail despite successful model loading.

Root cause:
- NemotronHAttention.__init__() had no rotary_emb initialization
- forward() did not accept a positions parameter or apply RoPE to Q, K
- NemotronHAttentionDecoderLayer.forward() did not pass positions to mixer

This fix:
1. Imports get_rope from vllm.model_executor.layers.rotary_embedding
2. Adds rotary_emb initialization in NemotronHAttention.__init__()
3. Updates forward() to accept positions and apply q, k = rotary_emb(positions, q, k)
4. Updates NemotronHAttentionDecoderLayer to pass positions to mixer
5. Adds a rotary_emb.inv_freq filter in load_weights to skip computed weights

Without RoPE, attention layers operate without positional information, producing corrupted attention scores and preventing coherent token generation.

Fixes inference failure for nvidia/NVIDIA-Nemotron-Nano-9B-v2 and similar nemotron_h architecture models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
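To make step 3 concrete, here is a small self-contained sketch of rotating Q and K by position before attention, following the `q, k = rotary_emb(positions, q, k)` call pattern quoted above; the shapes and RoPE parameters are arbitrary:

```python
# Standalone sketch of applying RoPE to Q and K before attention (step 3 above).
# Shapes are arbitrary; q/k are flattened to [num_tokens, num_heads * head_dim]
# as vLLM's rotary embedding layer expects.
import torch
from vllm.model_executor.layers.rotary_embedding import get_rope

num_tokens, num_heads, head_dim = 8, 4, 64
rotary_emb = get_rope(
    head_dim,
    rotary_dim=head_dim,
    max_position=4096,
    base=10000,
    is_neox_style=True,
    dtype=torch.get_default_dtype(),
)

positions = torch.arange(num_tokens)
q = torch.randn(num_tokens, num_heads * head_dim)
k = torch.randn(num_tokens, num_heads * head_dim)

# Without this call the attention scores carry no positional information,
# which is the failure mode this commit fixes.
q, k = rotary_emb(positions, q, k)
```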
Address review feedback: Use model_config.dtype instead of torch.get_default_dtype() to ensure rotary embeddings match the model's actual dtype. Fall back to default dtype when model_config is None (testing scenarios).

Signed-off-by: Christina <truffle@gmail.com>
Force-pushed from 15ad5bd to 5be74d3.
Summary

Fixes inference failure for `nvidia/NVIDIA-Nemotron-Nano-9B-v2` and similar nemotron_h architecture models.

Problem

The `NemotronHAttention` class was missing rotary positional embeddings (RoPE), causing corrupted attention scores.

Changes
- Adds `rotary_emb` initialization in `NemotronHAttention.__init__()`
- Updates `forward()` to accept positions and apply RoPE
- Updates `NemotronHAttentionDecoderLayer` to pass positions to the mixer (see the sketch below)
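A compact sketch of the last item in the list above, with assumed names (`DecoderLayerSketch` is hypothetical; the real decoder layer also handles normalization and residuals differently):

```python
# Hypothetical sketch of threading positions from a decoder layer into its
# attention mixer; class and attribute names are placeholders, not the exact
# NemotronHAttentionDecoderLayer implementation.
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    def __init__(self, mixer: nn.Module, norm: nn.Module):
        super().__init__()
        self.mixer = mixer
        self.norm = norm

    def forward(self, positions: torch.Tensor,
                hidden_states: torch.Tensor) -> torch.Tensor:
        # The fix: forward positions so the mixer can apply RoPE to Q and K.
        return hidden_states + self.mixer(positions, self.norm(hidden_states))
```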
Testing

Tested with `nvidia/NVIDIA-Nemotron-Nano-9B-v2`; the model generates coherent output.
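A smoke test of this kind can be reproduced with vLLM's offline API; only the model name is taken from this PR, while the prompt and sampling settings below are illustrative:

```python
# Illustrative smoke test: load the model offline and check that a short
# completion is coherent. Prompt and sampling settings are arbitrary choices.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/NVIDIA-Nemotron-Nano-9B-v2", trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["The capital of France is"], sampling)
print(outputs[0].outputs[0].text)
```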