
Conversation

@kitaekatt (Contributor)

Summary

Fixes inference failure for nvidia/NVIDIA-Nemotron-Nano-9B-v2 and similar nemotron_h architecture models.

Problem

The NemotronHAttention class was missing rotary positional embeddings (RoPE), causing corrupted attention scores.

Changes

  1. Add rotary_emb initialization in NemotronHAttention.__init__()
  2. Update forward() to accept positions and apply RoPE
  3. Update NemotronHAttentionDecoderLayer to pass positions to the mixer (see the sketch below)
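
A rough sketch of how these pieces fit together is below (not the literal diff). It uses vLLM's get_rope helper as the PR does, but abbreviates NemotronHAttention to a toy module: the projection layers, attention backend, constructor signature, and values such as max_position and the base of 10000 are illustrative assumptions.

import torch
from torch import nn
from vllm.model_executor.layers.rotary_embedding import get_rope

class RoPEAttentionSketch(nn.Module):
    """Toy stand-in for NemotronHAttention showing where the fix slots in."""

    def __init__(self, head_dim: int, max_position: int):
        super().__init__()
        # Change 1: build the rotary embedding once in __init__();
        # previously NemotronHAttention had no rotary_emb at all.
        self.rotary_emb = get_rope(
            head_size=head_dim,
            rotary_dim=head_dim,
            max_position=max_position,
            base=10000,
            is_neox_style=True,
        )

    def forward(self, positions: torch.Tensor, q: torch.Tensor, k: torch.Tensor):
        # Change 2: forward() now receives positions and rotates q/k before
        # they reach the attention backend. Change 3 is the decoder layer
        # threading positions down to this call.
        return self.rotary_emb(positions, q, k)

# Shape check: 8 tokens, 4 heads of size 128 (q/k flattened per vLLM convention).
layer = RoPEAttentionSketch(head_dim=128, max_position=4096)
positions = torch.arange(8)
q, k = torch.randn(8, 4 * 128), torch.randn(8, 4 * 128)
q, k = layer(positions, q, k)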

Testing

Tested with nvidia/NVIDIA-Nemotron-Nano-9B-v2; the model now generates coherent output.
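
As a point of reference, a smoke test along these lines can be run offline with vLLM (this snippet is illustrative, not from the PR; engine arguments such as dtype, max_model_len, or trust_remote_code may need adjusting for this model and the available hardware):

from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/NVIDIA-Nemotron-Nano-9B-v2", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=64)
outputs = llm.generate(["Explain rotary positional embeddings in one sentence."], params)
# Before the fix this printed incoherent text; with RoPE applied it should read normally.
print(outputs[0].outputs[0].text)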


@gemini-code-assist (bot) left a comment


Code Review

This pull request correctly addresses a critical issue where rotary positional embeddings were missing in the NemotronHAttention layers, which caused inference failures. The changes to add rotary_emb initialization and apply it in the forward pass are well-implemented and follow existing patterns in vLLM. My review includes one suggestion to improve the robustness of the RoPE initialization by explicitly using the model's data type, which will prevent potential performance issues or subtle bugs related to dtype mismatches.

Referenced lines (from the new get_rope(...) call in NemotronHAttention.__init__()):

rotary_dim=self.head_dim,
max_position=max_position_embeddings,
is_neox_style=True,
dtype=torch.get_default_dtype(),

Severity: high

To ensure the rotary embeddings are created with the correct data type that matches the rest of the model, it's better to explicitly use model_config.dtype. Using torch.get_default_dtype() can be unreliable as it might not reflect the model's actual dtype (e.g., bfloat16 or float16), potentially leading to performance degradation from runtime casting or even subtle correctness issues. A fallback to torch.get_default_dtype() is reasonable for cases where model_config might be None, such as in some testing scenarios.

Suggested change:
- dtype=torch.get_default_dtype(),
+ dtype=model_config.dtype if model_config else torch.get_default_dtype(),

@kitaekatt (Contributor, Author)

Good catch! Applied this suggestion in 15ad5bd. Using model_config.dtype ensures the rotary embeddings match the model's actual dtype while maintaining the fallback for testing scenarios.

kitaekatt and others added 2 commits December 10, 2025 15:44
…n layers

The NemotronHAttention class was missing rotary positional embeddings (RoPE),
causing token generation to fail despite successful model loading.

Root cause:
- NemotronHAttention.__init__() had no rotary_emb initialization
- forward() did not accept positions parameter or apply RoPE to Q, K
- NemotronHAttentionDecoderLayer.forward() did not pass positions to mixer

This fix:
1. Imports get_rope from vllm.model_executor.layers.rotary_embedding
2. Adds rotary_emb initialization in NemotronHAttention.__init__()
3. Updates forward() to accept positions and apply q, k = rotary_emb(positions, q, k)
4. Updates NemotronHAttentionDecoderLayer to pass positions to mixer
5. Adds a rotary_emb.inv_freq filter in load_weights to skip computed weights (sketched below)
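
Point 5 follows the usual vLLM convention of skipping inv_freq buffers during checkpoint loading, since get_rope recomputes them at init time. A heavily abbreviated method sketch (the real load_weights handles more cases, e.g. stacked projections):

from vllm.model_executor.model_loader.weight_utils import default_weight_loader

def load_weights(self, weights):
    params_dict = dict(self.named_parameters())
    for name, loaded_weight in weights:
        # inv_freq is a derived buffer that get_rope recreates at init
        # time, so any copy present in the checkpoint can be skipped.
        if "rotary_emb.inv_freq" in name:
            continue
        param = params_dict[name]
        weight_loader = getattr(param, "weight_loader", default_weight_loader)
        weight_loader(param, loaded_weight)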

Without RoPE, attention layers operate without positional information, producing
corrupted attention scores and preventing coherent token generation.

Fixes inference failure for nvidia/NVIDIA-Nemotron-Nano-9B-v2 and similar
nemotron_h architecture models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Address review feedback: Use model_config.dtype instead of
torch.get_default_dtype() to ensure rotary embeddings match
the model's actual dtype. Fall back to default dtype when
model_config is None (testing scenarios).

Signed-off-by: Christina <truffle@gmail.com>