fix(nemotron_h): Add missing rotary positional embeddings to attention layers #30413
base: main
Conversation
Code Review
This pull request correctly addresses a critical issue where rotary positional embeddings were missing in the NemotronHAttention layers, which caused inference failures. The changes to add rotary_emb initialization and apply it in the forward pass are well-implemented and follow existing patterns in vLLM. My review includes one suggestion to improve the robustness of the RoPE initialization by explicitly using the model's data type, which will prevent potential performance issues or subtle bugs related to dtype mismatches.
```python
    rotary_dim=self.head_dim,
    max_position=max_position_embeddings,
    is_neox_style=True,
    dtype=torch.get_default_dtype(),
```
To ensure the rotary embeddings are created with the correct data type that matches the rest of the model, it's better to explicitly use model_config.dtype. Using torch.get_default_dtype() can be unreliable as it might not reflect the model's actual dtype (e.g., bfloat16 or float16), potentially leading to performance degradation from runtime casting or even subtle correctness issues. A fallback to torch.get_default_dtype() is reasonable for cases where model_config might be None, such as in some testing scenarios.
```diff
-    dtype=torch.get_default_dtype(),
+    dtype=model_config.dtype if model_config else torch.get_default_dtype(),
```
Good catch! Applied this suggestion in 15ad5bd. Using model_config.dtype ensures the rotary embeddings match the model's actual dtype while maintaining the fallback for testing scenarios.
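For reference, a minimal standalone sketch of the initialization with this suggestion applied; the numeric values and the `base` argument are placeholders rather than the actual NemotronH configuration:

```python
# Minimal sketch of the dtype-aware RoPE initialization; numeric values and
# `base=` are illustrative assumptions, not the real NemotronH config.
import torch
from vllm.model_executor.layers.rotary_embedding import get_rope

head_dim = 128                   # hypothetical head size
max_position_embeddings = 4096   # hypothetical context length
rope_theta = 10000               # hypothetical RoPE base
model_config = None              # e.g. unavailable in some testing scenarios

rotary_emb = get_rope(
    head_dim,
    rotary_dim=head_dim,
    max_position=max_position_embeddings,
    base=rope_theta,
    is_neox_style=True,
    # Prefer the model's configured dtype; fall back to the process default
    # when model_config is None.
    dtype=model_config.dtype if model_config else torch.get_default_dtype(),
)
```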
fix(nemotron_h): Add missing rotary positional embeddings to attention layers

The NemotronHAttention class was missing rotary positional embeddings (RoPE), causing token generation to fail despite successful model loading.

Root cause:
- NemotronHAttention.__init__() had no rotary_emb initialization
- forward() did not accept a positions parameter or apply RoPE to Q, K
- NemotronHAttentionDecoderLayer.forward() did not pass positions to mixer

This fix:
1. Imports get_rope from vllm.model_executor.layers.rotary_embedding
2. Adds rotary_emb initialization in NemotronHAttention.__init__()
3. Updates forward() to accept positions and apply q, k = rotary_emb(positions, q, k)
4. Updates NemotronHAttentionDecoderLayer to pass positions to mixer
5. Adds a rotary_emb.inv_freq filter in load_weights to skip computed weights

Without RoPE, attention layers operate without positional information, producing corrupted attention scores and preventing coherent token generation.

Fixes inference failure for nvidia/NVIDIA-Nemotron-Nano-9B-v2 and similar nemotron_h architecture models.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
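To make step 3 concrete, here is a small self-contained sketch of rotating Q and K by position before attention, following the `q, k = rotary_emb(positions, q, k)` call pattern quoted above; the shapes and RoPE parameters are arbitrary:

```python
# Standalone sketch of applying RoPE to Q and K before attention (step 3 above).
# Shapes are arbitrary; q/k are flattened to [num_tokens, num_heads * head_dim]
# as vLLM's rotary embedding layer expects.
import torch
from vllm.model_executor.layers.rotary_embedding import get_rope

num_tokens, num_heads, head_dim = 8, 4, 64
rotary_emb = get_rope(
    head_dim,
    rotary_dim=head_dim,
    max_position=4096,
    base=10000,
    is_neox_style=True,
    dtype=torch.get_default_dtype(),
)

positions = torch.arange(num_tokens)
q = torch.randn(num_tokens, num_heads * head_dim)
k = torch.randn(num_tokens, num_heads * head_dim)

# Without this call the attention scores carry no positional information,
# which is the failure mode this commit fixes.
q, k = rotary_emb(positions, q, k)
```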
Address review feedback: Use model_config.dtype instead of torch.get_default_dtype() to ensure rotary embeddings match the model's actual dtype. Fall back to default dtype when model_config is None (testing scenarios).

Signed-off-by: Christina <truffle@gmail.com>
Force-pushed from 15ad5bd to 5be74d3.
Summary

Fixes inference failure for `nvidia/NVIDIA-Nemotron-Nano-9B-v2` and similar nemotron_h architecture models.

Problem

The `NemotronHAttention` class was missing rotary positional embeddings (RoPE), causing corrupted attention scores.

Changes
- Adds `rotary_emb` initialization in `NemotronHAttention.__init__()`
- Updates `forward()` to accept positions and apply RoPE
- Updates `NemotronHAttentionDecoderLayer` to pass positions to the mixer (see the sketch below)
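A compact sketch of the last item in the list above, with assumed names (`DecoderLayerSketch` is hypothetical; the real decoder layer also handles normalization and residuals differently):

```python
# Hypothetical sketch of threading positions from a decoder layer into its
# attention mixer; class and attribute names are placeholders, not the exact
# NemotronHAttentionDecoderLayer implementation.
import torch
import torch.nn as nn

class DecoderLayerSketch(nn.Module):
    def __init__(self, mixer: nn.Module, norm: nn.Module):
        super().__init__()
        self.mixer = mixer
        self.norm = norm

    def forward(self, positions: torch.Tensor,
                hidden_states: torch.Tensor) -> torch.Tensor:
        # The fix: forward positions so the mixer can apply RoPE to Q and K.
        return hidden_states + self.mixer(positions, self.norm(hidden_states))
```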
Testing

Tested with `nvidia/NVIDIA-Nemotron-Nano-9B-v2`; the model generates coherent output.
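A smoke test of this kind can be reproduced with vLLM's offline API; only the model name is taken from this PR, while the prompt and sampling settings below are illustrative:

```python
# Illustrative smoke test: load the model offline and check that a short
# completion is coherent. Prompt and sampling settings are arbitrary choices.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/NVIDIA-Nemotron-Nano-9B-v2", trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=64)

outputs = llm.generate(["The capital of France is"], sampling)
print(outputs[0].outputs[0].text)
```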