
[Do not merge] Add out of place layernorm #20197


Open

charlifu wants to merge 13 commits into main

Conversation

charlifu
Contributor

No description provided.

charlifu added 8 commits June 19, 2025 21:46
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Copy link

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only fastcheck CI runs, covering a small and essential subset of tests to catch errors quickly. You can run additional CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

Contributor

@gemini-code-assist bot left a comment


Summary of Changes

Hello @charlifu, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant change to the fused_add_rms_norm and related layer normalization kernels by transitioning them from in-place operations to out-of-place operations. This modification enhances flexibility and potentially improves memory management by allowing the normalized output and updated residual to be written to distinct, pre-allocated tensors. The changes span the C++/CUDA kernels, Python bindings, and the compilation pipeline, ensuring the new API is consistently applied and tested. The [Do not merge] prefix suggests this is an exploratory or work-in-progress change.

Highlights

  • API Change for Fused LayerNorm Kernels: Modified the fused_add_rms_norm and fused_add_rms_norm_static_fp8_quant operations to support out-of-place computation. This involves introducing new output tensors for the normalized result (out) and the updated residual (residual_out), rather than modifying the input tensors in place (see the sketch after this list).
  • Kernel Signature Updates: Updated the C++/CUDA kernel signatures (fused_add_rms_norm_kernel, fused_add_rms_norm_static_fp8_quant_kernel, rms_norm_dynamic_per_token_quant_kernel) to accept the new out and residual_out pointers and mark original inputs as const.
  • Python Binding and Wrapper Updates: The Python _custom_ops.py and model_executor/layers/layernorm.py wrappers, along with torch_bindings.cpp, have been updated to reflect the new out-of-place API, passing and returning the new output tensors.
  • Test Adjustments: Corresponding test cases in test_layernorm.py were adjusted to allocate and pass the new output tensors, ensuring correctness with the out-of-place behavior.
  • Functionalization and Fusion Graph Updates: Changes were made in fix_functionalization.py and fusion.py to correctly handle the new output tensors in the graph compilation and optimization passes.
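
Below is a minimal sketch of what the out-of-place semantics mean, written against plain PyTorch rather than vLLM's custom op (the function name and the reference math are illustrative only; numerical details such as upcasting are omitted):

import torch

def fused_add_rms_norm_out_of_place(
    input: torch.Tensor,     # [..., hidden_size]
    residual: torch.Tensor,  # [..., hidden_size]
    weight: torch.Tensor,    # [hidden_size]
    epsilon: float,
):
    # Out-of-place: results land in freshly allocated tensors,
    # leaving `input` and `residual` untouched.
    residual_out = input + residual
    variance = residual_out.pow(2).mean(dim=-1, keepdim=True)
    out = residual_out * torch.rsqrt(variance + epsilon) * weight
    return out, residual_out

The previous in-place variant instead overwrote residual with input + residual and wrote the normalized result back into input; the new out and residual_out arguments are what let callers avoid aliasing the inputs.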
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e., a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in issue comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Contributor

@gemini-code-assist bot left a comment


Code Review

This pull request refactors the fused RMSNorm kernels to be out-of-place. I've found a few critical issues that need to be addressed:

  • There are bugs in the test suite where arguments to a custom op are swapped, which could lead to incorrect test results.
  • A TypeError will occur in the layernorm layer due to incorrect usage of torch.empty().

Additionally, there are several places where const correctness could be improved in the C++ and CUDA code, and a potentially impactful line of code has been commented out in the torch.compile path. Addressing these points will greatly improve the quality and robustness of the PR.
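
Both Python-side issues are easy to guard against when output buffers are allocated explicitly and passed unambiguously. A small illustrative sketch (the shapes, the device selection, and the commented-out op call are assumptions for illustration, not the PR's actual test code):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
hidden = torch.randn(4, 8, device=device)
residual = torch.randn_like(hidden)
weight = torch.ones(8, dtype=hidden.dtype, device=device)

# torch.empty() requires an explicit size (plus dtype/device keyword arguments);
# torch.empty_like() is the usual way to allocate a buffer matching an existing tensor.
out = torch.empty_like(hidden)
residual_out = torch.empty(hidden.shape, dtype=hidden.dtype, device=device)

# Keeping the output buffers distinct and, where the op schema allows, passing
# them by keyword makes swapped-argument bugs like the one flagged in the tests
# much harder to introduce, e.g. (hypothetical binding and argument names):
# torch.ops._C.fused_add_rms_norm(out=out, input=hidden,
#                                 residual_out=residual_out, residual=residual,
#                                 weight=weight, epsilon=1e-6)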

Comment on lines +180 to 185
void fused_add_rms_norm(torch::Tensor& out,           // [..., hidden_size]
                        torch::Tensor& input,         // [..., hidden_size]
                        torch::Tensor& residual_out,  // [..., hidden_size]
                        torch::Tensor& residual,      // [..., hidden_size]
                        torch::Tensor& weight,        // [hidden_size]
                        double epsilon) {

medium

The input, residual, and weight tensors are not modified within this function; they are read-only. For const-correctness, they should be passed as const torch::Tensor&.

void fused_add_rms_norm(torch::Tensor& out,           // [..., hidden_size]
                        const torch::Tensor& input,         // [..., hidden_size]
                        torch::Tensor& residual_out,  // [..., hidden_size]
                        const torch::Tensor& residual,      // [..., hidden_size]
                        const torch::Tensor& weight,        // [hidden_size]

Comment on lines 67 to 69
    scalar_t* __restrict__ input,         // [..., hidden_size]
    scalar_t* __restrict__ residual_out,  // [..., hidden_size]
    scalar_t* __restrict__ residual,      // [..., hidden_size]

medium

The input and residual pointers are only used for reading in this kernel. They should be declared as const scalar_t* __restrict__ to enforce const-correctness.

    const scalar_t* __restrict__ input,         // [..., hidden_size]
    scalar_t* __restrict__ residual_out,  // [..., hidden_size]
    const scalar_t* __restrict__ residual,      // [..., hidden_size]

Comment on lines 211 to 218
void fused_add_rms_norm_static_fp8_quant(
-    torch::Tensor& out,       // [..., hidden_size],
-    torch::Tensor& input,     // [..., hidden_size]
-    torch::Tensor& residual,  // [..., hidden_size]
-    torch::Tensor& weight,    // [hidden_size]
-    torch::Tensor& scale,     // [1]
+    torch::Tensor& out,           // [..., hidden_size],
+    torch::Tensor& input,         // [..., hidden_size]
+    torch::Tensor& residual_out,  // [..., hidden_size]
+    torch::Tensor& residual,      // [..., hidden_size]
+    torch::Tensor& weight,        // [hidden_size]
+    torch::Tensor& scale,         // [1]
     double epsilon) {

medium

The input, residual, weight, and scale tensors are read-only in this function. They should be passed as const torch::Tensor& to ensure they are not modified and to improve code clarity.

void fused_add_rms_norm_static_fp8_quant(
    torch::Tensor& out,           // [..., hidden_size],
    const torch::Tensor& input,         // [..., hidden_size]
    torch::Tensor& residual_out,  // [..., hidden_size]
    const torch::Tensor& residual,      // [..., hidden_size]
    const torch::Tensor& weight,        // [hidden_size]
    const torch::Tensor& scale,         // [1]
    double epsilon) {

Comment on lines +118 to 119
std::optional<torch::Tensor> residual_out,
std::optional<torch::Tensor> residual);

medium

The residual tensor is read-only. It should be passed by const& to avoid an unnecessary copy of the std::optional and to signal its read-only nature.

Suggested change
-    std::optional<torch::Tensor> residual_out,
-    std::optional<torch::Tensor> residual);
+    std::optional<torch::Tensor> residual_out,
+    const std::optional<torch::Tensor>& residual);

Comment on lines +18 to 19
    scalar_t* __restrict__ residual_out = nullptr,
    scalar_t* __restrict__ residual = nullptr) {

medium

The residual pointer is only read from in this function. For const-correctness, it should be declared as const scalar_t* __restrict__.

    scalar_t* __restrict__ residual_out = nullptr,
    const scalar_t* __restrict__ residual = nullptr) {

Comment on lines +103 to 104
                               scalar_t* __restrict__ residual_out = nullptr,
                               scalar_t* __restrict__ residual = nullptr) {

medium

The residual pointer is only read from in this function. It should be declared as const scalar_t* __restrict__ to enforce const-correctness.

                               scalar_t* __restrict__ residual_out = nullptr,
                               const scalar_t* __restrict__ residual = nullptr) {

charlifu added 4 commits June 27, 2025 20:18
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
Signed-off-by: charlifu <charlifu@amd.com>
@mergify mergify bot added the performance Performance-related issues label Jun 27, 2025
Signed-off-by: charlifu <charlifu@amd.com>
Labels
performance Performance-related issues
Projects
None yet
1 participant