[NPU]: NPU-optimized fused_add_rms_norm kernel#1070
Merged
Conversation
a3629c9 to
1795973
Compare
1795973 to
425ccf4
Compare
Contributor
Author
|
@Tcc0403 Sorry, I haven't submitted some performance optimization changes in time. I'm creating a new PR instead. Or should I cancel this merge and let me update the modifications? |
3 tasks
github-merge-queue Bot
pushed a commit
that referenced
this pull request
Feb 25, 2026
…1100) ## Summary <!--- This is a required section; please describe the main purpose of this proposed code change. ---> Based on #1070 Because the original kernel uses n_cols as BLOCK_SIZE, and n_cols is smaller in the test, the test can pass normally. However, in the benchmark, n_cols is larger, and when running on the NPU, an ub overflow occurs. Therefore, for each row, we process it in chunks of BLOCK_SIZE. Maintain high performance even when using a smaller hidden size in most models, and also ensure support in cases where a larger hidden size is used. <!--- ## Details This is an optional section; is there anything specific that reviewers should be aware of? ---> ## Testing Done <!--- This is a required section; please describe how this change was tested. ---> <img width="1564" height="439" alt="image" src="https://github.com/user-attachments/assets/9de1c501-db2f-4dc1-9808-f3bf6e5abd75" /> <!-- Replace BLANK with your device type. For example, A100-80G-PCIe Complete the following tasks before sending your PR, and replace `[ ]` with `[x]` to indicate you have done them. --> - Hardware Type: Atlas 800I A2 - [x] run `make test` to ensure correctness - [x] run `make checkstyle` to ensure code style - [ ] run `make test-convergence` to ensure convergence
41 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Testing Done
make testto ensure correctnessmake checkstyleto ensure code stylemake test-convergenceto ensure convergence