[NPU]: NPU-optimized fused_add_rms_norm kernel#1070

Merged
Tcc0403 merged 1 commit into
linkedin:mainfrom
TianHao324:add_rms_npu
Feb 12, 2026

Conversation

@TianHao324

Contributor

Summary

  1. The original kernel uses n_cols as BLOCK_SIZE. The tests pass because n_cols is small there, but the benchmark uses a larger n_cols, which causes a UB (Unified Buffer) overflow on the NPU. Each row is therefore now processed in chunks of BLOCK_SIZE.
  2. The grid size is capped at the NPU core count to avoid resource overflow.
  3. Each program handles multiple rows.
  4. Because of NPU device limitations, there is still room for performance improvement. This change first converts the kernel to an NPU-supported form; even so, performance has improved significantly compared with the previous GPU-oriented form.
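
To make items 1 to 3 concrete, here is a minimal pure-Python sketch of the scheduling idea. It is not the Triton kernel from this PR. The function name `fused_add_rms_norm_chunked`, the `num_cores` parameter, the default values, and the two-pass layout are all assumptions chosen for illustration. Each row is walked in chunks of `block_size`, the number of simulated programs is capped at `num_cores`, and every program strides over several rows.

```python
import math


def fused_add_rms_norm_chunked(x, residual, weight, eps=1e-6,
                               block_size=4, num_cores=2):
    """Illustrative reference only (hypothetical helper, not the PR's kernel).

    Returns (y, s) where s = x + residual and y = s / rms(s) * weight,
    computed row by row in column chunks of `block_size`.
    """
    n_rows, n_cols = len(x), len(weight)
    s_out = [[0.0] * n_cols for _ in range(n_rows)]
    y_out = [[0.0] * n_cols for _ in range(n_rows)]

    # Cap the grid at the (assumed) core count instead of one program per row.
    grid = min(num_cores, n_rows)

    for pid in range(grid):  # each iteration stands in for one program
        # Grid-stride loop: program `pid` handles rows pid, pid + grid, ...
        for row in range(pid, n_rows, grid):
            # Pass 1: add the residual chunk by chunk, accumulate sum of squares.
            sum_sq = 0.0
            for start in range(0, n_cols, block_size):
                end = min(start + block_size, n_cols)  # tail chunk may be short
                for c in range(start, end):
                    s = x[row][c] + residual[row][c]
                    s_out[row][c] = s
                    sum_sq += s * s
            rstd = 1.0 / math.sqrt(sum_sq / n_cols + eps)

            # Pass 2: normalize and scale, again one chunk at a time.
            for start in range(0, n_cols, block_size):
                end = min(start + block_size, n_cols)
                for c in range(start, end):
                    y_out[row][c] = s_out[row][c] * rstd * weight[c]
    return y_out, s_out
```

In this sketch only `block_size` elements of a row are touched at a time, so the per-program working set no longer grows with n_cols. The cost is a second pass over the row once the reciprocal RMS is known.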

Testing Done

image
  • Hardware Type: Atlas 800I A2
  • run make test to ensure correctness
  • run make checkstyle to ensure code style
  • run make test-convergence to ensure convergence

@TianHao324 TianHao324 force-pushed the add_rms_npu branch 2 times, most recently from a3629c9 to 1795973 Compare February 9, 2026 11:32
@TianHao324 TianHao324 changed the title [NPU]: NPU-optimized fused_add_rms_norm forward kernel [NPU]: NPU-optimized fused_add_rms_norm kernel Feb 11, 2026

@Tcc0403 Tcc0403 left a comment

Collaborator

LGTM

@Tcc0403 Tcc0403 added this pull request to the merge queue Feb 12, 2026
Merged via the queue into linkedin:main with commit 60f6c84 Feb 12, 2026
3 of 7 checks passed
@TianHao324

Contributor Author

@Tcc0403 Sorry, I didn't submit some performance optimization changes in time. I'll create a new PR for them instead. Or should this merge be cancelled so that I can update the changes here?

github-merge-queue Bot pushed a commit that referenced this pull request Feb 25, 2026
…1100)

## Summary
Based on #1070 
The original kernel uses n_cols as BLOCK_SIZE. The tests pass because
n_cols is small there, but the benchmark uses a larger n_cols, which
causes a UB overflow on the NPU. Each row is therefore processed in
chunks of BLOCK_SIZE. This keeps performance high for the smaller
hidden sizes used by most models, while still supporting larger hidden
sizes.

## Testing Done
<img width="1564" height="439" alt="image"
src="https://github.com/user-attachments/assets/9de1c501-db2f-4dc1-9808-f3bf6e5abd75"
/>

- Hardware Type: Atlas 800I A2
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence