[NPU]: NPU-optimized fused_add_rms_norm kernel by TianHao324 · Pull Request #1070 · linkedin/Liger-Kernel

TianHao324 · 2026-02-05T06:50:05Z

Summary

The original kernel uses n_cols as BLOCK_SIZE. The tests use a small n_cols, so they pass, but the benchmark uses a larger n_cols, which causes a UB (Unified Buffer) overflow when running on the NPU. This PR therefore makes the following changes:
- Each row is processed in chunks of BLOCK_SIZE.
- Grid size is limited to the NPU core count to avoid resource overflow.
- Each program handles multiple rows.

Due to the device limitations of the NPU, there is still room for performance improvement. This change first moves the kernel to an NPU-supported format; even so, performance has improved significantly compared to the previous GPU format.
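The three changes above can be illustrated with a minimal pure-Python sketch. It is not the Triton kernel from this PR: the function name `fused_add_rms_norm_chunked`, the `num_programs` parameter (standing in for the grid size capped at the NPU core count), the strided row assignment, and the two-pass structure are all assumptions made for illustration. It computes `S = X + R` and `Y = S / RMS(S) * W` while touching at most `block_size` columns at a time.

```python
import math


def fused_add_rms_norm_chunked(x, residual, weight, eps=1e-6,
                               block_size=4, num_programs=2):
    """Reference sketch (hypothetical helper, not the PR's kernel).

    x, residual: lists of rows (n_rows x n_cols); weight: list of n_cols.
    Returns (y, s) where s = x + residual and y = s / rms(s) * weight.
    """
    n_rows, n_cols = len(x), len(weight)
    y = [[0.0] * n_cols for _ in range(n_rows)]
    s = [[0.0] * n_cols for _ in range(n_rows)]

    # The outer loop stands in for the launch grid: num_programs is fixed
    # (assumed to be capped at the core count), independent of n_rows.
    for pid in range(num_programs):
        # Each program handles multiple rows (assumed strided assignment).
        for row in range(pid, n_rows, num_programs):
            # Pass 1: add the residual and accumulate the sum of squares,
            # one BLOCK_SIZE chunk at a time, so the working set stays small.
            sum_sq = 0.0
            for start in range(0, n_cols, block_size):
                stop = min(start + block_size, n_cols)  # handles the tail chunk
                for col in range(start, stop):
                    v = x[row][col] + residual[row][col]
                    s[row][col] = v
                    sum_sq += v * v

            rstd = 1.0 / math.sqrt(sum_sq / n_cols + eps)

            # Pass 2: normalize and scale, again chunk by chunk.
            for start in range(0, n_cols, block_size):
                stop = min(start + block_size, n_cols)
                for col in range(start, stop):
                    y[row][col] = s[row][col] * rstd * weight[col]
    return y, s


if __name__ == "__main__":
    x = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.5, -1.0, 2.0, 0.0, 1.5]]
    r = [[0.1, 0.2, 0.3, 0.4, 0.5], [1.0, 1.0, 1.0, 1.0, 1.0]]
    w = [1.0, 0.5, 2.0, 1.5, 1.0]
    y, s = fused_add_rms_norm_chunked(x, r, w, block_size=2, num_programs=2)
    print(y)
    print(s)
```

In this sketch the per-row working set is bounded by `block_size` rather than `n_cols`, which is the property that avoids the overflow described above, at the cost of reading each row twice. A real kernel would additionally handle dtype casting and masking of the tail chunk, which are omitted here.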

Testing Done

- Hardware Type: Atlas 800I A2
- run `make test` to ensure correctness
- run `make checkstyle` to ensure code style
- run `make test-convergence` to ensure convergence

Tcc0403

LGTM

TianHao324 · 2026-02-12T07:16:32Z

@Tcc0403 Sorry, I didn't submit some performance optimization changes in time. I'm creating a new PR for them instead. Or should I cancel this merge and update this PR with the changes?

…1100)

## Summary

Based on #1070

Because the original kernel uses n_cols as BLOCK_SIZE, and n_cols is smaller in the test, the test can pass normally. However, in the benchmark, n_cols is larger, and when running on the NPU, a UB overflow occurs. Therefore, for each row, we process it in chunks of BLOCK_SIZE. Maintain high performance even when using a smaller hidden size in most models, and also ensure support in cases where a larger hidden size is used.

## Testing Done

<img width="1564" height="439" alt="image" src="https://github.com/user-attachments/assets/9de1c501-db2f-4dc1-9808-f3bf6e5abd75" />

- Hardware Type: Atlas 800I A2
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [ ] run `make test-convergence` to ensure convergence

TianHao324 force-pushed the add_rms_npu branch 2 times, most recently from a3629c9 to 1795973 Compare February 9, 2026 11:32

[NPU]: NPU-optimized fused_add_rms_norm forward kernel

425ccf4

TianHao324 force-pushed the add_rms_npu branch from 1795973 to 425ccf4 Compare February 9, 2026 11:41

TianHao324 changed the title ~~[NPU]: NPU-optimized fused_add_rms_norm forward kernel~~ [NPU]: NPU-optimized fused_add_rms_norm kernel Feb 11, 2026

Tcc0403 approved these changes Feb 12, 2026


Tcc0403 added this pull request to the merge queue Feb 12, 2026

Merged via the queue into linkedin:main with commit 60f6c84 Feb 12, 2026
3 of 7 checks passed

TianHao324 mentioned this pull request Feb 12, 2026

[NPU]: fused_add_rms_norm kernel distinguish the chunking strategy #1100

Merged

3 tasks

zheliuyu mentioned this pull request Apr 14, 2026

[NPU Roadmap, Updated to 2026-Q2] NPU support for Liger-Kernel #969

Open

41 tasks

Tcc0403 merged 1 commit into linkedin:main from TianHao324:add_rms_npu

2 participants
