[NPU]: Add NPU support for the embedding by TianHao324 · Pull Request #1028 · linkedin/Liger-Kernel

TianHao324 · 2026-01-19T11:53:01Z

Summary

Add NPU support for the embedding.

Implements a flattened, grid-stride Triton kernel for embedding forward/backward to improve scalability and reduce launch overhead on Ascend NPUs.
Uses UB-aware tiling (compute_default_tiling_strategy) and NPU vector core count to dynamically select block size and grid size for better performance stability.
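As a rough illustration of the grid-stride (persistent) pattern described above, here is a minimal pure-Python model, not the PR's Triton kernel. The function name `embedding_forward_grid_stride` and the parameters `num_programs` and `block_rows` are hypothetical; on the device the outer loop over `pid` would run in parallel, one iteration per launched program.

```python
def embedding_forward_grid_stride(table, indices, num_programs, block_rows):
    """Pure-Python model of a grid-stride embedding lookup (illustrative only)."""
    n = len(indices)
    out = [None] * n
    num_blocks = (n + block_rows - 1) // block_rows  # ceil division
    for pid in range(num_programs):  # hypothetical: one iteration per launched program
        # Each program starts at its own block and strides by the grid size,
        # so a fixed-size grid covers any number of blocks.
        for block in range(pid, num_blocks, num_programs):
            start = block * block_rows
            # min(...) plays the role of the tail mask in a real kernel.
            for i in range(start, min(start + block_rows, n)):
                out[i] = list(table[indices[i]])
    return out


table = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
indices = [2, 0, 1, 1, 2]
result = embedding_forward_grid_stride(table, indices, num_programs=2, block_rows=2)
```

Because the grid size is fixed (for example, to the number of vector cores), the launch cost does not grow with the input; only the number of loop iterations per program does.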

Testing Done

I tested the embedding kernel with the following commands, and all cases passed:

python benchmark/scripts/benchmark_embedding.py
pytest -v test/transformers/test_embedding.py

Hardware Type: Ascend NPU 910B4
run make test to ensure correctness
run make checkstyle to ensure code style
run make test-convergence to ensure convergence

TianHao324 · 2026-01-19T11:54:26Z

test_embedding result:

TianHao324 · 2026-01-19T11:55:43Z

Hi @Tcc0403, could you please help me review my code?

Tcc0403

It seems the current implementation is quite inefficient. I've left some comments about some possible issues it might have.

Tcc0403 · 2026-01-19T14:16:58Z

+        )
+
+
+def get_optimal_block_size(total_elements, is_backward: bool):


what does is_backward do?

Sorry, at first I intended to distinguish the forward and backward directions. Later, I realized their logic was quite similar and I forgot to delete it.

Tcc0403 · 2026-01-19T15:51:22Z

+@triton.jit
+def embedding_forward_kernel(
+    embeddings_ptr,
+    indices_ptr,
+    output_ptr,
+    total_elements,
+    n_elements,
+    embedding_dim: tl.constexpr,
+    BLOCK_SIZE: tl.constexpr,
+    NUM_STAGES: tl.constexpr,
+):


I think the original implementation, with two block sizes for the tile shape, is more readable and more efficient.

A persistent grid loop is fine, but the way this kernel loads the embedding appears to be uncoalesced in some cases.

For instance, some dim_idx values will not be consecutive if BLOCK_SIZE is not a multiple of embedding_dim. This makes the second tl.load, as well as the final store, access different rows within a warp.

Creating these offsets with a 2D block size is more readable and more efficient, since it avoids the uncoalesced access mentioned above.

I have changed it to 2D block. After testing, it has indeed shown much better performance. The issues mentioned below have also been fixed. Could you please review it for me again?
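The row-straddling concern raised in this thread can be sketched in plain Python. This is an illustrative model, not code from the PR: `rows_touched_flat` and `tile_offsets_2d` are hypothetical helpers, and the sizes (embedding_dim = 768, BLOCK_SIZE = 1024) are chosen only to show the effect.

```python
def rows_touched_flat(block_start, block_size, embedding_dim):
    """Output rows covered by one block of consecutive flat offsets."""
    offsets = range(block_start, block_start + block_size)
    return sorted({off // embedding_dim for off in offsets})


def tile_offsets_2d(row_start, col_start, block_m, block_n, embedding_dim):
    """Flat offsets of a (block_m x block_n) tile; each row is contiguous."""
    return [
        [(row_start + r) * embedding_dim + col_start + c for c in range(block_n)]
        for r in range(block_m)
    ]


# Flattened indexing: a 1024-element block over rows of width 768 straddles
# two rows. In a gather, those two rows map to unrelated table rows.
flat_rows_block0 = rows_touched_flat(0, 1024, 768)
flat_rows_block1 = rows_touched_flat(1024, 1024, 768)

# 2D tiling: every tile row is a contiguous run inside a single embedding row.
tile = tile_offsets_2d(row_start=0, col_start=0, block_m=2, block_n=4, embedding_dim=768)
```

With a 2D tile, the row index is loaded once per tile row and the column offsets stay consecutive, which is the access pattern the reviewer asks for.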

Tcc0403 · 2026-01-19T15:55:33Z

+    tile_shapes = compute_default_tiling_strategy(
+        safety_margin=0.9, dtype_size=4, memory_multiplier=multiplier, shapes=((total_elements,),), tiling_dims=(0,)
+    )


Should dtype_size be derived from embedding.dtype?
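To see why the element size matters for tiling, consider a simplified capacity model. This is not how compute_default_tiling_strategy is implemented; `max_elements_in_buffer`, the dtype table, and the 192 KiB buffer size are assumptions made for illustration. In PyTorch, the per-element byte size of a tensor is available as `tensor.element_size()`.

```python
# Hypothetical dtype table; in PyTorch, tensor.element_size() gives this value.
DTYPE_SIZE_BYTES = {"float32": 4, "float16": 2, "bfloat16": 2}


def max_elements_in_buffer(buffer_bytes, dtype_name, memory_multiplier=1.0, safety_margin=0.9):
    """Simplified model: how many elements fit in an on-chip buffer."""
    usable_bytes = buffer_bytes * safety_margin
    bytes_per_element = DTYPE_SIZE_BYTES[dtype_name] * memory_multiplier
    return int(usable_bytes // bytes_per_element)


ASSUMED_BUFFER_BYTES = 192 * 1024  # assumed buffer size, for illustration only

fp32_capacity = max_elements_in_buffer(ASSUMED_BUFFER_BYTES, "float32")
fp16_capacity = max_elements_in_buffer(ASSUMED_BUFFER_BYTES, "float16")
```

Under this model, hard-coding a 4-byte element size for a 16-bit embedding table would leave roughly half of the usable buffer unused.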

Tcc0403 · 2026-01-19T15:59:01Z

+        block_size = tile_shapes[0][0]
+        return block_size
+    else:
+        return triton.next_power_of_2(total_elements)


I think the fallback value should be workable; triton.next_power_of_2(total_elements) is too large.
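One possible shape for a bounded fallback is sketched below. The helper `fallback_block_size` and the cap of 2048 are hypothetical, not values from the PR, and `next_power_of_2` is a pure-Python stand-in for triton.next_power_of_2. The example size assumes total_elements counts output elements for the benchmark configuration (B=32, T=512, D=768).

```python
def next_power_of_2(n):
    """Smallest power of two >= n (pure-Python stand-in for the Triton helper)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def fallback_block_size(total_elements, max_block_size=2048):
    """Hypothetical fallback: never exceed a fixed cap, whatever the input size."""
    return min(next_power_of_2(total_elements), max_block_size)


total_elements = 32 * 512 * 768          # 12,582,912 under the stated assumption
uncapped = next_power_of_2(total_elements)
capped = fallback_block_size(total_elements)
```

Without a cap, the fallback would request a single block of more than 16 million elements for this input, which is the problem the reviewer points out.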

Tcc0403 · 2026-01-19T17:01:43Z

+            embeddings_ptr + embedding_offsets,
+            mask=final_mask,
+            other=0.0,
+        ).to(tl.float32)


Is there a reason we need to upcast it?

Tcc0403 · 2026-01-20T10:57:12Z

Could you attach the benchmark results for reference?

TianHao324 · 2026-01-20T11:12:59Z

Compared to the previous version, performance has improved by 4 to 5 times. However, it is still significantly slower than HuggingFace. I also tried the original GPU code (changed only to address the UB issue), and the performance was nearly the same (results shown below).

[
  {
    "kernel_name": "embedding",
    "kernel_provider": "liger",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "V",
    "x_label": "embedding dimension",
    "x_values": [
      1024,
      2048,
      4096,
      8192,
      16384,
      32768,
      65536,
      131072
    ],
    "y_values_50": [
      42.66733932495117,
      43.84379959106445,
      43.834800720214844,
      43.53144836425781,
      43.65476989746094,
      42.79145050048828,
      44.18817138671875,
      44.12928009033203
    ],
    "y_values_20": [
      42.66537094116211,
      43.84306716918945,
      43.83445358276367,
      43.531349182128906,
      43.65372085571289,
      42.7907829284668,
      44.18741989135742,
      44.12871551513672
    ],
    "y_values_80": [
      42.669307708740234,
      43.84453201293945,
      43.835147857666016,
      43.531551361083984,
      43.655818939208984,
      42.792118072509766,
      44.18891906738281,
      44.129844665527344
    ],
    "timestamp": "2026-01-20 10:33:22",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"B\": 32, \"T\": 512, \"D\": 768, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.0.0"
  },
  {
    "kernel_name": "embedding",
    "kernel_provider": "huggingface",
    "metric_name": "speed",
    "metric_unit": "ms",
    "gpu_name": "Ascend910B4",
    "x_name": "V",
    "x_label": "embedding dimension",
    "x_values": [
      1024,
      2048,
      4096,
      8192,
      16384,
      32768,
      65536,
      131072
    ],
    "y_values_50": [
      0.08077999949455261,
      0.091559998691082,
      0.1134599968791008,
      0.14830000698566437,
      0.1863200068473816,
      0.21172000467777252,
      0.22543999552726746,
      0.2385600060224533
    ],
    "y_values_20": [
      0.08038800209760666,
      0.09114000201225281,
      0.11287999898195267,
      0.14771999418735504,
      0.18585199117660522,
      0.21121999621391296,
      0.22499999403953552,
      0.23792000114917755
    ],
    "y_values_80": [
      0.08191999793052673,
      0.09239999949932098,
      0.11416800320148468,
      0.14903999865055084,
      0.18700000643730164,
      0.21240000426769257,
      0.22592000663280487,
      0.23929999768733978
    ],
    "timestamp": "2026-01-20 10:33:35",
    "kernel_operation_mode": "forward",
    "extra_benchmark_config_str": "{\"B\": 32, \"T\": 512, \"D\": 768, \"dtype\": \"torch.float32\"}",
    "liger_version": "0.0.0"
  },
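To put the gap in concrete terms, the median (y_values_50) forward times listed above can be compared directly. The snippet below only restates numbers from the benchmark output; the variable names are illustrative.

```python
x_values = [1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072]

# Median forward times in ms, copied from the benchmark output above.
liger_ms = [
    42.66733932495117, 43.84379959106445, 43.834800720214844, 43.53144836425781,
    43.65476989746094, 42.79145050048828, 44.18817138671875, 44.12928009033203,
]
huggingface_ms = [
    0.08077999949455261, 0.091559998691082, 0.1134599968791008, 0.14830000698566437,
    0.1863200068473816, 0.21172000467777252, 0.22543999552726746, 0.2385600060224533,
]

# Ratio of Liger time to HuggingFace time for each x value.
slowdowns = [liger / hf for liger, hf in zip(liger_ms, huggingface_ms)]

for v, ratio in zip(x_values, slowdowns):
    print(f"V={v:>6}: liger / huggingface = {ratio:.0f}x")
```

The Liger time stays near 43 to 44 ms across all x values while the HuggingFace time grows, so the ratio is largest at the smallest V.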

Results with the GPU implementation:

Tcc0403

I'm fine with merging this PR since it's an experimental operator and isn’t used in any patching path. That said, we should probably open a performance issue for this kernel and track it for future improvements.

TianHao324 · 2026-01-21T07:25:46Z

You're right. In fact, we do have plans to improve the performance. Currently, we need to first support these operators on the NPU and explore ways to optimize the performance as much as possible.

Tcc0403 · 2026-01-21T07:26:32Z

Could you open an issue with benchmarking results so we can track this performance problem and allow future contributors to work on it?

TianHao324 · 2026-01-21T07:35:39Z

Sure! #1036

Tcc0403 · 2026-01-21T07:38:58Z

Thank you!

TianHao324 mentioned this pull request Jan 19, 2026

[NPU Roadmap, Updated to 2026-Q2] NPU support for Liger-Kernel #969

Open

41 tasks

Tcc0403 requested changes Jan 19, 2026

[NPU]: Add NPU support for the embedding

fc2b201

TianHao324 force-pushed the embedding branch from 9a6b4cf to fc2b201 Compare January 20, 2026 10:50

Tcc0403 reviewed Jan 21, 2026

Tcc0403 approved these changes Jan 21, 2026

Tcc0403 merged commit 57e98d3 into linkedin:main Jan 21, 2026
3 of 7 checks passed
