Motivation
I'm not sure which project to submit this SKILL document to at the moment. So I'm creating this issue for now to communicate with the community.
I'll create a PR later showing a working example of migrating an NPU kernel with this SKILL.
The following is the specific content of the SKILL document.
name: liger-kernel-npu-kernel
description: >-
Migrates and optimizes Liger-Kernel operators for NPU with
API-compatible symbol replacement, UB-capacity-aware tiling, vector-core-
aligned launch, grid-stride scheduling, 2D block memory access, precision-
path control, on-device fused dataflow, and parity/performance validation.
Liger-Kernel x NPU Kernel Contribution Skill
1. Scope and Objective
This skill is for:
- Porting
src/liger_kernel/ops/<op>.py to src/liger_kernel/ops/backends/_ascend/ops/<op>.py.
- Applying NPU-focused optimizations, not just making code run.
- Maintaining the NPU export and replacement path so upper layer imports resolve to NPU implementations.
- Delivering reproducible numerical and performance evidence for ongoing migration work.
A completed task must satisfy all of the following:
- API compatibility.
- NPU-specific optimization landed.
- Forward and backward parity.
- Measured performance result or clear bottleneck analysis.
2. Architecture Facts and Replacement Flow (Mandatory)
2.1 Key Paths
- Default operators:
src/liger_kernel/ops/
- NPU operators:
src/liger_kernel/ops/backends/_ascend/ops/
- Vendor registration:
src/liger_kernel/ops/backends/_ascend/__init__.py
- Vendor registry:
src/liger_kernel/ops/backends/registry.py
- Runtime replacement entry:
src/liger_kernel/ops/__init__.py
2.2 Runtime Replacement Steps
liger_kernel.ops loads and calls _replace_with_vendor_ops().
infer_device() resolves device type (npu).
get_vendor_for_device("npu") maps to vendor ascend.
liger_kernel.ops.backends._ascend.ops is imported dynamically.
- Only symbols listed in the NPU module
__all__ override same-name symbols in liger_kernel.ops.
2.3 Import Semantics (Always Verify)
from liger_kernel.ops import ...: vendor replacement applies.
from liger_kernel.ops.<module> import ...: usually stays on default implementation.
If NPU ops/__init__.py imports and __all__ are not updated, your NPU operator will not be used by top-level imports.
3. NPU Affinity Optimization Playbook (Core)
The following optimization points come from existing NPU operators and should be treated as main requirements.
3.1 UB-Capacity-Aware Dynamic Tiling
Use:
compute_default_tiling_strategy(safety_margin, dtype_size, memory_multiplier, shapes, tiling_dims)
Core idea:
- Derive safe block sizes from UB capacity instead of hardcoding static block sizes.
- Keep resulting tile sizes as powers of two.
- Prioritize UB safety before throughput tuning.
UB capacity source priority:
ASCEND_UB_CAPACITY_BITS
- CANN
get_soc_spec("UB_SIZE")
Migration guidance:
- Estimate
memory_multiplier per operator based on peak temporary memory.
- Do not copy CUDA-side block constants directly.
3.2 Vector-Core-Aligned Launch Configuration
Use get_npu_core_count() to shape grid configuration.
Goals:
- Avoid idle vector cores.
- Avoid over-launching and scheduling overhead.
- Match work split to hardware concurrency.
3.3 Grid-Stride Scheduling
Use grid-stride loops for row-wise or block-wise coverage:
for row_idx in range(start_row, n_rows, stride)
for block_idx in tl.range(pid, total_blocks, num_progs)
Benefits:
- Better load balance on large shapes.
- More stable utilization on medium and small shapes.
3.4 2D Blocking and Vectorized Memory Access
Common pattern: BLOCK_SIZE_M x BLOCK_SIZE_N plus 2D masked load/store.
Benefits:
- Processes multiple rows and columns per program.
- Reduces irregular memory access overhead.
- Fits high-frequency paths such as embedding, normalization, and loss kernels.
3.5 Shape-Specific Kernel Paths
Split kernel paths by size group (for example, no-tiling vs tiling).
Principles:
- Small shapes: avoid over-tiling overhead.
- Large shapes: control UB pressure and increase parallel efficiency.
3.6 Memory and Cache Strategy
- Load reusable vectors once and reuse when possible.
- Control temporary tensor lifetime to reduce UB pressure.
- Apply eviction policy in hot paths when needed (for example,
evict_first).
- Use masks strictly to reduce invalid loads/stores.
3.7 Precision Path and Casting Control
Define explicitly:
- Where statistics are accumulated (for example, fp32 for stability).
- Where outputs are cast back to target dtype.
- How model-specific branches (for example, Llama or Gemma behavior) are aligned.
Goal:
- Keep numerical stability while avoiding unnecessary high-precision cost.
3.8 On-Device Fused Dataflow
For fused operators, prefer NPU sub-kernels and avoid cross-backend mixing.
Typical direction:
fused_linear_cross_entropy should reuse NPU cross entropy internals.
fused_neighborhood_attention should reuse NPU softmax internals.
4. Standard Migration Procedure (Execution Template)
Phase A: Lock API Contract
From default operator, lock down:
Function class names
- Public
forward/backward symbols
- Argument order, dtype constraints, return structure
ctx.save_for_backward and backward dependencies
Phase B: Port and Apply NPU Optimizations
After copying the file to the NPU path:
- Add UB-aware tiling logic
- Rebuild launch/grid for NPU
- Introduce grid-stride and 2D blocking
- Define precision and cast behavior explicitly
- Preserve autograd semantics (
ensure_contiguous, saved tensors)
Phase C: Wire Export
Update src/liger_kernel/ops/backends/_ascend/ops/__init__.py:
- Add imports
- Add symbols to
__all__
Phase D: Validate and Tune
Minimum validation:
- Forward correctness
- Backward parity
- Small/medium/large shape coverage
- Performance comparison with recorded kernel parameters
Performance profiling skill:
- For performance analysis, use the NPU-provided
msprof tool. msprof is included in CANN and requires no extra installation.
5. Migration Checklist
6. Common Failure Patterns and Fixes
-
NPU implementation not selected:
- Check import path (
liger_kernel.ops vs submodule import)
- Check vendor registration and NPU
__all__
-
UB failures or instability on large shapes:
- Revisit
memory_multiplier
-
Revisit safety_margin and check if tile size is too aggressive
-
No performance gain or regression:
- Check grid concurrency
- Check missing 2D blocking or missing shape split
- Check unnecessary high-precision path
-
Numerical drift:
- Check statistic accumulation precision
- Check cast-back timing
- Check branch consistency between forward and backward
-
Fused kernel slowdown:
- Check accidental fallback to default backend sub-kernels
- Check host-side synchronization breaking on-device pipeline
7. Definition of Done
An NPU kernel migration is complete only when all conditions are met:
- API-compatible replacement without upper-layer call changes.
- Top-level import from
liger_kernel.ops resolves to NPU symbols.
- NPU affinity optimizations are present (at minimum: UB tiling + parallelization + load balancing).
- Forward/backward parity is within agreed thresholds.
- Performance evidence exists for at least three shape groups.
- Parameter choice and shape-split reason are documented clearly.
8. Recommended Task Report Template
Use this structure for every migration delivery:
- Changed files
- API compatibility map (symbols and signatures)
- NPU affinity optimizations applied
- Tiling and launch parameters (with rationale)
- Numerical parity results (forward and backward)
- Performance results by shape group
- Risks and next optimization steps
9. Existing NPU Operator Coverage (Reference)
Current NPU exports include:
- embedding
- rms_norm / layer_norm / poly_norm / group_norm
- geglu / swiglu
- rope / llama4_rope / qwen2vl_mrope
- softmax / sparsemax
- cross_entropy / fused_linear_cross_entropy
- jsd / fused_linear_jsd / kl_div / tvd / grpo_loss
- dyt
- fused_neighborhood_attention
New migrations should follow the same export and replacement discipline.
Motivation
I'm not sure which project to submit this SKILL document to at the moment. So I'm creating this issue for now to communicate with the community.
I'll create a PR later showing a working example of migrating an NPU kernel with this SKILL.
The following is the specific content of the SKILL document.
Liger-Kernel x NPU Kernel Contribution Skill
1. Scope and Objective
This skill is for:
src/liger_kernel/ops/<op>.pytosrc/liger_kernel/ops/backends/_ascend/ops/<op>.py.A completed task must satisfy all of the following:
2. Architecture Facts and Replacement Flow (Mandatory)
2.1 Key Paths
src/liger_kernel/ops/src/liger_kernel/ops/backends/_ascend/ops/src/liger_kernel/ops/backends/_ascend/__init__.pysrc/liger_kernel/ops/backends/registry.pysrc/liger_kernel/ops/__init__.py2.2 Runtime Replacement Steps
liger_kernel.opsloads and calls_replace_with_vendor_ops().infer_device()resolves device type (npu).get_vendor_for_device("npu")maps to vendorascend.liger_kernel.ops.backends._ascend.opsis imported dynamically.__all__override same-name symbols inliger_kernel.ops.2.3 Import Semantics (Always Verify)
from liger_kernel.ops import ...: vendor replacement applies.from liger_kernel.ops.<module> import ...: usually stays on default implementation.If NPU
ops/__init__.pyimports and__all__are not updated, your NPU operator will not be used by top-level imports.3. NPU Affinity Optimization Playbook (Core)
The following optimization points come from existing NPU operators and should be treated as main requirements.
3.1 UB-Capacity-Aware Dynamic Tiling
Use:
compute_default_tiling_strategy(safety_margin, dtype_size, memory_multiplier, shapes, tiling_dims)Core idea:
UB capacity source priority:
ASCEND_UB_CAPACITY_BITSget_soc_spec("UB_SIZE")Migration guidance:
memory_multiplierper operator based on peak temporary memory.3.2 Vector-Core-Aligned Launch Configuration
Use
get_npu_core_count()to shape grid configuration.Goals:
3.3 Grid-Stride Scheduling
Use grid-stride loops for row-wise or block-wise coverage:
for row_idx in range(start_row, n_rows, stride)for block_idx in tl.range(pid, total_blocks, num_progs)Benefits:
3.4 2D Blocking and Vectorized Memory Access
Common pattern:
BLOCK_SIZE_M x BLOCK_SIZE_Nplus 2D masked load/store.Benefits:
3.5 Shape-Specific Kernel Paths
Split kernel paths by size group (for example, no-tiling vs tiling).
Principles:
3.6 Memory and Cache Strategy
evict_first).3.7 Precision Path and Casting Control
Define explicitly:
Goal:
3.8 On-Device Fused Dataflow
For fused operators, prefer NPU sub-kernels and avoid cross-backend mixing.
Typical direction:
fused_linear_cross_entropyshould reuse NPU cross entropy internals.fused_neighborhood_attentionshould reuse NPU softmax internals.4. Standard Migration Procedure (Execution Template)
Phase A: Lock API Contract
From default operator, lock down:
Functionclass namesforward/backwardsymbolsctx.save_for_backwardand backward dependenciesPhase B: Port and Apply NPU Optimizations
After copying the file to the NPU path:
ensure_contiguous, saved tensors)Phase C: Wire Export
Update
src/liger_kernel/ops/backends/_ascend/ops/__init__.py:__all__Phase D: Validate and Tune
Minimum validation:
Performance profiling skill:
msproftool.msprofis included in CANN and requires no extra installation.5. Migration Checklist
backends/_ascend/ops/ops/__init__.pyimports and__all__are updatedliger_kernel.opsresolves to NPU symbols6. Common Failure Patterns and Fixes
NPU implementation not selected:
liger_kernel.opsvs submodule import)__all__UB failures or instability on large shapes:
memory_multiplierRevisit
safety_marginand check if tile size is too aggressiveNo performance gain or regression:
Numerical drift:
Fused kernel slowdown:
7. Definition of Done
An NPU kernel migration is complete only when all conditions are met:
liger_kernel.opsresolves to NPU symbols.8. Recommended Task Report Template
Use this structure for every migration delivery:
9. Existing NPU Operator Coverage (Reference)
Current NPU exports include:
New migrations should follow the same export and replacement discipline.