[RFC, NPU] Liger-Kernel x NPU Kernel Contribution Skill

## Motivation

I'm not yet sure which project this SKILL document should be submitted to, so I'm opening this issue to discuss it with the community first.

I'll follow up with a PR that shows a working example of migrating an NPU kernel with this SKILL.

**The following is the specific content of the SKILL document.**


```
name: liger-kernel-npu-kernel
description: >-
  Migrates and optimizes Liger-Kernel operators for NPU with API-compatible
  symbol replacement, UB-capacity-aware tiling, vector-core-aligned launch,
  grid-stride scheduling, 2D block memory access, precision-path control,
  on-device fused dataflow, and parity/performance validation.
```

# Liger-Kernel x NPU Kernel Contribution Skill

## 1. Scope and Objective

This skill is for:

- Porting `src/liger_kernel/ops/<op>.py` to `src/liger_kernel/ops/backends/_ascend/ops/<op>.py`.
- Applying NPU-focused optimizations, not just making code run.
- Maintaining the NPU export and replacement path so that upper-layer imports resolve to NPU implementations.
- Delivering reproducible numerical and performance evidence for ongoing migration work.

A completed task must satisfy all of the following:

1. API compatibility.
2. NPU-specific optimization landed.
3. Forward and backward parity.
4. Measured performance result or clear bottleneck analysis.

---

## 2. Architecture Facts and Replacement Flow (Mandatory)

### 2.1 Key Paths

- Default operators: `src/liger_kernel/ops/`
- NPU operators: `src/liger_kernel/ops/backends/_ascend/ops/`
- Vendor registration: `src/liger_kernel/ops/backends/_ascend/__init__.py`
- Vendor registry: `src/liger_kernel/ops/backends/registry.py`
- Runtime replacement entry: `src/liger_kernel/ops/__init__.py`

### 2.2 Runtime Replacement Steps

1. `liger_kernel.ops` loads and calls `_replace_with_vendor_ops()`.
2. `infer_device()` resolves device type (`npu`).
3. `get_vendor_for_device("npu")` maps to vendor `ascend`.
4. `liger_kernel.ops.backends._ascend.ops` is imported dynamically.
5. Only symbols listed in the NPU module `__all__` override same-name symbols in `liger_kernel.ops`.
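The replacement rule in step 5 can be sketched in plain Python. This is a minimal illustration, not the actual body of `_replace_with_vendor_ops()`: the helper name `replace_with_vendor_ops`, the stand-in module `fake_vendor_ops`, and the dict used as the default namespace are all assumptions made for the example.

```python
import types


def replace_with_vendor_ops(default_ns, vendor_module):
    """Override names in default_ns with the symbols the vendor module exports.

    Hypothetical helper: only names listed in vendor_module.__all__ are copied.
    """
    replaced = []
    for name in getattr(vendor_module, "__all__", []):
        default_ns[name] = getattr(vendor_module, name)
        replaced.append(name)
    return replaced


# Stand-in for the default namespace (illustration only).
default_ops = {
    "rms_norm": lambda x: ("default", x),
    "softmax": lambda x: ("default", x),
}

# Stand-in for the vendor ops module (illustration only).
vendor = types.ModuleType("fake_vendor_ops")
vendor.rms_norm = lambda x: ("npu", x)
vendor.softmax = lambda x: ("npu", x)  # defined but NOT exported
vendor.__all__ = ["rms_norm"]

replaced = replace_with_vendor_ops(default_ops, vendor)
print(replaced)                    # names that were overridden
print(default_ops["softmax"](1))   # still the default implementation
```

The `softmax` stand-in shows the failure mode this skill warns about: an operator that exists in the vendor module but is missing from `__all__` is silently skipped.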

### 2.3 Import Semantics (Always Verify)

- `from liger_kernel.ops import ...`: vendor replacement applies.
- `from liger_kernel.ops.<module> import ...`: usually stays on the default implementation.

If NPU `ops/__init__.py` imports and `__all__` are not updated, your NPU operator will not be used by top-level imports.

---

## 3. NPU Affinity Optimization Playbook (Core)

The following optimization points come from existing NPU operators and should be treated as primary requirements.

### 3.1 UB-Capacity-Aware Dynamic Tiling

Use:

- `compute_default_tiling_strategy(safety_margin, dtype_size, memory_multiplier, shapes, tiling_dims)`

Core idea:

- Derive safe block sizes from UB capacity instead of hardcoding static block sizes.
- Keep resulting tile sizes as powers of two.
- Prioritize UB safety before throughput tuning.

UB capacity source priority:

1. `ASCEND_UB_CAPACITY_BITS`
2. CANN `get_soc_spec("UB_SIZE")`
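The priority order above can be sketched as a small resolver. This is an illustration under assumptions: the function name `resolve_ub_capacity_bits` is hypothetical, the CANN query is passed in as a callable so the example runs without CANN installed, and the conversion assumes `UB_SIZE` is reported in bytes, which should be confirmed against the CANN documentation.

```python
import os


def resolve_ub_capacity_bits(query_soc_spec=None, env=None):
    """Resolve UB capacity in bits: environment override first, SoC query second.

    Hypothetical helper. `query_soc_spec` stands in for CANN's get_soc_spec.
    """
    env = os.environ if env is None else env
    raw = env.get("ASCEND_UB_CAPACITY_BITS")
    if raw:
        return int(raw)
    if query_soc_spec is not None:
        # Assumption: UB_SIZE is reported in bytes, so convert to bits.
        return int(query_soc_spec("UB_SIZE")) * 8
    raise RuntimeError("UB capacity is unknown: set ASCEND_UB_CAPACITY_BITS")


# Illustrative values only; real capacities depend on the SoC.
from_env = resolve_ub_capacity_bits(env={"ASCEND_UB_CAPACITY_BITS": "1572864"})
from_soc = resolve_ub_capacity_bits(query_soc_spec=lambda key: 196608, env={})
print(from_env, from_soc)
```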

Migration guidance:

- Estimate `memory_multiplier` per operator based on peak temporary memory.
- Do not copy CUDA-side block constants directly.
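The arithmetic behind UB-aware tiling can be sketched as follows. This is not the implementation of `compute_default_tiling_strategy`; the function `largest_safe_block` and the way it combines the parameters are assumptions chosen to illustrate the idea of a power-of-two block size bounded by the UB budget.

```python
def largest_safe_block(ub_capacity_bits, dtype_size, memory_multiplier,
                       safety_margin, dim_size):
    """Largest power-of-two block (in elements) that fits the UB budget.

    Hypothetical helper: each element is assumed to cost
    dtype_size * memory_multiplier bytes of UB at peak.
    """
    budget_bytes = (ub_capacity_bits / 8) * safety_margin
    max_elems = int(budget_bytes / (dtype_size * memory_multiplier))
    if max_elems < 1:
        raise ValueError("UB budget is too small for a single element")
    # Round down to a power of two so the tile stays within budget.
    ub_block = 1 << (max_elems.bit_length() - 1)
    # No point in exceeding the next power of two above the dimension itself.
    dim_block = 1 << max(dim_size - 1, 0).bit_length()
    return min(ub_block, dim_block)


# Illustrative numbers: 192 KiB UB, fp32, 8 live temporaries, 80% margin.
wide = largest_safe_block(1572864, 4, 8.0, 0.8, dim_size=32000)
narrow = largest_safe_block(1572864, 4, 8.0, 0.8, dim_size=1000)
print(wide, narrow)
```

Raising `memory_multiplier` or lowering `safety_margin` shrinks the block, which is the first thing to try when a large shape overflows UB.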

### 3.2 Vector-Core-Aligned Launch Configuration

Use `get_npu_core_count()` to shape grid configuration.

Goals:

- Avoid idle vector cores.
- Avoid over-launching and scheduling overhead.
- Match work split to hardware concurrency.
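A minimal sketch of a core-aligned grid, assuming a 1D launch where each program handles one or more blocks. The function `launch_grid` is hypothetical, and the core count is passed in as a plain integer in place of a call to `get_npu_core_count()`.

```python
def launch_grid(total_blocks, core_count):
    """1D grid capped at the number of vector cores.

    Hypothetical helper: never launch more programs than cores (avoids
    scheduling overhead) and never more than there are blocks (avoids idle
    programs). Remaining blocks are expected to be covered by a loop inside
    the kernel.
    """
    return (max(1, min(total_blocks, core_count)),)


# Illustrative core count; query the real value on the target device.
print(launch_grid(total_blocks=1000, core_count=40))
print(launch_grid(total_blocks=7, core_count=40))
```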

### 3.3 Grid-Stride Scheduling

Use grid-stride loops for row-wise or block-wise coverage:

- `for row_idx in range(start_row, n_rows, stride)`
- `for block_idx in tl.range(pid, total_blocks, num_progs)`

Benefits:

- Better load balance on large shapes.
- More stable utilization on medium and small shapes.
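The coverage and balance properties of a grid-stride loop can be checked with a host-side simulation. The sketch below mirrors the `range(pid, total_blocks, num_progs)` pattern in plain Python; it is not kernel code, and the function name is hypothetical.

```python
def blocks_for_program(pid, total_blocks, num_progs):
    """Block indices that program `pid` visits in a grid-stride loop."""
    return list(range(pid, total_blocks, num_progs))


total_blocks, num_progs = 10, 4
assignment = [blocks_for_program(pid, total_blocks, num_progs)
              for pid in range(num_progs)]

# Every block is visited exactly once across all programs.
visited = sorted(b for blocks in assignment for b in blocks)
# Per-program load differs by at most one block.
loads = [len(blocks) for blocks in assignment]

print(assignment)
print(loads)
```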

### 3.4 2D Blocking and Vectorized Memory Access

Common pattern: `BLOCK_SIZE_M x BLOCK_SIZE_N` plus 2D masked load/store.

Benefits:

- Processes multiple rows and columns per program.
- Reduces irregular memory access overhead.
- Fits high-frequency paths such as embedding, normalization, and loss kernels.
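The indexing behind a 2D masked tile can be illustrated on the host with nested lists. This sketch only demonstrates how `BLOCK_SIZE_M x BLOCK_SIZE_N` tiles and a 2D bounds mask cover a matrix whose shape is not a multiple of the tile; the function `tiled_scale` is hypothetical and says nothing about on-device performance.

```python
def tiled_scale(matrix, scale, block_m, block_n):
    """Scale a 2D list tile by tile, masking out-of-bounds positions."""
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for row_start in range(0, n_rows, block_m):        # one tile row per step
        for col_start in range(0, n_cols, block_n):    # one tile column per step
            for dm in range(block_m):
                for dn in range(block_n):
                    r, c = row_start + dm, col_start + dn
                    # 2D mask: skip lanes that fall outside the matrix.
                    if r < n_rows and c < n_cols:
                        out[r][c] = matrix[r][c] * scale
    return out


# 3 x 5 input with 2 x 4 tiles, so the last tile in each direction is partial.
data = [[r * 5 + c for c in range(5)] for r in range(3)]
result = tiled_scale(data, 2.0, block_m=2, block_n=4)
print(result)
```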

### 3.5 Shape-Specific Kernel Paths

Split kernel paths by size group (for example, no-tiling vs tiling).

Principles:

- Small shapes: avoid over-tiling overhead.
- Large shapes: control UB pressure and increase parallel efficiency.
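A shape split can be as simple as a host-side dispatcher. The sketch below assumes the split is made on the column count against the largest block that fits in UB; the function name, the path labels, and the threshold value are all illustrative assumptions.

```python
def select_kernel_path(n_cols, max_single_tile_cols):
    """Pick a kernel path from the row width.

    Hypothetical dispatcher: rows that fit in a single UB-safe tile take the
    no-tiling path, wider rows take the tiled path.
    """
    if n_cols <= max_single_tile_cols:
        return "no_tiling"
    return "tiling"


# Illustrative threshold, e.g. a block size derived from UB capacity.
print(select_kernel_path(n_cols=1024, max_single_tile_cols=4096))
print(select_kernel_path(n_cols=32000, max_single_tile_cols=4096))
```

Recording the threshold and the reason for it alongside the kernel makes the split reproducible in later tuning.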

### 3.6 Memory and Cache Strategy

- Load reusable vectors once and reuse when possible.
- Control temporary tensor lifetime to reduce UB pressure.
- Apply eviction policy in hot paths when needed (for example, `evict_first`).
- Apply masks strictly to avoid invalid loads/stores.

### 3.7 Precision Path and Casting Control

Define explicitly:

- Where statistics are accumulated (for example, fp32 for stability).
- Where outputs are cast back to target dtype.
- How model-specific branches (for example, Llama or Gemma behavior) are aligned.

Goal:

- Keep numerical stability while avoiding unnecessary high-precision cost.
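The effect of the accumulation dtype can be reproduced on the host with the standard library alone. The sketch below emulates fp16 rounding with `struct`'s half-precision format and uses Python's native float (binary64) as a stand-in for an fp32 accumulator; the function names are hypothetical.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def sum_in_fp16(values):
    """Accumulate with the running sum rounded to fp16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_in_high_precision(values):
    """Accumulate in higher precision, cast back to fp16 once at the end."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)


values = [1.0] * 4096
low = sum_in_fp16(values)             # stalls once the fp16 spacing reaches 2
high = sum_in_high_precision(values)  # exact, then a single cast back
print(low, high)
```

The fp16 accumulator stops growing at 2048 because adding 1 no longer changes the rounded value, which is why statistics such as sums and variances are usually accumulated in a wider type and cast back only at the output.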

### 3.8 On-Device Fused Dataflow

For fused operators, prefer NPU sub-kernels and avoid cross-backend mixing.

Typical direction:

- `fused_linear_cross_entropy` should reuse NPU cross entropy internals.
- `fused_neighborhood_attention` should reuse NPU softmax internals.

---

## 4. Standard Migration Procedure (Execution Template)

### Phase A: Lock API Contract

From default operator, lock down:

- `Function` class names
- Public `forward/backward` symbols
- Argument order, dtype constraints, return structure
- `ctx.save_for_backward` and backward dependencies
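Part of this contract can be checked mechanically with `inspect.signature`. The sketch below compares parameter names, order, and defaults of two callables; the helper `signature_mismatches` and the stand-in functions (including their argument names) are hypothetical and only illustrate the check.

```python
import inspect


def signature_mismatches(default_fn, vendor_fn):
    """List differences in parameter names, order, and defaults."""
    d_params = inspect.signature(default_fn).parameters
    v_params = inspect.signature(vendor_fn).parameters
    issues = []
    if list(d_params) != list(v_params):
        issues.append(f"parameters differ: {list(d_params)} vs {list(v_params)}")
        return issues
    for name in d_params:
        if d_params[name].default != v_params[name].default:
            issues.append(f"default for '{name}' differs")
        if d_params[name].kind != v_params[name].kind:
            issues.append(f"kind of '{name}' differs")
    return issues


# Stand-in signatures (illustration only).
def default_forward(x, weight, eps=1e-6):
    pass


def npu_forward_ok(x, weight, eps=1e-6):
    pass


def npu_forward_bad(x, eps=1e-6, weight=None):
    pass


print(signature_mismatches(default_forward, npu_forward_ok))
print(signature_mismatches(default_forward, npu_forward_bad))
```

Return structure and saved tensors are not visible in a signature and still need a manual or test-based check.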

### Phase B: Port and Apply NPU Optimizations

After copying the file to the NPU path:

- Add UB-aware tiling logic
- Rebuild launch/grid for NPU
- Introduce grid-stride and 2D blocking
- Define precision and cast behavior explicitly
- Preserve autograd semantics (`ensure_contiguous`, saved tensors)

### Phase C: Wire Export

Update `src/liger_kernel/ops/backends/_ascend/ops/__init__.py`:

- Add imports
- Add symbols to `__all__`

### Phase D: Validate and Tune

Minimum validation:

1. Forward correctness
2. Backward parity
3. Small/medium/large shape coverage
4. Performance comparison with recorded kernel parameters
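A parity check reduces to comparing flattened outputs (or gradients) against a reference under an agreed tolerance. The sketch below uses the common `|a - b| <= atol + rtol * |b|` criterion on plain Python lists; the helper `parity_report` and the tolerance values are assumptions, and in practice the inputs would come from the default and NPU implementations across small, medium, and large shapes.

```python
def parity_report(reference, candidate, atol, rtol=0.0):
    """Compare two flat sequences of floats element by element.

    Hypothetical helper: returns the maximum absolute error and whether every
    element satisfies |cand - ref| <= atol + rtol * |ref|.
    """
    if len(reference) != len(candidate):
        raise ValueError("length mismatch between reference and candidate")
    max_abs_err = 0.0
    passed = True
    for ref, cand in zip(reference, candidate):
        err = abs(cand - ref)
        max_abs_err = max(max_abs_err, err)
        if err > atol + rtol * abs(ref):
            passed = False
    return {"max_abs_err": max_abs_err, "passed": passed}


# Illustrative values only.
reference = [1.0, 2.0, 3.0]
candidate = [1.0, 2.0005, 3.0]
loose = parity_report(reference, candidate, atol=1e-3)
tight = parity_report(reference, candidate, atol=1e-4)
print(loose, tight)
```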

Performance profiling skill:

- For performance analysis, use the NPU-provided `msprof` tool. `msprof` is included in CANN and requires no extra installation.

---

## 5. Migration Checklist

- [ ] New operator file is under `backends/_ascend/ops/`
- [ ] Public symbol names match default implementation
- [ ] UB-capacity-aware tiling is implemented where applicable
- [ ] Grid configuration is aligned to NPU vector core scale
- [ ] Grid-stride scheduling is applied for load balance
- [ ] 2D blocked memory access is evaluated and applied where beneficial
- [ ] Precision and cast path are explicit and consistent
- [ ] NPU `ops/__init__.py` imports and `__all__` are updated
- [ ] Top-level import from `liger_kernel.ops` resolves to NPU symbols
- [ ] Forward and backward errors are within threshold
- [ ] Performance data is recorded for at least three shape groups

---

## 6. Common Failure Patterns and Fixes

- NPU implementation not selected:
  - Check import path (`liger_kernel.ops` vs submodule import)
  - Check vendor registration and NPU `__all__`

- UB failures or instability on large shapes:
  - Revisit `memory_multiplier`
- Revisit `safety_margin` and check if tile size is too aggressive

- No performance gain or regression:
  - Check grid concurrency
  - Check missing 2D blocking or missing shape split
  - Check unnecessary high-precision path

- Numerical drift:
  - Check statistic accumulation precision
  - Check cast-back timing
  - Check branch consistency between forward and backward

- Fused kernel slowdown:
  - Check accidental fallback to default backend sub-kernels
  - Check host-side synchronization breaking on-device pipeline

---

## 7. Definition of Done

An NPU kernel migration is complete only when all conditions are met:

1. API-compatible replacement without upper-layer call changes.
2. Top-level import from `liger_kernel.ops` resolves to NPU symbols.
3. NPU affinity optimizations are present (at minimum: UB tiling + parallelization + load balancing).
4. Forward/backward parity is within agreed thresholds.
5. Performance evidence exists for at least three shape groups.
6. Parameter choice and shape-split reason are documented clearly.

---

## 8. Recommended Task Report Template

Use this structure for every migration delivery:

1. Changed files
2. API compatibility map (symbols and signatures)
3. NPU affinity optimizations applied
4. Tiling and launch parameters (with rationale)
5. Numerical parity results (forward and backward)
6. Performance results by shape group
7. Risks and next optimization steps

---

## 9. Existing NPU Operator Coverage (Reference)

Current NPU exports include:

- embedding
- rms_norm / layer_norm / poly_norm / group_norm
- geglu / swiglu
- rope / llama4_rope / qwen2vl_mrope
- softmax / sparsemax
- cross_entropy / fused_linear_cross_entropy
- jsd / fused_linear_jsd / kl_div / tvd / grpo_loss
- dyt
- fused_neighborhood_attention

New migrations should follow the same export and replacement discipline.

