[RFC, NPU] Liger-Kernel x NPU Kernel Contribution Skill #1197

@zheliuyu

Motivation

I'm not yet sure which project this SKILL document should be submitted to, so I'm opening this issue to discuss it with the community.

I'll create a PR later showing a working example of migrating an NPU kernel with this SKILL.

The following is the specific content of the SKILL document.

name: liger-kernel-npu-kernel
description: >-
  Migrates and optimizes Liger-Kernel operators for NPU with
  API-compatible symbol replacement, UB-capacity-aware tiling, vector-core-
  aligned launch, grid-stride scheduling, 2D block memory access, precision-
  path control, on-device fused dataflow, and parity/performance validation.

Liger-Kernel x NPU Kernel Contribution Skill

1. Scope and Objective

This skill is for:

  • Porting src/liger_kernel/ops/<op>.py to src/liger_kernel/ops/backends/_ascend/ops/<op>.py.
  • Applying NPU-focused optimizations, not just making code run.
  • Maintaining the NPU export and replacement path so upper layer imports resolve to NPU implementations.
  • Delivering reproducible numerical and performance evidence for ongoing migration work.

A completed task must satisfy all of the following:

  1. API compatibility.
  2. NPU-specific optimization landed.
  3. Forward and backward parity.
  4. Measured performance result or clear bottleneck analysis.

2. Architecture Facts and Replacement Flow (Mandatory)

2.1 Key Paths

  • Default operators: src/liger_kernel/ops/
  • NPU operators: src/liger_kernel/ops/backends/_ascend/ops/
  • Vendor registration: src/liger_kernel/ops/backends/_ascend/__init__.py
  • Vendor registry: src/liger_kernel/ops/backends/registry.py
  • Runtime replacement entry: src/liger_kernel/ops/__init__.py

2.2 Runtime Replacement Steps

  1. liger_kernel.ops loads and calls _replace_with_vendor_ops().
  2. infer_device() resolves device type (npu).
  3. get_vendor_for_device("npu") maps to vendor ascend.
  4. liger_kernel.ops.backends._ascend.ops is imported dynamically.
  5. Only symbols listed in the NPU module __all__ override same-name symbols in liger_kernel.ops.
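
The override step above can be pictured with a minimal, self-contained sketch. The module objects and the `replace_with_vendor_ops` helper below are hypothetical stand-ins, not the actual Liger-Kernel code; they only illustrate how names listed in a vendor module's `__all__` can shadow same-name attributes on the default package.

```python
import types

def replace_with_vendor_ops(default_ops, vendor_ops):
    """Copy every symbol listed in vendor_ops.__all__ onto default_ops.

    Hypothetical stand-in for the override step; returns the replaced names.
    """
    replaced = []
    for name in getattr(vendor_ops, "__all__", []):
        setattr(default_ops, name, getattr(vendor_ops, name))
        replaced.append(name)
    return replaced

# Stand-in for the default operator package.
default_ops = types.ModuleType("default_ops")
default_ops.rms_norm = lambda: "default rms_norm"
default_ops.softmax = lambda: "default softmax"

# Stand-in for the NPU operator package: softmax exists but is not exported.
vendor_ops = types.ModuleType("vendor_ops")
vendor_ops.rms_norm = lambda: "npu rms_norm"
vendor_ops.softmax = lambda: "npu softmax"
vendor_ops.__all__ = ["rms_norm"]

replaced = replace_with_vendor_ops(default_ops, vendor_ops)
print(replaced)                # ['rms_norm']
print(default_ops.rms_norm())  # npu rms_norm
print(default_ops.softmax())   # default softmax (not listed in __all__)
```

The `softmax` case shows why an operator that exists in the NPU package but is missing from `__all__` is silently ignored.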

2.3 Import Semantics (Always Verify)

  • from liger_kernel.ops import ...: vendor replacement applies.
  • from liger_kernel.ops.<module> import ...: usually stays on the default implementation.

If the imports and __all__ in the NPU ops/__init__.py are not updated, top-level imports will not pick up your NPU operator.
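
One way to verify which implementation an import resolved to is to inspect the symbol's `__module__`. The helper below is a hypothetical sketch, not part of Liger-Kernel; it assumes NPU implementations live under a dotted path containing `backends._ascend`, as the key paths above suggest, and it uses fake classes so it runs without an NPU.

```python
def resolves_to_npu(symbol, marker="backends._ascend"):
    """Return True if the symbol was defined in a module whose path contains marker."""
    module_name = getattr(symbol, "__module__", None) or ""
    return marker in module_name

# Fake symbols standing in for a default and an NPU implementation.
class FakeDefaultOp:
    pass

class FakeNpuOp:
    pass

FakeDefaultOp.__module__ = "liger_kernel.ops.rms_norm"
FakeNpuOp.__module__ = "liger_kernel.ops.backends._ascend.ops.rms_norm"

print(resolves_to_npu(FakeDefaultOp))  # False
print(resolves_to_npu(FakeNpuOp))      # True
```

On a real NPU environment the same check could be applied to a symbol imported from `liger_kernel.ops` and to the same symbol imported from its submodule, to confirm the two import styles behave as described.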


3. NPU Affinity Optimization Playbook (Core)

The optimization points below are drawn from existing NPU operators and should be treated as primary requirements.

3.1 UB-Capacity-Aware Dynamic Tiling

Use:

  • compute_default_tiling_strategy(safety_margin, dtype_size, memory_multiplier, shapes, tiling_dims)

Core idea:

  • Derive safe block sizes from UB capacity instead of hardcoding static block sizes.
  • Keep resulting tile sizes as powers of two.
  • Prioritize UB safety before throughput tuning.

UB capacity source priority:

  1. ASCEND_UB_CAPACITY_BITS
  2. CANN get_soc_spec("UB_SIZE")

Migration guidance:

  • Estimate memory_multiplier per operator based on peak temporary memory.
  • Do not copy CUDA-side block constants directly.
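
The core idea can be illustrated with a small sketch. This is not the implementation of `compute_default_tiling_strategy`; the function name, the footprint model (block elements times `dtype_size` times `memory_multiplier`), and the 192 KiB capacity are assumptions chosen only to show how a power-of-two block size can be derived from a UB budget.

```python
def sketch_block_size(ub_capacity_bits, dtype_size, memory_multiplier,
                      dim, safety_margin=0.8):
    """Largest power-of-two block whose estimated footprint fits in the UB budget.

    Assumed footprint model: block * dtype_size * memory_multiplier bytes.
    """
    usable_bytes = (ub_capacity_bits // 8) * safety_margin
    max_elems = int(usable_bytes // (dtype_size * memory_multiplier))
    if max_elems < 1:
        raise ValueError("UB budget is too small for a single element")
    block = 1 << (max_elems.bit_length() - 1)   # round down to a power of two
    dim_pow2 = 1 << (dim - 1).bit_length()      # smallest power of two >= dim
    return min(block, dim_pow2)                 # no point exceeding the dimension

UB_BITS = 192 * 1024 * 8  # illustrative capacity, not a queried hardware value

print(sketch_block_size(UB_BITS, dtype_size=4, memory_multiplier=8.0, dim=16384))  # 4096
print(sketch_block_size(UB_BITS, dtype_size=4, memory_multiplier=8.0, dim=1000))   # 1024
```

Raising `memory_multiplier` or lowering `safety_margin` shrinks the block, which is the lever to revisit when large shapes overflow UB.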

3.2 Vector-Core-Aligned Launch Configuration

Use get_npu_core_count() to shape grid configuration.

Goals:

  • Avoid idle vector cores.
  • Avoid over-launching and scheduling overhead.
  • Match work split to hardware concurrency.
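
A minimal sketch of these goals, assuming the core count is passed in as a plain integer (on hardware it would come from `get_npu_core_count()`). The helper name and the value 40 are illustrative only.

```python
def sketch_grid_size(n_rows, rows_per_block, core_count):
    """Return (grid_size, total_blocks) for a row-blocked launch.

    The grid never exceeds the core count (no over-launch) and never
    exceeds the number of blocks (no programs without work).
    """
    total_blocks = -(-n_rows // rows_per_block)  # ceiling division
    return min(total_blocks, core_count), total_blocks

print(sketch_grid_size(n_rows=4096, rows_per_block=8, core_count=40))  # (40, 512)
print(sketch_grid_size(n_rows=100, rows_per_block=8, core_count=40))   # (13, 13)
```

When `total_blocks` exceeds the grid size, each program has to cover several blocks, which is where grid-stride scheduling comes in.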

3.3 Grid-Stride Scheduling

Use grid-stride loops for row-wise or block-wise coverage:

  • for row_idx in range(start_row, n_rows, stride)
  • for block_idx in tl.range(pid, total_blocks, num_progs)

Benefits:

  • Better load balance on large shapes.
  • More stable utilization on medium and small shapes.
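
The load-balance property can be checked on the host with a plain Python simulation of the block-wise loop above. The function name is hypothetical; `pid` and `num_progs` mirror the names used in the loop pattern.

```python
def grid_stride_assignment(total_blocks, num_progs):
    """Map each program id to the block indices a grid-stride loop would visit."""
    return {pid: list(range(pid, total_blocks, num_progs))
            for pid in range(num_progs)}

assignment = grid_stride_assignment(total_blocks=10, num_progs=4)
print(assignment)
# {0: [0, 4, 8], 1: [1, 5, 9], 2: [2, 6], 3: [3, 7]}

# Every block is visited exactly once, and loads differ by at most one block.
visited = sorted(b for blocks in assignment.values() for b in blocks)
loads = [len(blocks) for blocks in assignment.values()]
print(visited == list(range(10)), max(loads) - min(loads))  # True 1
```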

3.4 2D Blocking and Vectorized Memory Access

Common pattern: BLOCK_SIZE_M x BLOCK_SIZE_N plus 2D masked load/store.

Benefits:

  • Processes multiple rows and columns per program.
  • Reduces irregular memory access overhead.
  • Fits high-frequency paths such as embedding, normalization, and loss kernels.
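
The masking logic behind a 2D blocked load/store can be sketched in plain Python. This is a host-side illustration with hypothetical helper names, not a kernel; in a Triton kernel the same mask would typically be passed to `tl.load` and `tl.store` through their `mask` argument.

```python
def tile_mask(row_start, col_start, block_m, block_n, n_rows, n_cols):
    """Boolean block_m x block_n mask: True where the tile element is in bounds."""
    return [
        [row_start + i < n_rows and col_start + j < n_cols for j in range(block_n)]
        for i in range(block_m)
    ]

def masked_tile_copy(src, block_m, block_n):
    """Copy a 2D list tile by tile, touching only elements the mask allows."""
    n_rows, n_cols = len(src), len(src[0])
    dst = [[None] * n_cols for _ in range(n_rows)]
    for row_start in range(0, n_rows, block_m):
        for col_start in range(0, n_cols, block_n):
            mask = tile_mask(row_start, col_start, block_m, block_n, n_rows, n_cols)
            for i in range(block_m):
                for j in range(block_n):
                    if mask[i][j]:  # masked-off lanes are never read or written
                        dst[row_start + i][col_start + j] = src[row_start + i][col_start + j]
    return dst

# 3 x 5 input with 2 x 4 tiles: neither dimension is a multiple of the tile.
src = [[r * 10 + c for c in range(5)] for r in range(3)]
print(masked_tile_copy(src, block_m=2, block_n=4) == src)  # True
print(tile_mask(2, 4, 2, 4, 3, 5))
# [[True, False, False, False], [False, False, False, False]]
```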

3.5 Shape-Specific Kernel Paths

Split kernel paths by size group (for example, no-tiling vs tiling).

Principles:

  • Small shapes: avoid over-tiling overhead.
  • Large shapes: control UB pressure and increase parallel efficiency.
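
A minimal dispatch sketch, assuming the split is made on the column count against a maximum block size obtained from the tiling step. The function name and path labels are hypothetical.

```python
def choose_kernel_path(n_cols, max_block_size):
    """Pick a kernel path and the number of column tiles it needs."""
    if n_cols <= max_block_size:
        return "no_tiling", 1                   # whole row fits in one block
    num_tiles = -(-n_cols // max_block_size)    # ceiling division
    return "tiling", num_tiles

print(choose_kernel_path(n_cols=512, max_block_size=4096))    # ('no_tiling', 1)
print(choose_kernel_path(n_cols=10000, max_block_size=4096))  # ('tiling', 3)
```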

3.6 Memory and Cache Strategy

  • Load reusable vectors once and reuse when possible.
  • Control temporary tensor lifetime to reduce UB pressure.
  • Apply eviction policy in hot paths when needed (for example, evict_first).
  • Mask loads and stores strictly so out-of-bounds elements are never read or written.

3.7 Precision Path and Casting Control

Define explicitly:

  • Where statistics are accumulated (for example, fp32 for stability).
  • Where outputs are cast back to target dtype.
  • How model-specific branches (for example, Llama or Gemma behavior) are aligned.

Goal:

  • Keep numerical stability while avoiding unnecessary high-precision cost.
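
Why the accumulation precision matters can be shown with the standard library alone. The sketch below rounds values to IEEE half precision via `struct` and compares a running sum kept in fp16 against one kept in a wider type and cast back once. Python's `float` (double precision) stands in for an fp32 accumulator here, which is an assumption of the sketch; the helper names are hypothetical.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_narrow(values):
    """Accumulate in fp16: the running sum is rounded after every addition."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

def sum_wide_then_cast(values):
    """Accumulate in a wider type and cast back to fp16 once at the end."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)

values = [0.1] * 4096
print(sum_narrow(values), sum_wide_then_cast(values))
# prints: 256.0 409.5
```

The narrow accumulator stalls once the spacing between representable fp16 values exceeds twice the addend, while the single cast-back at the end keeps the result accurate at the cost of one wider register.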

3.8 On-Device Fused Dataflow

For fused operators, prefer NPU sub-kernels and avoid cross-backend mixing.

Typical direction:

  • fused_linear_cross_entropy should reuse NPU cross entropy internals.
  • fused_neighborhood_attention should reuse NPU softmax internals.

4. Standard Migration Procedure (Execution Template)

Phase A: Lock API Contract

From default operator, lock down:

  • Function class names
  • Public forward/backward symbols
  • Argument order, dtype constraints, return structure
  • ctx.save_for_backward and backward dependencies
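
Part of this contract can be checked mechanically. The sketch below compares argument order and defaults with `inspect.signature`; the helper and the three example functions are hypothetical and stand in for a default operator entry point and its NPU port.

```python
import inspect

def signature_mismatches(default_fn, npu_fn):
    """Return human-readable differences between two call signatures."""
    default_params = list(inspect.signature(default_fn).parameters.values())
    npu_params = list(inspect.signature(npu_fn).parameters.values())
    default_names = [p.name for p in default_params]
    npu_names = [p.name for p in npu_params]
    if default_names != npu_names:
        return [f"argument order differs: {default_names} vs {npu_names}"]
    problems = []
    for d, n in zip(default_params, npu_params):
        if d.default != n.default:
            problems.append(
                f"default for '{d.name}' differs: {d.default!r} vs {n.default!r}"
            )
    return problems

# Hypothetical entry points used only to exercise the check.
def default_forward(x, weight, eps=1e-6): ...
def npu_forward(x, weight, eps=1e-6): ...
def npu_forward_reordered(x, eps, weight): ...

print(signature_mismatches(default_forward, npu_forward))            # []
print(signature_mismatches(default_forward, npu_forward_reordered))
# ["argument order differs: ['x', 'weight', 'eps'] vs ['x', 'eps', 'weight']"]
```

Return structure and saved-tensor dependencies are not visible in a signature and still need a manual review.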

Phase B: Port and Apply NPU Optimizations

After copying the file to the NPU path:

  • Add UB-aware tiling logic
  • Rebuild launch/grid for NPU
  • Introduce grid-stride and 2D blocking
  • Define precision and cast behavior explicitly
  • Preserve autograd semantics (ensure_contiguous, saved tensors)

Phase C: Wire Export

Update src/liger_kernel/ops/backends/_ascend/ops/__init__.py:

  • Add imports
  • Add symbols to __all__

Phase D: Validate and Tune

Minimum validation:

  1. Forward correctness
  2. Backward parity
  3. Small/medium/large shape coverage
  4. Performance comparison with recorded kernel parameters
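
The parity checks in this list reduce to an elementwise tolerance test. The sketch below works on flat Python lists so it runs anywhere; real checks would compare framework tensors. The helper name and the tolerance values are placeholders, not thresholds agreed for this project.

```python
def parity_report(reference, candidate, atol, rtol):
    """Return (ok, max_abs_error) using |c - r| <= atol + rtol * |r| per element."""
    if len(reference) != len(candidate):
        raise ValueError("reference and candidate must have the same length")
    ok = True
    max_abs = 0.0
    for r, c in zip(reference, candidate):
        err = abs(c - r)
        max_abs = max(max_abs, err)
        if err > atol + rtol * abs(r):
            ok = False
    return ok, max_abs

reference = [1.0, 2.0, 3.0]
good = [1.0005, 1.999, 3.0]   # small deviations, within tolerance
bad = [1.0, 2.5, 3.0]         # one element clearly off

print(parity_report(reference, good, atol=1e-3, rtol=1e-3)[0])  # True
print(parity_report(reference, bad, atol=1e-3, rtol=1e-3)[0])   # False
```

Running the same report for forward outputs and for each gradient, across the small, medium, and large shape groups, gives the numbers needed for the task report.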

Performance profiling:

  • For performance analysis, use the NPU-provided msprof tool. msprof is included in CANN and requires no extra installation.

5. Migration Checklist

  • New operator file is under backends/_ascend/ops/
  • Public symbol names match default implementation
  • UB-capacity-aware tiling is implemented where applicable
  • Grid configuration is aligned to NPU vector core scale
  • Grid-stride scheduling is applied for load balance
  • 2D blocked memory access is evaluated and applied where beneficial
  • Precision and cast path are explicit and consistent
  • NPU ops/__init__.py imports and __all__ are updated
  • Top-level import from liger_kernel.ops resolves to NPU symbols
  • Forward and backward errors are within threshold
  • Performance data is recorded for at least three shape groups

6. Common Failure Patterns and Fixes

  • NPU implementation not selected:

    • Check import path (liger_kernel.ops vs submodule import)
    • Check vendor registration and NPU __all__
  • UB failures or instability on large shapes:

    • Revisit memory_multiplier
    • Revisit safety_margin and check whether the tile size is too aggressive
  • No performance gain or regression:

    • Check grid concurrency
    • Check missing 2D blocking or missing shape split
    • Check unnecessary high-precision path
  • Numerical drift:

    • Check statistic accumulation precision
    • Check cast-back timing
    • Check branch consistency between forward and backward
  • Fused kernel slowdown:

    • Check accidental fallback to default backend sub-kernels
    • Check host-side synchronization breaking on-device pipeline

7. Definition of Done

An NPU kernel migration is complete only when all conditions are met:

  1. API-compatible replacement without upper-layer call changes.
  2. Top-level import from liger_kernel.ops resolves to NPU symbols.
  3. NPU affinity optimizations are present (at minimum: UB tiling + parallelization + load balancing).
  4. Forward/backward parity is within agreed thresholds.
  5. Performance evidence exists for at least three shape groups.
  6. Parameter choices and the rationale for each shape split are documented clearly.

8. Recommended Task Report Template

Use this structure for every migration delivery:

  1. Changed files
  2. API compatibility map (symbols and signatures)
  3. NPU affinity optimizations applied
  4. Tiling and launch parameters (with rationale)
  5. Numerical parity results (forward and backward)
  6. Performance results by shape group
  7. Risks and next optimization steps

9. Existing NPU Operator Coverage (Reference)

Current NPU exports include:

  • embedding
  • rms_norm / layer_norm / poly_norm / group_norm
  • geglu / swiglu
  • rope / llama4_rope / qwen2vl_mrope
  • softmax / sparsemax
  • cross_entropy / fused_linear_cross_entropy
  • jsd / fused_linear_jsd / kl_div / tvd / grpo_loss
  • dyt
  • fused_neighborhood_attention

New migrations should follow the same export and replacement discipline.
