[NPU] RFC: Ascend CI Integration #1022

@noemotiovon

Summary

This RFC proposes integrating a Continuous Integration (CI) system for Ascend NPU into the Liger-Kernel project to continuously monitor the support status of operators in the ops/ directory on Ascend devices, ensuring code quality and functional correctness.

Background and Motivation

Current Status

Liger-Kernel already supports CI for multiple hardware platforms:

  • NVIDIA GPU (CUDA) - via Modal CI
  • AMD GPU (ROCm) - via self-hosted runner
  • Intel GPU (XPU) - via self-hosted runner

Ascend NPU support has been implemented at the code level (setup.py platform detection, backends/_ascend/ backend architecture), but lacks automated CI coverage.
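For context, backend selection with torch_npu typically boils down to an import probe plus a device check. The sketch below is illustrative only; the actual detection logic lives in setup.py and backends/_ascend/, and the helper name here is an assumption, not Liger-Kernel API:

```python
# Illustrative only -- the real detection lives in setup.py / backends/_ascend/;
# is_ascend_available is a hypothetical helper, not Liger-Kernel API.
def is_ascend_available() -> bool:
    """True if torch_npu imports cleanly and at least one NPU is visible."""
    try:
        import torch
        import torch_npu  # noqa: F401  # registers the "npu" device with torch

        return torch.npu.is_available()
    except ImportError:
        return False
```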

Completed Work

Currently, 24 operators require adaptation for Ascend devices, of which 18 operators have passed accuracy verification.

| Kernel Name | File Path | Accuracy | Description |
| --- | --- | --- | --- |
| Cross Entropy | cross_entropy.py | ✅ | Cross-entropy loss function |
| DyT | dyt.py | ✅ | DyT normalization operation |
| Embedding | experimental/embedding.py | 🟡 | Embedding layer (experimental) |
| Fused Add RMS Norm | fused_add_rms_norm.py | ✅ | Fused Add + RMS Norm |
| Fused Linear Cross Entropy | fused_linear_cross_entropy.py | 🟡 | Fused Linear + Cross Entropy |
| Fused Linear JSD | fused_linear_jsd.py | 🟡 | Fused Linear + JSD |
| Fused Neighborhood Attention | fused_neighborhood_attention.py | 🟡 | Fused neighborhood attention |
| GEGLU | geglu.py | ✅ | GELU gated linear unit |
| Group Norm | group_norm.py | ✅ | Group normalization |
| GRPO Loss | grpo_loss.py | ✅ | GRPO loss function |
| JSD | jsd.py | 🟡 | Jensen-Shannon divergence |
| KL Div | kl_div.py | ✅ | KL divergence loss |
| Layer Norm | layer_norm.py | ✅ | Layer normalization |
| Llama4 ROPE | llama4_rope.py | 🟡 | Llama4 rotary position encoding |
| Multi Token Attention | multi_token_attention.py | 🟡 | Multi-token attention |
| Poly Norm | poly_norm.py | ✅ | Polynomial normalization |
| Qwen2VL MRope | qwen2vl_mrope.py | ✅ | Qwen2VL multi-rotary position encoding |
| RMS Norm | rms_norm.py | ✅ | RMS normalization |
| ROPE | rope.py | ✅ | Rotary position encoding |
| Softmax | softmax.py | ✅ | Softmax activation function |
| Sparsemax | sparsemax.py | ✅ | Sparsemax activation function |
| SWIGLU | swiglu.py | ✅ | SiLU gated linear unit |
| Tiled MLP | tiled_mlp.py | ✅ | Tiled MLP |
| TVD | tvd.py | ✅ | TVD loss function |

Note: ✅ = accuracy verified, 🟡 = in progress.

Problem Statement

Ascend operator correctness currently relies on manual testing, with no automated CI coverage. This creates regression risk, and community contributors cannot quickly verify that Ascend-related changes are correct.

Proposal

Objectives

Establish an Ascend CI pipeline by adding a GitHub Actions workflow that continuously verifies operator correctness, triggers automatically on code changes, and catches regressions promptly.

Technical Solution

1. CI Workflow Design

Following the existing Intel and AMD CI configurations, create .github/workflows/ascend-ci.yml:

```yaml
name: Ascend NPU

on:
  push:
    branches:
      - main
    paths:
      - "src/**"
      - "test/**"
  pull_request:
    branches:
      - main
    paths:
      - "src/**"
      - "test/**"
  workflow_dispatch:  # Enable manual trigger

concurrency:
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  checkstyle:
    runs-on: ubuntu-latest
    steps:
    - name: Checkout code
      uses: actions/checkout@v6

    - name: Set up Python
      uses: actions/setup-python@v6
      with:
        python-version: '3.10'

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r dev/fmt-requirements.txt

    - name: Run checkstyle
      run: make checkstyle

  tests:
    runs-on: [self-hosted, ascend-npu]  # Requires Ascend self-hosted runner configuration
    needs: [checkstyle]  # Wait for checkstyle job to complete
    if: success()  # Only run tests when checkstyle passes
    container:
      image: ascend/ascend-toolkit:latest  # Ascend container image with CANN, torch_npu, triton-ascend pre-installed
      options: --privileged -v /dev/davinci_manager:/dev/davinci_manager -v /dev/devmm_svm:/dev/devmm_svm -v /dev/hisi_hdc:/dev/hisi_hdc --ipc=host
    steps:
    - name: Checkout code
      uses: actions/checkout@v6

    - name: Set up Python
      shell: bash
      run: |
        # Python should be pre-installed in the container image
        python --version

    - name: Verify NPU availability
      shell: bash
      run: |
        npu-smi info
        echo "NPU devices available"

    - name: Setup Dependencies
      shell: bash
      run: |
        python -m pip install --upgrade pip
        pip install -e .[dev]
        # torch_npu and triton-ascend are pre-installed in the container image

    - name: List Python Environments
      shell: bash
      run: python -m pip list

    - name: Run Unit Tests
      shell: bash
      run: |
        # Run tests only after checkstyle passes
        # Initial phase: Run only test cases in test/transformers directory
        python -m pytest test/transformers/ --disable-warnings
        # Future phases will expand to full test suite:
        # make test
        # make test-convergence

Note: The actual configuration needs to be adjusted based on the specific Ascend runner environment, particularly:

  • Runner label configuration (self-hosted, ascend-npu)
  • Container image with CANN toolkit, torch_npu, and triton-ascend pre-installed
  • Container options for NPU device access (device mounting, privileged mode, etc.)

2. Runtime Environment Requirements

  • Hardware: Ascend NPU device (e.g., Atlas 800I A2)
  • Container Image: Pre-built Docker image containing:
    • Python 3.10+
    • PyTorch with torch_npu (2.7.1)
    • triton-ascend
    • Ascend CANN toolkit
  • Container Configuration: Privileged mode and device mounting for NPU access (a smoke-test sketch follows this list)
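A container that meets these requirements can be sanity-checked before running the suite. Below is a minimal smoke test, assuming the standard torch.npu namespace that torch_npu registers; it is a sketch, not project code:

```python
import torch
import torch_npu  # noqa: F401  # registers the "npu" device with torch

# Fail fast if the container image is missing a runtime requirement.
assert torch.npu.is_available(), "No Ascend NPU visible inside the container"

# One tiny op end-to-end confirms the CANN stack actually executes kernels.
x = torch.randn(4, 4, device="npu")
y = (x @ x).sum()
print(f"torch {torch.__version__}, {torch.npu.device_count()} NPU(s), checksum {y.item():.4f}")
```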

3. Test Scope

Initial Phase: CI will execute test cases in the test/transformers directory to verify the forward and backward propagation correctness of Ascend-adapted operators.

Future Phases: Gradually expand to the full test suite:

  • Unit Tests (make test): Verify all Ascend-adapted operators
  • Convergence Tests (make test-convergence): Verify operator convergence on complete models

The code style check (make checkstyle) runs as an independent job in all phases.

Implementation Plan

  1. Infrastructure Preparation: Configure Ascend self-hosted runner, install CANN toolkit and PyTorch + torch_npu environment
  2. CI Workflow Development: Create .github/workflows/ascend-ci.yml, implement environment setup, dependency installation, and test execution

Technical Details

Runner Configuration

Self-hosted runner configuration is required (GitHub-hosted runners do not provide Ascend hardware):

  • Labels: ascend-npu or ascend-910b4
  • Operating System: Ubuntu 20.04/22.04 (according to CANN requirements)
  • Hardware: At least one Ascend device (e.g., Atlas 800I A2)

Environment Variables

CANN-related environment variables need to be set, most notably ASCEND_RT_VISIBLE_DEVICES, which controls which NPU devices a process can see.
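ASCEND_RT_VISIBLE_DEVICES plays the same role as CUDA_VISIBLE_DEVICES and must be set before the NPU runtime initializes. A sketch of how a CI step might pin one device; the index 0 is an assumption about the runner's device layout:

```python
import os

# Must happen before torch_npu initializes the runtime; "0" is an assumed index.
os.environ.setdefault("ASCEND_RT_VISIBLE_DEVICES", "0")

import torch
import torch_npu  # noqa: F401

print(f"Visible NPUs: {torch.npu.device_count()}")  # expect 1 with the pin above
```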

Testing Strategy

  • Test only Ascend-adapted operators to avoid CI failures from unadapted operators
  • Use pytest.mark.skip or xfail for known issues (see the sketch after this list)
  • If multiple NPUs are available, tests can be executed in parallel
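For kernels still in the 🟡 column, standard pytest markers keep the pipeline green without hiding the gap. A sketch; the test names and reason strings are placeholders:

```python
import pytest


# Skip outright when a kernel has no Ascend adaptation yet.
@pytest.mark.skip(reason="not yet adapted for Ascend NPU")
def test_unadapted_kernel():
    ...


# xfail when a kernel runs but a known accuracy gap is still being tracked;
# strict=False keeps CI green even if the test unexpectedly passes.
@pytest.mark.xfail(reason="accuracy under verification on Ascend", strict=False)
def test_kernel_with_known_gap():
    ...
```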

Future Work

  • Integrate performance benchmarks in CI to track operator performance changes (see the timing sketch after this list)
  • Extend to more Ascend device models (e.g., Ascend310, Ascend910A, etc.)
  • Continuously adapt more operators to improve Ascend support coverage
  • Supplement Ascend usage documentation, best practices, and troubleshooting guides
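On the benchmarking item: NPU kernels launch asynchronously, so wall-clock timing needs explicit synchronization around the timed region. A minimal sketch, assuming torch.npu.synchronize() from torch_npu; the shape and iteration counts are arbitrary:

```python
import time

import torch
import torch_npu  # noqa: F401


def npu_time_ms(fn, warmup: int = 10, iters: int = 100) -> float:
    """Mean wall time per call in milliseconds, with NPU-side synchronization."""
    for _ in range(warmup):
        fn()
    torch.npu.synchronize()  # drain async launches before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.npu.synchronize()  # ensure all timed work has finished
    return (time.perf_counter() - start) * 1000 / iters


x = torch.randn(4096, 4096, device="npu")
print(f"matmul: {npu_time_ms(lambda: x @ x):.3f} ms")
```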

Conclusion

Integrating Ascend CI will significantly improve Liger-Kernel's support quality for the Ascend platform, ensuring that code changes do not break Ascend operator functionality. Although some infrastructure investment is required, it will greatly reduce maintenance costs and improve community collaboration efficiency in the long run.

Alternatives

No response

Additional context

No response
