Added LongRoPe Model Causal Mask Pattern Fusion #2473

Draft

tadani3 wants to merge 29 commits into microsoft:main from tadani3:longrope_attention_causal_mask

tadani3 commented Aug 1, 2025

This PR introduces a LongRoPe (Long Range Rotary Position Embedding) GQA (Group Query Attention) causal mask fusion rule designed for Phi-4-mini-reasoning and similar models. The fusion optimizes attention mask computation for models that combine sliding window attention with LongRoPe position embeddings.

New LongRoPeGQACausalMask Class

Specialized Pattern Matching: Implements complex pattern matching for LongRoPe attention mechanisms with sliding window support.
Mask Caching: Introduces caching using _get_mask_key() to avoid recomputation of expensive mask operations across layers.
Sliding Window Support: Handles sliding window attention for long contexts; the window size is currently hardcoded to 262144 rather than read from the model.
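
The mask caching idea can be sketched roughly as follows. This is an illustration only, not the code in this PR: `MaskCache`, `get_or_compute`, `derive_lengths`, and the use of `id()` as the cache key are assumptions made for the example, and the PR's `_get_mask_key()` works on an IR value whose details are not shown on this page.

```python
from typing import Any, Callable, Dict, Hashable, Tuple


def get_mask_key(attention_mask: Any) -> Hashable:
    """Hypothetical key function: identify a mask by object identity.

    id() is only stable while the object is alive, so the cache below
    must not outlive the masks it refers to.
    """
    return id(attention_mask)


class MaskCache:
    """Caches values derived from an attention mask so that every layer
    matching the same mask reuses them instead of rebuilding them."""

    def __init__(self) -> None:
        self._cache: Dict[Hashable, Tuple[Any, Any]] = {}
        self.misses = 0  # counts how often the derived values were built

    def get_or_compute(
        self, attention_mask: Any, compute: Callable[[Any], Tuple[Any, Any]]
    ) -> Tuple[Any, Any]:
        key = get_mask_key(attention_mask)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = compute(attention_mask)
        return self._cache[key]

    def cleanup(self) -> None:
        """Drop all entries, for example once a model has been rewritten."""
        self._cache.clear()


def derive_lengths(mask):
    """Illustrative stand-in for the expensive computation:
    returns (total sequence length, index of the last valid token)."""
    row = mask[0]
    return len(row), sum(row) - 1


cache = MaskCache()
mask = [[1, 1, 1, 0]]
for _ in range(3):  # e.g. three attention layers sharing one mask
    result = cache.get_or_compute(mask, derive_lengths)
```

In this sketch the three lookups build the derived values once and reuse them twice; `cleanup()` mirrors the cache cleanup method mentioned in the commit list.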

Advanced Mask Computation

Multi-Branch Processing: Implements three parallel branches for KV range, query range, and batch processing.
Efficient Range Operations: Uses optimized tensor operations for creating position-based masks.
Boolean Logic Optimization: Combines sliding window masks with attention mask lookups using efficient boolean operations.
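
For readers unfamiliar with the mask being fused, the sketch below shows one common way a sliding window causal mask is built from a KV range, a query range, and a batch range. It is written in plain Python for clarity and is an assumption about the general technique, not a transcription of the ONNX subgraph this PR matches; the function name and the exact comparison conventions are hypothetical.

```python
def sliding_window_causal_mask(attention_mask, query_len, window):
    """Build a boolean mask of shape [batch][query][kv].

    attention_mask: list of rows (one per batch item) of 0/1 flags over
                    the total KV length (past + current tokens).
    query_len:      number of query tokens in the current step.
    window:         sliding window size in tokens.
    """
    total_len = len(attention_mask[0])
    past_len = total_len - query_len

    kv_range = range(total_len)                            # KV positions
    query_range = range(past_len, past_len + query_len)    # absolute query positions
    batch_range = range(len(attention_mask))               # batch indices

    return [
        [
            [
                # causal: a query may not look at later positions
                (kv <= q)
                # sliding window: only the most recent `window` positions
                and (kv > q - window)
                # padding lookup from the 2D attention mask
                and bool(attention_mask[b][kv])
                for kv in kv_range
            ]
            for q in query_range
        ]
        for b in batch_range
    ]


mask = sliding_window_causal_mask([[1, 1, 1, 1]], query_len=2, window=2)
```

With a window as large as 262144 the window term only matters for very long sequences, so for short inputs the result reduces to an ordinary causal mask combined with the padding lookup.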

Note: This PR is meant to replace #2461 by introducing the requested changes.

tadani3 and others added 13 commits

July 24, 2025 02:35


          Added Causal Mask Pattern Fusion for LongRoPe Models

7bd391d


          Added Phi4-mini-reasoning cache insertion and position Id deletion logic

f0f41a8


          Merge branch 'main' into longrope_causal_mask

189d0c8


          Removed whitespace from gqa longrope fusion

758e92d


          Added docstrings to GQA pattern method

d4a8c57


          Renamed pattern branches to match kv_range, query_range, and batch_ra…

30faab7

…nge computation


          Merge branch 'longrope_causal_mask' of https://github.com/tadani3/onn…

01e37b3

…xscript into longrope_causal_mask


          Removed unnecessary pattern variable

912a80b


          Added snake casing for variable names

fd95719


          Added more snake casing and removed unneeded code

19d2656


          Moved get_mask_key method to module level and used IR value directly

0742db2


          Added cleanup method for the attention mask cache

2772f77


          Added LongRoPE GQA Causal Mask Fusion Separately

87a0464

github-project-automation bot added this to ONNX Script Review Board

github-project-automation bot moved this to Todo in ONNX Script Review Board

github-advanced-security bot found potential problems


onnxscript/rewriter/ort_fusions/gqa.py

    
                      """

                      Pattern for LongRoPe GQA Causal Mask.
                      This pattern computes the causal mask for Group Query Attention with LongRoPe.
                      It constructs the mask based on input_ids and past_kv_cache, and handles the

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable total_seq_length_int32 is not used.

onnxscript/rewriter/ort_fusions/gqa.py

    
                      """

                      Pattern for LongRoPe GQA Causal Mask.
                      This pattern computes the causal mask for Group Query Attention with LongRoPe.
                      It constructs the mask based on input_ids and past_kv_cache, and handles the

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable seqlens_k_int32 is not used.

onnxscript/rewriter/ort_fusions/longrope_gqa.py

    
                      mask_key = _get_mask_key(attention_mask)
                      if mask_key in self._mask_cache:
                          total_seq_length_int32, seqlens_k_int32 = self._mask_cache[mask_key]

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable total_seq_length_int32 is not used.

onnxscript/rewriter/ort_fusions/longrope_gqa.py

    
                      mask_key = _get_mask_key(attention_mask)
                      if mask_key in self._mask_cache:
                          total_seq_length_int32, seqlens_k_int32 = self._mask_cache[mask_key]

Check notice

Code scanning / CodeQL

Unused local variable Note

Variable seqlens_k_int32 is not used.

onnxscript/rewriter/phi4_mini_reasoning_post_processor.py

    
              # Licensed under the MIT License.  See License.txt in the project root for
              # license information.
              # --------------------------------------------------------------------------
              import onnx

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'onnx' is not used.

onnxscript/rewriter/phi4_mini_reasoning_post_processor.py

    
              # --------------------------------------------------------------------------
              import onnx
              from onnxscript import ir
              import onnx.helper

Check notice

Code scanning / CodeQL

Unused import Note

Import of 'onnx' is not used.

onnxscript/rewriter/phi4_mini_reasoning_post_processor.py

    
                      cache_length = self.rotemb_attrs["cache_length"]
                      position_ids = torch.arange(cache_length, dtype=torch.int64).unsqueeze(0)  # Shape: (1, cache_length)
                      inv_freq_expanded = inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)  # (1, dim//2, 1)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable Error

Local variable 'inv_freq' may be used before it is initialized.

onnxscript/rewriter/phi4_mini_reasoning_post_processor.py

    
                      with torch.autocast(device_type=device_type, enabled=False):
                          freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)  # (1, cache_length, dim//2)
                          emb = torch.cat((freqs, freqs), dim=-1)  # (1, cache_length, dim)
                          cos_cache = emb.cos() * attention_factor  # (1, cache_length, dim)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable Error

Local variable 'attention_factor' may be used before it is initialized.
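
The cache computation in the snippet above can be followed more easily in a dependency-free form. The sketch below mirrors the shown steps (angles as position times inverse frequency, duplication to the full dimension, scaling by `attention_factor`) using only the standard library. The function name is hypothetical, and the sine cache is an assumption by symmetry, since only `cos_cache` appears in the reviewed lines.

```python
import math


def build_rotary_caches(inv_freq, cache_length, attention_factor=1.0):
    """Return (cos_cache, sin_cache), each of shape [cache_length][dim].

    inv_freq holds dim // 2 inverse frequencies.
    """
    cos_cache, sin_cache = [], []
    for pos in range(cache_length):
        # One angle per frequency: the matmul of inv_freq with position ids
        freqs = [pos * f for f in inv_freq]
        # Duplicate to the full rotary dimension, like torch.cat((freqs, freqs), dim=-1)
        emb = freqs + freqs
        cos_cache.append([math.cos(a) * attention_factor for a in emb])
        # Assumed counterpart of cos_cache; not shown in the reviewed snippet
        sin_cache.append([math.sin(a) * attention_factor for a in emb])
    return cos_cache, sin_cache


cos_cache, sin_cache = build_rotary_caches([1.0, 0.5], cache_length=3, attention_factor=2.0)
```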

onnxscript/rewriter/phi4_mini_reasoning_post_processor.py

    
                              attention_factor = self.rotemb_attrs["multi_cache"]["short_mscale"]
                      inv_freq_shape = torch.arange(0, dim, 2, dtype=torch.int64, device="cpu").float() / dim
                      inv_freq = 1.0 / (ext_factors * base**inv_freq_shape)

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable Error

Local variable 'ext_factors' may be used before it is initialized.

onnxscript/rewriter/phi4_mini_reasoning_post_processor.py

    
                      if "rescale_inv_freq" in self.rotemb_attrs:
                          inv_freq = self.make_inv_freq_rescaled(inv_freq)
                      return inv_freq, attention_factor

Check failure

Code scanning / CodeQL

Potentially uninitialized local variable Error

Local variable 'attention_factor' may be used before it is initialized.
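
One way to address the "potentially uninitialized local variable" findings is to bind `ext_factors` and `attention_factor` on every path, for example with an explicit `if`/`else`. The sketch below illustrates that pattern together with the inverse-frequency formula from the reviewed snippet. Apart from `short_mscale`, the dictionary keys (`long_factor`, `short_factor`, `long_mscale`, `original_max_position_embeddings`) and both function names are assumptions for this example, not names confirmed by the PR.

```python
def select_longrope_factors(multi_cache, cache_length):
    """Return (ext_factors, attention_factor), binding both on every path."""
    if cache_length > multi_cache["original_max_position_embeddings"]:
        ext_factors = multi_cache["long_factor"]
        attention_factor = multi_cache["long_mscale"]
    else:
        # The explicit else removes the path on which the names were unbound
        ext_factors = multi_cache["short_factor"]
        attention_factor = multi_cache["short_mscale"]
    return ext_factors, attention_factor


def compute_inv_freq(ext_factors, base, dim):
    """inv_freq[i] = 1 / (ext_factors[i] * base ** (2 * i / dim))."""
    exponents = [i / dim for i in range(0, dim, 2)]
    if len(ext_factors) != len(exponents):
        raise ValueError("expected dim // 2 extension factors")
    return [1.0 / (f * base**e) for f, e in zip(ext_factors, exponents)]


config = {
    "original_max_position_embeddings": 4096,  # illustrative values only
    "long_factor": [1.0, 4.0],
    "short_factor": [1.0, 2.0],
    "long_mscale": 1.2,
    "short_mscale": 1.0,
}
ext_factors, attention_factor = select_longrope_factors(config, cache_length=8)
inv_freq = compute_inv_freq(ext_factors, base=10000.0, dim=4)
```

Raising an error for an unexpected configuration, instead of falling through silently, would satisfy the analyzer equally well.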

github-advanced-security bot found potential problems


Contributor

github-advanced-security bot left a comment

lintrunner found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

tadani3 and others added 13 commits

August 1, 2025 17:40


          Removed whitespace from gqa longrope fusion

f12630c


          Added docstrings to GQA pattern method


          Renamed pattern branches to match kv_range, query_range, and batch_ra…

e59cb83

…nge computation


          Remove DORT related tests since it was removed from PyTorch (microsof…

bad7811

…t#2465)

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>


          Handle matching against None explicitly (microsoft#2460)

19f5e65

Provide a way to indicate that a pattern-variable can match successfully
against a None-valued input. Cleanup current handling which was
inconsistent in one place. Add test cases.

---------

Signed-off-by: Ganesan Ramalingam <grama@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>


          [docs] Document rewriter pattern options (microsoft#2406)

17c117f

This PR adds comprehensive documentation for the rewriter pattern
options that were previously undocumented. The rewriter pattern system
supports four key options for controlling pattern matching and
replacement behavior:

## New Documentation Added

### `_allow_other_inputs` option
- **File**: `docs/tutorial/rewriter/allow_other_inputs.md`
- **Purpose**: Controls whether patterns can match nodes with additional
inputs beyond those specified
- **Default**: `False` (exact input matching)
- **Example**: Matching `Conv` operations that may have optional bias
inputs

```python
def conv_pattern(op, input, weight):
    # Matches Conv with 2 or 3 inputs (weight + optional bias)
    return op.Conv(input, weight, _allow_other_inputs=True)
```

### `_domain` option  
- **File**: `docs/tutorial/rewriter/domain_option.md`
- **Purpose**: Specifies operator domains for pattern matching and
replacement
- **Use cases**: Domain-specific rewrites, migrating between operator
domains
- **Example**: Targeting operations from specific domains like
"com.microsoft"

```python
def custom_relu_pattern(op, input):
    # Only matches Relu from custom domain
    return op.Relu(input, _domain="custom.domain")
```

### `_outputs` option
- **File**: `docs/tutorial/rewriter/outputs_option.md` 
- **Purpose**: Specifies number and names of operation outputs
- **Formats**: Integer count (`_outputs=2`) or named list
(`_outputs=["first", "second"]`)
- **Example**: Handling multi-output operations like `Split`

```python
def split_pattern(op, input):
    # Matches Split operations with exactly 2 outputs
    return op.Split(input, num_outputs=2, axis=0, _outputs=2)
```

### Enhanced `_allow_other_attributes` documentation
- **File**: `docs/tutorial/rewriter/attributes.md` (improved formatting)
- **Already documented**: Controls whether patterns match nodes with
additional attributes
- **Default**: `True` (allows extra attributes)

## Documentation Structure Improvements

- Added "Pattern Options" section to main rewriter documentation
- Integrated all option docs into the tutorial flow
- Created working code examples for each option
- Followed existing documentation patterns and style
- All examples compile and run successfully
- Documentation builds correctly with Sphinx

The documentation now provides complete coverage of all rewriter pattern
options with practical examples showing real-world usage patterns.

Fixes microsoft#2405.


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: justinchuby <11205048+justinchuby@users.noreply.github.com>
Co-authored-by: gramalingam <10075881+gramalingam@users.noreply.github.com>


          Update requirements-ort-nightly.txt (microsoft#2471)

3fb87c0


          Fix logic for converting np array to text (microsoft#2470)

127aee8

In onnx2script, nan, inf etc. were converted to plain text, which causes
evaluation to fail because they don't exist in the script. I updated the
logic to replace them with np. values.

---------

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
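
The idea behind the nan/inf fix can be illustrated with a small formatter that emits names which exist in a script that imports NumPy as `np`. This is a sketch of the general approach, not the code changed in onnx2script; the function name is hypothetical.

```python
import math


def float_to_script_text(value: float) -> str:
    """Render a float as source text that evaluates when numpy is imported as np."""
    if math.isnan(value):
        return "np.nan"  # a bare "nan" is not a defined name in the script
    if math.isinf(value):
        return "np.inf" if value > 0 else "-np.inf"
    return repr(value)  # ordinary floats round-trip through repr


examples = [float_to_script_text(v) for v in (float("nan"), float("inf"), float("-inf"), 1.5)]
```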


          [torchlib] Improves aten_chunk conversion (microsoft#2469)

131e497

Simplify implementation for `aten_chunk` and allow it to work on all
data types.

Original author: @xadupre 
Updated: Conditionally use the new implementation when torch>=2.7

---------

Signed-off-by: Justin Chu <justinchuby@users.noreply.github.com>
Co-authored-by: Xavier Dupré <xadupre@users.noreply.github.com>
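
For reference, the splitting rule that `torch.chunk` follows, and that any conversion of `aten_chunk` has to reproduce, can be written in a few lines. The helper below is a hypothetical illustration of the chunk sizes only and is not the torchlib implementation.

```python
def chunk_sizes(dim_size: int, chunks: int):
    """Sizes of the pieces torch.chunk produces along one dimension.

    Each piece has ceil(dim_size / chunks) elements except possibly the
    last, and fewer than `chunks` pieces may be returned.
    """
    if dim_size <= 0 or chunks <= 0:
        raise ValueError("dim_size and chunks must be positive")
    size = -(-dim_size // chunks)        # ceiling division
    full, rest = divmod(dim_size, size)  # number of full pieces and the remainder
    return [size] * full + ([rest] if rest else [])


sizes_5_3 = chunk_sizes(5, 3)
sizes_6_4 = chunk_sizes(6, 4)
```

The second call shows the case where fewer pieces than requested come back, since a piece size of 2 covers 6 elements in 3 pieces.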


          Removed unnecessary pattern variable

acdfd1b


          Added snake casing for variable names

76624ad


          Added more snake casing and removed unneeded code

fbb191a


          Moved get_mask_key method to module level and used IR value directly

f295bc5

tadani3 added 3 commits

August 1, 2025 17:40


          Added cleanup method for the attention mask cache

0334bb1


          Added LongRoPE GQA Causal Mask Fusion Separately

74e8e24


          Merge branch 'longrope_attention_causal_mask' of https://github.com/t…

d5383f0

…adani3/onnxscript into longrope_attention_causal_mask

codecov bot commented Aug 1, 2025

Codecov Report

❌ Patch coverage is 41.01509% with 430 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.01%. Comparing base (da23d76) to head (d5383f0).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...ipt/rewriter/phi4_mini_reasoning_post_processor.py | 18.24% | 345 Missing ⚠️ |
| onnxscript/rewriter/ort_fusions/longrope_gqa.py | 65.42% | 65 Missing ⚠️ |
| onnxscript/rewriter/ort_fusions/gqa.py | 83.19% | 20 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2473      +/-   ##
==========================================
- Coverage   69.81%   69.01%   -0.81%     
==========================================
  Files         209      211       +2     
  Lines       25313    25978     +665     
  Branches     2525     2612      +87     
==========================================
+ Hits        17673    17928     +255     
- Misses       6762     7175     +413     
+ Partials      878      875       -3
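
The headline percentages in this report can be cross-checked from the raw counts in the diff table. The short calculation below does that; the variable names are ad hoc, and the patch line total is inferred from the reported percentage rather than stated by Codecov.

```python
# Counts taken from the coverage diff table
base_hits, base_lines = 17673, 25313
head_hits, head_lines = 17928, 25978

base_cov = 100 * base_hits / base_lines  # project coverage on main
head_cov = 100 * head_hits / head_lines  # project coverage with this PR
delta = head_cov - base_cov

# Lines missing coverage, summed over the three files in the report
missing = 345 + 65 + 20

# Inferred: total changed lines implied by 41.01509% patch coverage
patch_lines = round(missing / (1 - 41.01509 / 100))
patch_cov = 100 * (patch_lines - missing) / patch_lines

print(f"base={base_cov:.3f}% head={head_cov:.3f}% delta={delta:.3f}")
print(f"patch: {patch_lines - missing}/{patch_lines} lines covered = {patch_cov:.5f}%")
```

Most of the uncovered lines sit in the post-processor file, so tests for that file would move the patch figure the most.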


tadani3 marked this pull request as draft

August 1, 2025 17:54


Labels

None yet