Conversation

Crystal-wzy commented Feb 6, 2026

Add a paged attention implementation using the PyPTO pl.function frontend, split into four kernels:

  • qk_matmul (AIC): Q @ K^T matrix multiplication
  • softmax_prepare (AIV): scale, rowmax, exp, rowsum
  • pv_matmul (AIC): P @ V matrix multiplication
  • online_update (AIV): online softmax accumulation + fused normalization

The online_update kernel demonstrates conditional control flow using Python native if/else with pl.yield_() for SSA phi node semantics. (A plain-NumPy sketch of the four-stage flow appears after this description.)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
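
For reviewers who want the math without the PyPTO plumbing, here is a minimal plain-NumPy sketch of the same four-stage flow. Shapes, variable names, and the paging loop are illustrative assumptions, not code from this PR:

    import numpy as np

    def paged_attention_reference(q, k_pages, v_pages, scale):
        """q: [M, D]; k_pages, v_pages: lists of per-page [N, D] arrays."""
        m_i = np.full((q.shape[0], 1), -np.inf)            # running row max
        l_i = np.zeros((q.shape[0], 1))                    # running row sum
        o_i = np.zeros((q.shape[0], v_pages[0].shape[1]))  # running output

        for k, v in zip(k_pages, v_pages):
            s = q @ k.T                              # qk_matmul: Q @ K^T
            s = s * scale                            # softmax_prepare: scale,
            m_ij = s.max(axis=1, keepdims=True)      #   rowmax,
            p = np.exp(s - m_ij)                     #   exp,
            l_ij = p.sum(axis=1, keepdims=True)      #   rowsum
            o_ij = p @ v                             # pv_matmul: P @ V

            m_new = np.maximum(m_i, m_ij)            # online_update: rescale the
            alpha = np.exp(m_i - m_new)              #   running statistics and
            beta = np.exp(m_ij - m_new)              #   merge in the new page
            l_i = alpha * l_i + beta * l_ij
            o_i = alpha * o_i + beta * o_ij
            m_i = m_new

        return o_i / l_i                             # fused normalization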

gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Crystal-wzy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a comprehensive example of Paged Attention, a technique used in large language models, implemented within the PyPTO framework. It demonstrates how to structure complex computational graphs into modular kernels using PyPTO's pl.function frontend. The example highlights both matrix multiplication and vector operations, with a particular focus on showcasing advanced control flow patterns using Python's native if/else constructs translated into PyPTO's intermediate representation.

Highlights

  • New Paged Attention Example: Introduces a new example for Paged Attention, implemented using the PyPTO pl.function frontend, showcasing a common deep learning mechanism.
  • Four Core Kernels: The implementation is broken down into four distinct kernels: qk_matmul (Q @ K^T), softmax_prepare (scale, rowmax, exp, rowsum), pv_matmul (P @ V), and online_update (online softmax accumulation + fused normalization).
  • Conditional Control Flow Demonstration: The online_update kernel demonstrates the use of Python native if/else statements combined with pl.yield_() to achieve SSA phi node semantics for conditional control flow within PyPTO (a hypothetical sketch of the pattern follows this list).
  • MLIR Compilation and Testing: The example includes functionality to compile the Paged Attention program to MLIR (Multi-Level Intermediate Representation) and run basic verification tests to ensure all four kernels are correctly generated.
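
For readers unfamiliar with that pattern, a sketch of the shape it takes is below. The exact pl.yield_() signature and the values around it are assumptions based on this summary, not code from the PR:

    # Hypothetical illustration only: each branch arm yields its value,
    # so the assignment behaves like an SSA phi node merging the two arms.
    if is_first_block:
        o_acc = pl.yield_(o_new)
    else:
        o_acc = pl.yield_(o_merged)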


Changelog
  • examples/ir_parser/paged_attention.py
    • Added a new file implementing Paged Attention using PyPTO's pl.function frontend.
    • Defined the PagedAttention class as a pl.program containing four pl.function kernels: qk_matmul, softmax_prepare, pv_matmul, and online_update.
    • Implemented qk_matmul for query-key matrix multiplication, including loading, moving to L0A/L0B with transpose, and matrix multiplication.
    • Implemented softmax_prepare for scaling, row-wise maximum, exponential, and row-wise sum operations.
    • Implemented pv_matmul for probability-value matrix multiplication.
    • Implemented online_update to handle online softmax accumulation and fused normalization, demonstrating conditional logic with if/else and pl.yield_() for SSA semantics.
    • Included a main execution block (if __name__ == "__main__":) for parsing arguments to compile the program to MLIR or run verification tests.

gemini-code-assist bot left a comment


Code Review

The pull request introduces a comprehensive example of Paged Attention using the PyPTO pl.function frontend. It effectively demonstrates the breakdown of a complex operation into four distinct kernels, showcasing the use of pl.Tensor, pl.Scalar, and various pl.op.block operations. The implementation also provides a good illustration of conditional control flow with pl.yield_() for SSA phi node semantics. The inclusion of a main block for compilation and testing adds significant value by providing a runnable demonstration of the framework's capabilities.

# Copyright (c) PyPTO Contributors.
# Paged Attention implementation using PyPTO
#
# Reference: /data/w00949583/simpler/examples/paged_attention_sim/kernels/

medium

The reference path /data/w00949583/simpler/examples/paged_attention_sim/kernels/ appears to be an absolute path specific to a local development environment. For better portability and maintainability, consider making this a relative path within the repository or a more generic reference if it's intended for public consumption.

Comment on lines +18 to +20
TILE_M = 16 # num_heads tile size
TILE_N = 16 # block_size / head_dim_chunk tile size
SCALE = 0.0884 # 1/sqrt(head_dim) = 1/sqrt(128) ~ 0.0884

medium

The SCALE constant is derived from 1/sqrt(head_dim) = 1/sqrt(128). It would improve clarity and maintainability if head_dim were defined as a constant (e.g., HEAD_DIM = 128) and SCALE were calculated from it. This makes the relationship explicit and easier to modify if head_dim changes.

Suggested change
 TILE_M = 16  # num_heads tile size
 TILE_N = 16  # block_size / head_dim_chunk tile size
-SCALE = 0.0884  # 1/sqrt(head_dim) = 1/sqrt(128) ~ 0.0884
+HEAD_DIM = 128
+SCALE = 1 / (HEAD_DIM**0.5)  # 1/sqrt(head_dim)
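
As a quick arithmetic check on the constant (plain Python, nothing PyPTO-specific):

    import math
    print(1 / math.sqrt(128))  # 0.08838834764831845, matching the ~0.0884 in the comment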

li_updated = pl.op.block.add(li_scaled, lij_scaled)

# Update accumulated output: oi = alpha * oi + beta * oi_new
# Use row_expand_mul for broadcasting [M,1] * [M,N]

medium

The comment # Use row_expand_mul for broadcasting [M,1] * [M,N] is slightly misleading. row_expand_mul performs element-wise multiplication where the [M,1] vector is broadcast across the columns of the [M,N] tile. Clarifying the comment to reflect this broadcasting behavior would be helpful.

Suggested change
-# Use row_expand_mul for broadcasting [M,1] * [M,N]
+# Use row_expand_mul for broadcasting [M,1] across [M,N]
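
In plain NumPy terms (an analogy for the intended semantics, not the row_expand_mul implementation), the broadcast the comment should describe looks like:

    import numpy as np

    alpha = np.ones((4, 1)) * 2.0          # [M, 1] per-row scale factors
    tile = np.arange(32.0).reshape(4, 8)   # [M, N] tile
    out = alpha * tile                     # alpha's single column is broadcast
                                           # across all N columns of the tile
    assert out.shape == (4, 8)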

Comment on lines +249 to +251
import os
os.makedirs("/data/w00949583/pypto/build_output", exist_ok=True)
output_path = "/data/w00949583/pypto/build_output/paged_attention.mlir"

medium

The output directory and filename for the generated MLIR code are hardcoded. To make this example more flexible and reusable in different environments or for different test cases, consider making these paths configurable, perhaps through command-line arguments or environment variables.

Suggested change
-import os
-os.makedirs("/data/w00949583/pypto/build_output", exist_ok=True)
-output_path = "/data/w00949583/pypto/build_output/paged_attention.mlir"
+# Save to file
+import os
+output_dir = "./build_output"
+os.makedirs(output_dir, exist_ok=True)
+output_filename = "paged_attention.mlir"
+output_path = os.path.join(output_dir, output_filename)
 with open(output_path, "w") as f:
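
Taking the suggestion one step further, the path could come from the command line. A minimal sketch, assuming an --output flag that is not part of this PR:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default="./build_output/paged_attention.mlir",
                        help="where to write the generated MLIR")
    args = parser.parse_args()

    mlir_text = "..."  # stands in for the compiled MLIR module (hypothetical name)
    os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
    with open(args.output, "w") as f:
        f.write(mlir_text)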

lyfne123 changed the title from "Add: Paged attention example using pl.function frontend syntax" to "feat(op): Paged attention example using pl.function frontend syntax" on Feb 6, 2026