Conversation

Crystal-wzy commented Feb 6, 2026

Add a paged attention implementation using the PyPTO pl.function frontend, split into four kernels:

  • qk_matmul (AIC): Q @ K^T matrix multiplication
  • softmax_prepare (AIV): scale, rowmax, exp, rowsum
  • pv_matmul (AIC): P @ V matrix multiplication
  • online_update (AIV): online softmax accumulation + fused normalization

The online_update kernel demonstrates conditional control flow using Python native if/else with pl.yield_() for SSA phi node semantics. (A plain-NumPy sketch of the four-stage flow appears after this description.)

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
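
For reviewers who want the math without the PyPTO plumbing, here is a minimal plain-NumPy sketch of the same four-stage flow. Shapes, variable names, and the paging loop are illustrative assumptions, not code from this PR:

    import numpy as np

    def paged_attention_reference(q, k_pages, v_pages, scale):
        """q: [M, D]; k_pages, v_pages: lists of per-page [N, D] arrays."""
        m_i = np.full((q.shape[0], 1), -np.inf)            # running row max
        l_i = np.zeros((q.shape[0], 1))                    # running row sum
        o_i = np.zeros((q.shape[0], v_pages[0].shape[1]))  # running output

        for k, v in zip(k_pages, v_pages):
            s = q @ k.T                              # qk_matmul: Q @ K^T
            s = s * scale                            # softmax_prepare: scale,
            m_ij = s.max(axis=1, keepdims=True)      #   rowmax,
            p = np.exp(s - m_ij)                     #   exp,
            l_ij = p.sum(axis=1, keepdims=True)      #   rowsum
            o_ij = p @ v                             # pv_matmul: P @ V

            m_new = np.maximum(m_i, m_ij)            # online_update: rescale the
            alpha = np.exp(m_i - m_new)              #   running statistics and
            beta = np.exp(m_ij - m_new)              #   merge in the new page
            l_i = alpha * l_i + beta * l_ij
            o_i = alpha * o_i + beta * o_ij
            m_i = m_new

        return o_i / l_i                             # fused normalization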

gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Crystal-wzy, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds a comprehensive example of Paged Attention, a technique used in large language models, implemented within the PyPTO framework. It demonstrates how to structure complex computational graphs into modular kernels using PyPTO's pl.function frontend. The example highlights both matrix multiplication and vector operations, with a particular focus on showcasing advanced control flow patterns using Python's native if/else constructs translated into PyPTO's intermediate representation.

Highlights

  • New Paged Attention Example: Introduces a new example for Paged Attention, implemented using the PyPTO pl.function frontend, showcasing a common deep learning mechanism.
  • Four Core Kernels: The implementation is broken down into four distinct kernels: qk_matmul (Q @ K^T), softmax_prepare (scale, rowmax, exp, rowsum), pv_matmul (P @ V), and online_update (online softmax accumulation + fused normalization).
  • Conditional Control Flow Demonstration: The online_update kernel demonstrates the use of Python native if/else statements combined with pl.yield_() to achieve SSA phi node semantics for conditional control flow within PyPTO (a hypothetical sketch of the pattern follows this list).
  • MLIR Compilation and Testing: The example includes functionality to compile the Paged Attention program to MLIR (Multi-Level Intermediate Representation) and run basic verification tests to ensure all four kernels are correctly generated.
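
For readers unfamiliar with that pattern, a sketch of the shape it takes is below. The exact pl.yield_() signature and the values around it are assumptions based on this summary, not code from the PR:

    # Hypothetical illustration only: each branch arm yields its value,
    # so the assignment behaves like an SSA phi node merging the two arms.
    if is_first_block:
        o_acc = pl.yield_(o_new)
    else:
        o_acc = pl.yield_(o_merged)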


Changelog
  • examples/ir_parser/paged_attention.py
    • Added a new file implementing Paged Attention using PyPTO's pl.function frontend.
    • Defined the PagedAttention class as a pl.program containing four pl.function kernels: qk_matmul, softmax_prepare, pv_matmul, and online_update.
    • Implemented qk_matmul for query-key matrix multiplication, including loading, moving to L0A/L0B with transpose, and matrix multiplication.
    • Implemented softmax_prepare for scaling, row-wise maximum, exponential, and row-wise sum operations.
    • Implemented pv_matmul for probability-value matrix multiplication.
    • Implemented online_update to handle online softmax accumulation and fused normalization, demonstrating conditional logic with if/else and pl.yield_() for SSA semantics.
    • Included a main execution block (if __name__ == "__main__":) for parsing arguments to compile the program to MLIR or run verification tests.

gemini-code-assist bot left a comment


Code Review

The pull request introduces a comprehensive example of Paged Attention using the PyPTO pl.function frontend. It effectively demonstrates the breakdown of a complex operation into four distinct kernels, showcasing the use of pl.Tensor, pl.Scalar, and various pl.op.block operations. The implementation also provides a good illustration of conditional control flow with pl.yield_() for SSA phi node semantics. The inclusion of a main block for compilation and testing adds significant value by providing a runnable demonstration of the framework's capabilities.

# Copyright (c) PyPTO Contributors.
# Paged Attention implementation using PyPTO
#
# Reference: /data/w00949583/simpler/examples/paged_attention_sim/kernels/

medium

The reference path /data/w00949583/simpler/examples/paged_attention_sim/kernels/ appears to be an absolute path specific to a local development environment. For better portability and maintainability, consider making this a relative path within the repository or a more generic reference if it's intended for public consumption.

Comment on lines +18 to +20
TILE_M = 16 # num_heads tile size
TILE_N = 16 # block_size / head_dim_chunk tile size
SCALE = 0.0884 # 1/sqrt(head_dim) = 1/sqrt(128) ~ 0.0884

medium

The SCALE constant is derived from 1/sqrt(head_dim) = 1/sqrt(128). It would improve clarity and maintainability if head_dim were defined as a constant (e.g., HEAD_DIM = 128) and SCALE were calculated from it. This makes the relationship explicit and easier to modify if head_dim changes.

Suggested change
 TILE_M = 16  # num_heads tile size
 TILE_N = 16  # block_size / head_dim_chunk tile size
-SCALE = 0.0884  # 1/sqrt(head_dim) = 1/sqrt(128) ~ 0.0884
+HEAD_DIM = 128
+SCALE = 1 / (HEAD_DIM**0.5)  # 1/sqrt(head_dim)
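
As a quick arithmetic check on the constant (plain Python, nothing PyPTO-specific):

    import math
    print(1 / math.sqrt(128))  # 0.08838834764831845, matching the ~0.0884 in the comment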

li_updated = pl.op.block.add(li_scaled, lij_scaled)

# Update accumulated output: oi = alpha * oi + beta * oi_new
# Use row_expand_mul for broadcasting [M,1] * [M,N]

medium

The comment # Use row_expand_mul for broadcasting [M,1] * [M,N] is slightly misleading. row_expand_mul performs element-wise multiplication where the [M,1] vector is broadcast across the columns of the [M,N] tile. Clarifying the comment to reflect this broadcasting behavior would be helpful.

Suggested change
-# Use row_expand_mul for broadcasting [M,1] * [M,N]
+# Use row_expand_mul for broadcasting [M,1] across [M,N]
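
In plain NumPy terms (an analogy for the intended semantics, not the row_expand_mul implementation), the broadcast the comment should describe looks like:

    import numpy as np

    alpha = np.ones((4, 1)) * 2.0          # [M, 1] per-row scale factors
    tile = np.arange(32.0).reshape(4, 8)   # [M, N] tile
    out = alpha * tile                     # alpha's single column is broadcast
                                           # across all N columns of the tile
    assert out.shape == (4, 8)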

Comment on lines +249 to +251
import os
os.makedirs("/data/w00949583/pypto/build_output", exist_ok=True)
output_path = "/data/w00949583/pypto/build_output/paged_attention.mlir"

medium

The output directory and filename for the generated MLIR code are hardcoded. To make this example more flexible and reusable in different environments or for different test cases, consider making these paths configurable, perhaps through command-line arguments or environment variables.

Suggested change
-import os
-os.makedirs("/data/w00949583/pypto/build_output", exist_ok=True)
-output_path = "/data/w00949583/pypto/build_output/paged_attention.mlir"
+# Save to file
+import os
+output_dir = "./build_output"
+os.makedirs(output_dir, exist_ok=True)
+output_filename = "paged_attention.mlir"
+output_path = os.path.join(output_dir, output_filename)
 with open(output_path, "w") as f:
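
Taking the suggestion one step further, the path could come from the command line. A minimal sketch, assuming an --output flag that is not part of this PR:

    import argparse
    import os

    parser = argparse.ArgumentParser()
    parser.add_argument("--output", default="./build_output/paged_attention.mlir",
                        help="where to write the generated MLIR")
    args = parser.parse_args()

    mlir_text = "..."  # stands in for the compiled MLIR module (hypothetical name)
    os.makedirs(os.path.dirname(args.output) or ".", exist_ok=True)
    with open(args.output, "w") as f:
        f.write(mlir_text)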

lyfne123 changed the title from "Add: Paged attention example using pl.function frontend syntax" to "feat(op): Paged attention example using pl.function frontend syntax" on Feb 6, 2026