[KernelGen][Nvidia] Add rot90 operator with Triton kernel by XDYuanzhuLee · Pull Request #3556 · flagos-ai/FlagGems

XDYuanzhuLee · 2026-05-28T05:50:14Z

Summary

Adds a Triton kernel implementation of torch.rot90 for 2D tensors.
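For readers unfamiliar with the operator, the index mapping that a 2D rot90 has to implement can be sketched in plain Python. This is not the kernel from this PR (the kernel source is not shown here); `rot90_2d` is a hypothetical helper that follows the semantics of `torch.rot90(x, k, [0, 1])`, rotating from dim 0 toward dim 1.

```python
def rot90_2d(x, k=1):
    """Rotate a 2D list-of-lists by k * 90 degrees, from dim 0 toward dim 1.

    Illustrative reference only; the name and signature are assumptions.
    """
    m, n = len(x), len(x[0])
    k %= 4  # negative k wraps around, e.g. -1 behaves like 3
    if k == 0:
        return [row[:] for row in x]
    if k == 1:
        # Output shape is (n, m): out[i][j] = x[j][n - 1 - i]
        return [[x[j][n - 1 - i] for j in range(m)] for i in range(n)]
    if k == 2:
        # Output shape is (m, n): out[i][j] = x[m - 1 - i][n - 1 - j]
        return [[x[m - 1 - i][n - 1 - j] for j in range(n)] for i in range(m)]
    # k == 3, output shape is (n, m): out[i][j] = x[m - 1 - j][i]
    return [[x[m - 1 - j][i] for j in range(m)] for i in range(n)]


print(rot90_2d([[0, 1], [2, 3]]))        # [[1, 3], [0, 2]]
print(rot90_2d([[1, 2, 3], [4, 5, 6]]))  # [[3, 6], [2, 5], [1, 4]]
```

In a GPU kernel, each output element would typically gather from the source index given by these formulas, so the reads or the writes end up strided for odd k.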

Testing

Validated against reference on device via to_reference(inp, True)
Tested on: Nvidia, Tianshu, Muxi, Ascend, Hygon (the accuracy test currently fails on Hygon; see Multi-backend Testing)

Performance

Test command (NVIDIA H20): pytest benchmark/test_rot90.py --level core

dtype	Size	k	dims	Torch Latency (ms)	Gems Latency (ms)	Speedup
float16	[64, 64]	1	[0, 1]	0.011264	0.007168	1.571
float16	[128, 128]	1	[0, 1]	0.010240	0.008192	1.250
float16	[256, 256]	1	[0, 1]	0.010240	0.009216	1.111
float16	[512, 512]	1	[0, 1]	0.010240	0.010240	1.000
float16	[1024, 1024]	1	[0, 1]	0.014336	0.017408	0.824
float16	[2048, 2048]	1	[0, 1]	0.026624	0.046080	0.578
float16	[100, 200]	1	[0, 1]	0.010240	0.008192	1.250
float16	[200, 400]	1	[0, 1]	0.010240	0.008192	1.250
float16	[400, 800]	1	[0, 1]	0.011264	0.011264	1.000
float32	[64, 64]	1	[0, 1]	0.009216	0.008192	1.125
float32	[128, 128]	1	[0, 1]	0.010240	0.008192	1.250
float32	[256, 256]	1	[0, 1]	0.010240	0.008192	1.250
float32	[512, 512]	1	[0, 1]	0.011264	0.011264	1.000
float32	[1024, 1024]	1	[0, 1]	0.016384	0.020480	0.800
float32	[2048, 2048]	1	[0, 1]	0.036864	0.058368	0.632
float32	[100, 200]	1	[0, 1]	0.010240	0.008192	1.250
float32	[200, 400]	1	[0, 1]	0.010240	0.009216	1.111
float32	[400, 800]	1	[0, 1]	0.012288	0.012288	1.000
bfloat16	[64, 64]	1	[0, 1]	0.009216	0.007168	1.286
bfloat16	[128, 128]	1	[0, 1]	0.009216	0.008192	1.125
bfloat16	[256, 256]	1	[0, 1]	0.010240	0.009216	1.111
bfloat16	[512, 512]	1	[0, 1]	0.011264	0.010240	1.100
bfloat16	[1024, 1024]	1	[0, 1]	0.015360	0.017408	0.882
bfloat16	[2048, 2048]	1	[0, 1]	0.027648	0.045056	0.614
bfloat16	[100, 200]	1	[0, 1]	0.010240	0.008192	1.250
bfloat16	[200, 400]	1	[0, 1]	0.010240	0.009216	1.111
bfloat16	[400, 800]	1	[0, 1]	0.011264	0.011264	1.000
Arithmetic Mean	—	—	—	—	—	1.075
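The Speedup column is consistent with Torch latency divided by Gems latency, so values below 1.0 mean the Triton kernel is slower than the PyTorch baseline. A minimal sketch that recomputes it for three rows copied from the table above (the `speedup` helper and the `rows` layout are illustrative assumptions, not part of the benchmark harness):

```python
# (dtype, size, torch_latency_ms, gems_latency_ms), copied from the table above
rows = [
    ("float16", (64, 64), 0.011264, 0.007168),
    ("float16", (2048, 2048), 0.026624, 0.046080),
    ("float32", (1024, 1024), 0.016384, 0.020480),
]


def speedup(torch_ms, gems_ms):
    """Baseline latency over candidate latency; > 1.0 means Gems is faster."""
    return torch_ms / gems_ms


for dtype, size, torch_ms, gems_ms in rows:
    print(f"{dtype} {list(size)}: {speedup(torch_ms, gems_ms):.3f}")
```

Read this way, the kernel is ahead of or on par with PyTorch up to 512 x 512 and falls behind at 1024 x 1024 and 2048 x 2048 for all three dtypes.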

Multi-backend Testing

Backend	Accuracy Test	Speedup (mean)	Notes
Nvidia (H20)	PASS (27 cases)	1.075	Primary
Tianshu	PASS	1.242	—
Muxi	PASS	1.151	—
Ascend	PASS	—	—
Hygon	FAIL	0.911	error: operand #0 does not dominate this use

Files Changed

src/flag_gems/ops/rot90.py: Triton kernel implementation
tests/test_rot90.py: Accuracy test
benchmark/test_rot90.py: Performance benchmark
src/flag_gems/ops/__init__.py: Register import and __all__
src/flag_gems/__init__.py: Register to _FULL_CONFIG
conf/operators.yaml: Add operator entry (kind: Math, stage: alpha 5.1)

Add Triton kernel implementation for torch.rot90 on 2D tensors. Includes accuracy tests and performance benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

[Operator] Add rot90 operator

bd11416

XDYuanzhuLee requested review from 0x45f, bin913, douxetpur, huangyiqun and w1120029931-bit as code owners May 28, 2026 05:50

github-actions Bot added benchmark ops/aten core tests size/Medium labels May 28, 2026

XDYuanzhuLee changed the title ~~[Operator] Add rot90~~ [KernelGen][Nvidia] Add rot90 operator with Triton kernel May 28, 2026

XDYuanzhuLee wants to merge 1 commit into flagos-ai:master from XDYuanzhuLee:pr/rot90
