[KernelGen][Nvidia] Add rot90 operator with Triton kernel#3556

Open
XDYuanzhuLee wants to merge 1 commit into flagos-ai:master from XDYuanzhuLee:pr/rot90

XDYuanzhuLee commented May 28, 2026

Summary

Adds a Triton kernel implementation of the rot90 operator (torch.rot90) for 2D tensors.
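For readers unfamiliar with the operator, the index mapping that any rot90 kernel has to realise can be sketched in plain Python. This is an illustrative reference only, not the Triton kernel from this PR; the function name `rot90_2d` is hypothetical, and it assumes the usual torch.rot90 convention that positive k rotates counter-clockwise from dim 0 towards dim 1.

```python
def rot90_2d(grid, k=1):
    """Rotate a 2D list-of-lists counter-clockwise by k * 90 degrees.

    Hypothetical reference helper, not the PR's kernel. Assumes a
    rectangular, non-ragged input with dims = [0, 1].
    """
    h = len(grid)
    w = len(grid[0]) if h else 0
    k %= 4  # negative k is normalised into 0..3 as well

    if k == 0:
        # No rotation: return a copy.
        return [row[:] for row in grid]
    if k == 1:
        # Output is (w, h): out[i][j] = in[j][w - 1 - i]
        return [[grid[j][w - 1 - i] for j in range(h)] for i in range(w)]
    if k == 2:
        # Output is (h, w): out[i][j] = in[h - 1 - i][w - 1 - j]
        return [[grid[h - 1 - i][w - 1 - j] for j in range(w)] for i in range(h)]
    # k == 3, output is (w, h): out[i][j] = in[h - 1 - j][i]
    return [[grid[h - 1 - j][i] for j in range(h)] for i in range(w)]


example = [[1, 2, 3],
           [4, 5, 6]]
rotated_once = rot90_2d(example, 1)   # shape changes from 2x3 to 3x2
```

Each output element is read from exactly one input element, so a GPU kernel can assign one output position (or tile of positions) per program and compute the source offset with the same arithmetic.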

Testing

  • Validated against reference on device via to_reference(inp, True)
  • Tested on: Nvidia, Tianshu, Muxi, Ascend, Hygon (Hygon currently fails; see Multi-backend Testing)

Performance

Test command (run on an NVIDIA H20): `pytest benchmark/test_rot90.py --level core`

| dtype | Size | k | dims | Torch Latency (ms) | Gems Latency (ms) | Speedup |
|---|---|---|---|---|---|---|
| float16 | [64, 64] | 1 | [0, 1] | 0.011264 | 0.007168 | 1.571 |
| float16 | [128, 128] | 1 | [0, 1] | 0.010240 | 0.008192 | 1.250 |
| float16 | [256, 256] | 1 | [0, 1] | 0.010240 | 0.009216 | 1.111 |
| float16 | [512, 512] | 1 | [0, 1] | 0.010240 | 0.010240 | 1.000 |
| float16 | [1024, 1024] | 1 | [0, 1] | 0.014336 | 0.017408 | 0.824 |
| float16 | [2048, 2048] | 1 | [0, 1] | 0.026624 | 0.046080 | 0.578 |
| float16 | [100, 200] | 1 | [0, 1] | 0.010240 | 0.008192 | 1.250 |
| float16 | [200, 400] | 1 | [0, 1] | 0.010240 | 0.008192 | 1.250 |
| float16 | [400, 800] | 1 | [0, 1] | 0.011264 | 0.011264 | 1.000 |
| float32 | [64, 64] | 1 | [0, 1] | 0.009216 | 0.008192 | 1.125 |
| float32 | [128, 128] | 1 | [0, 1] | 0.010240 | 0.008192 | 1.250 |
| float32 | [256, 256] | 1 | [0, 1] | 0.010240 | 0.008192 | 1.250 |
| float32 | [512, 512] | 1 | [0, 1] | 0.011264 | 0.011264 | 1.000 |
| float32 | [1024, 1024] | 1 | [0, 1] | 0.016384 | 0.020480 | 0.800 |
| float32 | [2048, 2048] | 1 | [0, 1] | 0.036864 | 0.058368 | 0.632 |
| float32 | [100, 200] | 1 | [0, 1] | 0.010240 | 0.008192 | 1.250 |
| float32 | [200, 400] | 1 | [0, 1] | 0.010240 | 0.009216 | 1.111 |
| float32 | [400, 800] | 1 | [0, 1] | 0.012288 | 0.012288 | 1.000 |
| bfloat16 | [64, 64] | 1 | [0, 1] | 0.009216 | 0.007168 | 1.286 |
| bfloat16 | [128, 128] | 1 | [0, 1] | 0.009216 | 0.008192 | 1.125 |
| bfloat16 | [256, 256] | 1 | [0, 1] | 0.010240 | 0.009216 | 1.111 |
| bfloat16 | [512, 512] | 1 | [0, 1] | 0.011264 | 0.010240 | 1.100 |
| bfloat16 | [1024, 1024] | 1 | [0, 1] | 0.015360 | 0.017408 | 0.882 |
| bfloat16 | [2048, 2048] | 1 | [0, 1] | 0.027648 | 0.045056 | 0.614 |
| bfloat16 | [100, 200] | 1 | [0, 1] | 0.010240 | 0.008192 | 1.250 |
| bfloat16 | [200, 400] | 1 | [0, 1] | 0.010240 | 0.009216 | 1.111 |
| bfloat16 | [400, 800] | 1 | [0, 1] | 0.011264 | 0.011264 | 1.000 |
| Arithmetic Mean | | | | | | 1.075 |
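The Speedup column above appears to be Torch latency divided by Gems latency, so values below 1.0 mean the Triton kernel is slower than the PyTorch baseline. The short sketch below recomputes it for a few of the float16 rows under that assumption and flags the sizes that regress; the variable and function names are illustrative and are not part of the benchmark harness.

```python
# (size, torch_latency_ms, gems_latency_ms) copied from the float16 rows above.
rows = [
    ("[64, 64]", 0.011264, 0.007168),
    ("[512, 512]", 0.010240, 0.010240),
    ("[1024, 1024]", 0.014336, 0.017408),
    ("[2048, 2048]", 0.026624, 0.046080),
]


def speedup(torch_ms, gems_ms):
    """Assumed definition: baseline latency over kernel latency."""
    return torch_ms / gems_ms


speedups = {size: round(speedup(t, g), 3) for size, t, g in rows}

# Sizes where the Triton kernel is slower than the baseline.
regressions = [size for size, s in speedups.items() if s < 1.0]

for size, s in speedups.items():
    print(f"{size:>14}  speedup={s:.3f}")
print("slower than torch:", regressions)
```

On these rows the kernel is ahead or at parity up to 512 x 512 and falls behind from 1024 x 1024 upward, which is the same pattern the float32 and bfloat16 rows show.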

Multi-backend Testing

| Backend | Accuracy Test | Speedup (mean) | Notes |
|---|---|---|---|
| Nvidia (H20) | PASS (27 cases) | 1.075 | Primary |
| Tianshu | PASS | 1.242 | |
| Muxi | PASS | 1.151 | |
| Ascend | PASS | | |
| Hygon | FAIL | 0.911 | error: operand #0 does not dominate this use |

Files Changed

  • src/flag_gems/ops/rot90.py: Triton kernel implementation
  • tests/test_rot90.py: Accuracy test
  • benchmark/test_rot90.py: Performance benchmark
  • src/flag_gems/ops/__init__.py: Register import and __all__
  • src/flag_gems/__init__.py: Register to _FULL_CONFIG
  • conf/operators.yaml: Add operator entry (kind: Math, stage: alpha 5.1)

Add Triton kernel implementation for torch.rot90 on 2D tensors.
Includes accuracy tests and performance benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
XDYuanzhuLee changed the title from "[Operator] Add rot90" to "[KernelGen][Nvidia] Add rot90 operator with Triton kernel" on May 28, 2026