[KernelGen][Nvidia] Add special_softmax operator with Triton kernel by bwbwzzz · Pull Request #3575 · flagos-ai/FlagGems

bwbwzzz · 2026-05-28T10:53:36Z

Summary

Adds a Triton kernel for the special_softmax operator, which applies the softmax function to an n-dimensional input tensor.
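For reference, the math the operator computes can be sketched in plain Python. This is not the Triton kernel from this PR. It is a minimal illustration that assumes softmax is taken over the innermost dimension (the PR text does not say which dimension is used), and `softmax_last_dim` is a hypothetical helper name.

```python
import math

def softmax_last_dim(x):
    """Numerically stable softmax over the innermost dimension of a nested list.

    Assumption: reduction runs over the last dimension; the PR does not
    state which dimension special_softmax uses.
    """
    # Recurse until we reach a flat list of numbers (the innermost dimension).
    if x and isinstance(x[0], list):
        return [softmax_last_dim(row) for row in x]
    # Subtract the row maximum so that exp() cannot overflow.
    m = max(x)
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]

# Each row sums to 1; the second row would overflow without the max shift.
rows = softmax_last_dim([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
```

Subtracting the per-row maximum leaves the result unchanged mathematically but keeps the exponentials in range. A GPU kernel would typically do the same max and sum reductions per row before normalizing.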

Test command (run on NVIDIA H20): pytest benchmark/test_special_softmax.py --level core

Configuration	Torch Latency (ms)	Gems Latency (ms)	Speedup	TFLOPS
(4096, 4096), float32	0.094	0.036	2.62	—
(1024, 1024, 1024), float32	2.681	1.596	1.68	—
(64, 512, 512), float32	0.047	0.043	1.10	—
(4096, 4096), float16	0.093	0.045	2.05	—
(1024, 1024, 1024), float16	2.975	2.204	1.35	—
Arithmetic Mean (15 cases)	—	—	1.42	—
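The Speedup column appears to be the ratio of Torch latency to Gems latency. The short sketch below recomputes it from the five rows listed above, under that assumption. Because the latencies shown are already rounded, the recomputed ratios can differ from the table in the last digit. The mean over these five rows also differs from the reported 1.42, which presumably covers all 15 benchmark cases rather than only the rows shown.

```python
# Latencies in milliseconds, copied from the table above.
rows = [
    ("(4096, 4096), float32", 0.094, 0.036),
    ("(1024, 1024, 1024), float32", 2.681, 1.596),
    ("(64, 512, 512), float32", 0.047, 0.043),
    ("(4096, 4096), float16", 0.093, 0.045),
    ("(1024, 1024, 1024), float16", 2.975, 2.204),
]

# Assumption: speedup = Torch latency / Gems latency.
speedups = [torch_ms / gems_ms for _, torch_ms, gems_ms in rows]

# Arithmetic mean over the five listed rows only.
mean_speedup = sum(speedups) / len(speedups)

for (name, _, _), s in zip(rows, speedups):
    print(f"{name}: {s:.2f}x")
print(f"mean over listed rows: {mean_speedup:.2f}x")
```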

Backend	Accuracy Test	Benchmark	Speedup (mean)	Notes
Nvidia (H20)	PASS	PASS (15 cases, --level core)	1.42	Primary
Tianshu	PASS	PASS (15 cases)	1.641	—
Muxi	PASS	PASS (15 cases)	1.014	—
Ascend	PASS	N/A	—	—
Hygon	PASS	PASS (15 cases)	1.291	—

[KernelGen][Nvidia] Add special_softmax operator with Triton kernel

67bb15f

bwbwzzz requested review from 0x45f, bin913, douxetpur, huangyiqun and w1120029931-bit as code owners May 28, 2026 10:53

github-actions Bot added benchmark ops/aten core tests size/Medium KernelGen labels May 28, 2026