[KernelGen][Nvidia] Add special_softmax operator with Triton kernel #3575

Open
bwbwzzz wants to merge 1 commit into flagos-ai:master from bwbwzzz:pr/nv-special_softmax

@bwbwzzz bwbwzzz commented May 28, 2026

Summary

Adds a Triton kernel for the special_softmax operator, which applies the softmax function along a given dimension of an n-dimensional input tensor.
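
For readers who want the math spelled out, here is a minimal plain-Python sketch of numerically stable softmax over the last dimension. It is not the Triton kernel from this PR; `softmax_1d` and `softmax_rows` are hypothetical names used only for illustration, and the max-subtraction step is the standard stable formulation, assumed rather than taken from the PR's source.

```python
import math


def softmax_1d(xs):
    """Numerically stable softmax over a flat list of floats (illustrative helper)."""
    m = max(xs)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_rows(matrix):
    """Apply softmax along the last dimension of a 2-D nested list (illustrative helper)."""
    return [softmax_1d(row) for row in matrix]


# The second row would overflow a naive exp(x) / sum(exp(x)) implementation.
probs = softmax_rows([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
```

Each output row sums to 1, and the large-valued row stays finite because only differences from the row maximum are exponentiated.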

Testing

  • Parametrized tests over dim and dtype
  • Validated against PyTorch reference on device
  • Tested on: Nvidia, Tianshu, Muxi, Ascend, Hygon

Performance

Test command: pytest benchmark/test_special_softmax.py --level core (NVIDIA H20)

| Configuration | Torch Latency (ms) | Gems Latency (ms) | Speedup | TFLOPS |
|---|---|---|---|---|
| (4096, 4096), float32 | 0.094 | 0.036 | 2.62 | |
| (1024, 1024, 1024), float32 | 2.681 | 1.596 | 1.68 | |
| (64, 512, 512), float32 | 0.047 | 0.043 | 1.10 | |
| (4096, 4096), float16 | 0.093 | 0.045 | 2.05 | |
| (1024, 1024, 1024), float16 | 2.975 | 2.204 | 1.35 | |
| Arithmetic Mean | | | 1.42 | |
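
As a quick sanity check on the figures above, the sketch below recomputes each speedup, assuming Speedup is defined as Torch latency divided by Gems latency (the PR does not state the formula). The latencies are copied from the table; `speedup` and `rows` are hypothetical names. Small differences from the reported values are expected because the latencies are shown rounded to three decimals.

```python
# (configuration, torch_ms, gems_ms, reported_speedup), copied from the table above
rows = [
    ("(4096, 4096), float32", 0.094, 0.036, 2.62),
    ("(1024, 1024, 1024), float32", 2.681, 1.596, 1.68),
    ("(64, 512, 512), float32", 0.047, 0.043, 1.10),
    ("(4096, 4096), float16", 0.093, 0.045, 2.05),
    ("(1024, 1024, 1024), float16", 2.975, 2.204, 1.35),
]


def speedup(torch_ms, gems_ms):
    """Assumed definition: baseline latency divided by new latency."""
    return torch_ms / gems_ms


# Recompute each ratio and record how far it is from the reported value.
recomputed = {cfg: speedup(t, g) for cfg, t, g, _ in rows}
deviations = {cfg: abs(recomputed[cfg] - rep) for cfg, _, _, rep in rows}

for cfg, _, _, rep in rows:
    print(f"{cfg}: recomputed {recomputed[cfg]:.2f}, reported {rep:.2f}")
```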

Multi-backend Testing

| Backend | Accuracy Test | Benchmark | Speedup (mean) | Notes |
|---|---|---|---|---|
| Nvidia (H20) | PASS | PASS (15 cases, --level core) | 1.42 | Primary |
| Tianshu | PASS | PASS (15 cases) | 1.641 | |
| Muxi | PASS | PASS (15 cases) | 1.014 | |
| Ascend | PASS | N/A | | |
| Hygon | PASS | PASS (15 cases) | 1.291 | |

Files Changed

  • src/flag_gems/ops/special_softmax.py: Triton kernel implementation
  • tests/test_special_softmax.py: Accuracy test
  • benchmark/test_special_softmax.py: Performance benchmark
  • src/flag_gems/ops/__init__.py: Register import and __all__
  • src/flag_gems/__init__.py: Register to _FULL_CONFIG
  • conf/operators.yaml: Add operator entry (kind: Math, stage: alpha 5.1)
