This is a tracker issue for all the different ways we can accelerate training / inference with activation sparsity in TorchAO.
## Inference
- Accelerate memory-bound bs=1 decode use cases with a selective weight-loading kernel, like the ones described in TEAL / CATS (see the sketch below).

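As a reference point, here is a minimal PyTorch sketch of the idea (the function name and threshold are mine, not from TEAL / CATS): for a bs=1 matvec, only the weight columns corresponding to nonzero activations contribute to the output, so a fused kernel can skip loading the rest from memory.

```python
import torch

def selective_matvec(W: torch.Tensor, x: torch.Tensor, threshold: float = 0.1) -> torch.Tensor:
    """Eager-mode sketch of TEAL / CATS-style selective weight loading."""
    # Magnitude thresholding: treat small activations as exactly zero.
    mask = x.abs() > threshold
    # Only the weight columns of surviving activations contribute to the
    # output; a real kernel would skip loading W[:, ~mask] from memory
    # entirely. This eager version still materializes the gather, so it
    # demonstrates the math rather than the speedup.
    return W[:, mask] @ x[mask]
```
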
- Accelerate compute-bound bs=n prefill use cases with 2:4 activation sparsity, as we outlined in https://arxiv.org/pdf/2503.16672
- Add fast fused sparsification + fp8 rowwise + srelu kernels (2:4 activation sparsity packing kernels #2012); an unfused reference sketch follows this list.
  - David also apparently has a Triton kernel that does this, so we should benchmark both and see which is faster.
- Add rowwise-fp8 + 2:4 sparse CUTLASS kernel (Add CUTLASS-based row-wise scaled sparse FP8 kernel #1671)
- Add performance tuning configs for the above kernel (Add config selection for row-wise scaled FP8 sparse CUTLASS-based kernel #1940)
- Add transposed support to the rowwise-fp8 sparse CUTLASS kernel. The above kernel assumes that the weight is 2:4 sparse. Since 2:4 sparsity is only supported for the first operand, I'm using the fact that $xW^T = (Wx^T)^T$ to use the kernel for activation sparsity, but this means the output of the kernel is in col-major format instead of row-major (a short layout demo follows below).
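
For reference, here is an unfused eager-mode sketch of what a fused sparsification + fp8 rowwise + srelu kernel needs to compute (the function name and the fp8 dtype choice are my assumptions; the real kernel fuses all three steps into a single pass over the activations):

```python
import torch

def srelu_fp8_rowwise_24_reference(x: torch.Tensor):
    """Unfused reference: squared ReLU -> 2:4 prune -> rowwise fp8 quant."""
    # Squared ReLU, which naturally drives most activations to zero.
    x = torch.relu(x).square()
    # 2:4 sparsification: keep the 2 largest values in every group of 4.
    g = x.reshape(-1, 4)
    idx = g.topk(2, dim=-1).indices  # all values are >= 0 after squared ReLU
    mask = torch.zeros_like(g, dtype=torch.bool).scatter_(-1, idx, True)
    x = (g * mask).reshape(x.shape)
    # Rowwise fp8 scaling: one scale per row so each row uses the full range.
    fp8_max = torch.finfo(torch.float8_e4m3fn).max
    scale = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / fp8_max
    return (x / scale).to(torch.float8_e4m3fn), scale
```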
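
The layout issue from the last item can be demonstrated in a few lines: computing $xW^T$ through the transposed product leaves the result with col-major strides, so downstream row-major kernels need an extra copy, which transposed support would eliminate.

```python
import torch

x = torch.randn(8, 16)   # activations (these would be the 2:4 sparse operand)
W = torch.randn(32, 16)  # dense weight

out = (W @ x.t()).t()    # same values as x @ W.t()
assert torch.allclose(out, x @ W.t(), rtol=1e-4, atol=1e-5)
print(out.is_contiguous())      # False: the result has col-major strides
out_rm = out.contiguous()       # the extra copy a transposed kernel avoids
```
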
## Training
- Activation compression to accelerate 2:4 sparse training (#1920, activation sparsity + compression #2076) has an implementation that I need to benchmark / review; a sketch of the idea follows this list.
- Implement the custom sparse training kernels outlined in our ICLR paper. Lower priority for now.
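
For context on the first item, here is a minimal sketch of the compression idea as I understand it (helper names are mine, not from #2076): once activations are 2:4 sparse, the backward pass only needs the two kept values per group of four plus their positions, roughly halving activation memory. A real implementation would pack the indices into 2-bit metadata; this just shows the structure.

```python
import torch

def compress_24(x: torch.Tensor):
    """Store only the 2 kept values per group of 4, plus their positions."""
    # Assumes x is already 2:4 sparse (at most 2 nonzeros per group of 4).
    g = x.reshape(-1, 4)
    idx = g.abs().topk(2, dim=-1).indices   # positions of the kept values
    vals = g.gather(-1, idx)                # the kept values themselves
    return vals, idx.to(torch.uint8)

def decompress_24(vals: torch.Tensor, idx: torch.Tensor, shape) -> torch.Tensor:
    """Scatter the saved values back into a dense tensor for backward."""
    g = torch.zeros(vals.shape[0], 4, dtype=vals.dtype, device=vals.device)
    g.scatter_(-1, idx.long(), vals)
    return g.reshape(shape)
```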