🚀 The feature, motivation and pitch
#24914 moved query quantization out of the FlashAttention backend into attention/layer, so that it can be fused by torch.compile and included in the CUDA graph.
#26534 did the same for the Triton and FlashInfer backends.
The same change should be made for all remaining attention backends.
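For illustration, here is a minimal, self-contained sketch of the pattern (hypothetical class and function names, not vLLM's actual code): the attention layer quantizes the query inside the torch.compile region, and the backend only consumes the already-quantized tensor, so the quantization can be fused and captured in the CUDA graph.

```python
import torch
import torch.nn.functional as F


def quantize_query_fp8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Per-tensor FP8 (e4m3) quantization stand-in for the query."""
    return (q / scale).clamp(min=-448.0, max=448.0).to(torch.float8_e4m3fn)


class ToyBackend:
    """Stands in for an attention backend; it receives an already-quantized query."""

    def forward(self, q_quant, q_scale, k, v):
        # The backend only dequantizes and runs attention; it no longer
        # quantizes the query itself, so nothing here has to live outside
        # the compiled graph.
        q = q_quant.to(k.dtype) * q_scale
        return F.scaled_dot_product_attention(q, k, v)


class ToyAttentionLayer(torch.nn.Module):
    """Stands in for attention/layer with quantization hoisted out of the backend."""

    def __init__(self, backend):
        super().__init__()
        self.backend = backend
        self.register_buffer("q_scale", torch.tensor(1.0))

    def forward(self, q, k, v):
        # Query quantization now happens here, inside the torch.compile region,
        # so it can be fused with neighboring ops and captured in the CUDA graph.
        q_quant = quantize_query_fp8(q, self.q_scale)
        return self.backend.forward(q_quant, self.q_scale, k, v)


layer = ToyAttentionLayer(ToyBackend())
q = k = v = torch.randn(1, 8, 16, 64)
out = layer(q, k, v)
```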
Alternatives
No response
Additional context
No response