🚀 The feature, motivation and pitch
Similar to the recent discoveries in #18844, vectorizing our quantization methods can have a huge impact on e2e performance. Currently we only use vectorization.h in csrc/quantization/fp8/common.cuh and csrc/quantization/fused_kernels/layernorm_utils.cuh, so we should expand it to more implementations, such as csrc/quantization/compressed_tensors/int8_quant_kernels.cu, for faster INT8 activation quantization.
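To illustrate the idea, here is a minimal host-side C++ sketch of what a vectorized INT8 quantization loop looks like: each step loads four floats and stores four packed int8 values, so on the GPU the same pattern turns scalar loads/stores into wide 128-bit transactions. The `Float4`/`Char4` structs and `quantize_int8_vec4` are hypothetical stand-ins for CUDA's `float4`/`char4` and the helpers in vectorization.h, not the actual kernel code.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Hypothetical 4-wide vector types standing in for CUDA's float4/char4.
struct Float4 { float x, y, z, w; };
struct Char4  { int8_t x, y, z, w; };

// Round-to-nearest quantization of one element, clamped to int8 range.
static inline int8_t quant_one(float v, float inv_scale) {
  float q = std::nearbyint(v * inv_scale);
  q = std::min(127.0f, std::max(-128.0f, q));
  return static_cast<int8_t>(q);
}

// Vectorized INT8 activation quantization: process 4 elements per step
// (on the GPU this body would sit inside a grid-stride loop), then handle
// the n % 4 tail with scalar code.
void quantize_int8_vec4(const float* in, int8_t* out, int n, float scale) {
  const float inv_scale = 1.0f / scale;
  const Float4* in4 = reinterpret_cast<const Float4*>(in);
  Char4* out4 = reinterpret_cast<Char4*>(out);
  const int n4 = n / 4;
  for (int i = 0; i < n4; ++i) {
    Float4 v = in4[i];  // one wide load instead of four scalar loads
    out4[i] = {quant_one(v.x, inv_scale), quant_one(v.y, inv_scale),
               quant_one(v.z, inv_scale), quant_one(v.w, inv_scale)};
  }
  for (int i = n4 * 4; i < n; ++i) {  // scalar tail
    out[i] = quant_one(in[i], inv_scale);
  }
}
```

The math per element is unchanged; the win comes purely from widening the memory accesses, which is why the same vectorization.h helpers should transfer cleanly to the INT8 kernels.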
Alternatives
No response
Additional context
No response