Hi guys, first of all, incredible work👍
Just a quick design question about the LLM.int8 activation quantization path.
### Context
In `MatMul8bitLt.forward`, activations `A` are always cast to FP16 before quantization:

```python
# bitsandbytes/autograd/_functions.py :: MatMul8bitLt.forward
CA, SCA, outlier_cols = F.int8_vectorwise_quant(A.to(torch.float16), threshold=state.threshold)
```
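For reference, here is a minimal dtype-generic sketch (my own illustration, not the library's code) of what row-wise absmax int8 quantization computes. The real kernel additionally splits out columns whose absmax exceeds `threshold` as outliers, which this sketch omits; the point is only that the core math has no inherent FP16 dependency:

```python
import torch

def int8_vectorwise_quant_ref(A: torch.Tensor):
    """Hypothetical reference: row-wise absmax quantization to int8.

    Works for fp16 and bf16 alike; computes in fp32 for the reduction.
    Omits the LLM.int8 outlier-column split controlled by `threshold`.
    """
    absmax = A.abs().amax(dim=-1, keepdim=True).float()  # per-row absmax
    scale = absmax.clamp(min=1e-12) / 127.0              # avoid div-by-zero
    CA = torch.round(A.float() / scale).clamp(-127, 127).to(torch.int8)
    return CA, scale.squeeze(-1)

A = torch.tensor([[1.0, -2.0, 4.0]], dtype=torch.bfloat16)
CA, scale = int8_vectorwise_quant_ref(A)
# CA -> [[32, -64, 127]]  (4.0 maps to 127; the rest scale by 127/4)
```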
and the CUDA kernel implementation currently hard-requires FP16:

```python
# bitsandbytes/backends/cuda/ops.py
@register_kernel("bitsandbytes::int8_vectorwise_quant", "cuda")
def _(A, threshold=0.0):
    torch._check(A.dtype == torch.float16, ...)
    lib.cint8_vector_quant(get_ptr(A), ...)
```
On the native side, the exported ABI and launcher are also half-only:

- `csrc/pythonInterface.cpp`: `cint8_vector_quant(half* A, ...)`
- `csrc/ops.cu`: `int8VectorQuant(half* A, ...)`
- `csrc/kernels.cu`: kernel instantiations only for `half`
So BF16 inputs incur an extra `bf16 -> fp16` cast (plus a warning), even though the rest of the pipeline tries to preserve the output dtype (`int8_scaled_mm(..., dtype=A.dtype)`).
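One practical consequence of the forced cast (a minimal illustration of dtype ranges, not code from the library): BF16 shares FP32's 8-bit exponent, while FP16's largest finite value is about 65504, so BF16 activations with very large outlier magnitudes saturate to `inf` on the way into quantization:

```python
import torch

# BF16 has FP32's exponent range; FP16 tops out near 65504.
a = torch.tensor([3.0, -70000.0], dtype=torch.bfloat16)

# The cast applied before quantization overflows the large element to -inf.
print(a.to(torch.float16))  # the -70000.0 entry becomes -inf
```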
### Question
Was the FP16-only design for `int8_vectorwise_quant` / LLM.int8 activation quantization intentional (e.g. for kernel simplicity, CUB reduction constraints, or arch compatibility), or is it mainly an unimplemented gap?
I'm asking because many LLM inference/training frameworks now default to BF16 activations, while this path currently forces an FP16 cast just for quantization.