
Question: intentional FP16-only path for int8_vectorwise_quant / LLM.int8 activation quant? (BF16 support + removing casts) #1868

@sanghyunna

Description

Hi guys, first of all, incredible work👍

Just a quick design question about the LLM.int8 activation quantization path.

Context

In MatMul8bitLt.forward, activations A are always cast to FP16 before quantization:

# bitsandbytes/autograd/_functions.py :: MatMul8bitLt.forward
CA, SCA, outlier_cols = F.int8_vectorwise_quant(A.to(torch.float16), threshold=state.threshold)

and the CUDA kernel implementation currently hard-requires FP16:

# bitsandbytes/backends/cuda/ops.py
@register_kernel("bitsandbytes::int8_vectorwise_quant", "cuda")
def _(A, threshold=0.0):
    torch._check(A.dtype == torch.float16, ...)
    lib.cint8_vector_quant(get_ptr(A), ...)

On the native side, the exported ABI and launcher are also half-only:

  • csrc/pythonInterface.cpp: cint8_vector_quant(half* A, ...)
  • csrc/ops.cu: int8VectorQuant(half* A, ...)
  • csrc/kernels.cu: instantiations only for half
So BF16 inputs incur an extra bf16 -> fp16 cast (plus a warning), even though the rest of the pipeline tries to preserve the output dtype (int8_scaled_mm(..., dtype=A.dtype)).
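For reference, here is a pure-PyTorch sketch of what row-wise (vectorwise) absmax int8 quantization computes. This is not the actual CUDA kernel: the function name `int8_vectorwise_quant_ref` is my own, and it ignores the outlier/threshold path entirely. It only illustrates that the core math is dtype-agnostic, so both FP16 and BF16 inputs go through the same reduction:

```python
import torch

def int8_vectorwise_quant_ref(A: torch.Tensor):
    """Hypothetical reference for row-wise absmax int8 quantization.

    Not the bitsandbytes kernel: the outlier/threshold path is omitted;
    this only shows the dtype-agnostic core of the quantization math.
    """
    # Per-row absmax scale, accumulated in fp32 for stability.
    absmax = A.abs().amax(dim=-1, keepdim=True).float()  # [rows, 1]
    scale = absmax.clamp(min=1e-12) / 127.0
    # Round-to-nearest into the int8 range.
    CA = torch.round(A.float() / scale).clamp(-127, 127).to(torch.int8)
    return CA, absmax.squeeze(-1)

# The same code path handles fp16 and bf16 inputs without any pre-cast:
A16 = torch.randn(4, 8, dtype=torch.float16)
CA16, S16 = int8_vectorwise_quant_ref(A16)
CAbf, Sbf = int8_vectorwise_quant_ref(A16.to(torch.bfloat16))
```

Since the reduction is just a per-row absmax in fp32, the half-only restriction looks like a matter of which kernel template instantiations were compiled rather than anything fundamental to the algorithm.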

Question

Was the FP16-only design for int8_vectorwise_quant / LLM.int8 activation quantization intentional (e.g. for kernel simplicity, CUB reduction constraints, arch compatibility), or is it mainly an unimplemented gap?

I’m asking because many LLM inference/training frameworks default to BF16 activations these days, and this path currently forces an FP16 cast just for quantization.
