Dear LLMC team,
I've been trying to run mixed-precision PTQ with RTN. I suspect there's a bug, as the non-default bit settings in `mix_bits` appear to be ignored.
My understanding of the code:
- In method `get_act_qparams()` of `rtn.py`, the values of `qmax`/`qmin`/`scales`/`zeros` are determined using the default quantizer's bit precision.
- These values are registered as `buf_act_<xxx>` buffers for all modules/layers.
- At inference time, in method `a_qdq()` of `rtn.py`, even though the `aquantizer` object of each layer is configured correctly, it blindly loads the registered quantization parameters `qmin`/`qmax`/`scales`/`zeros` from the buffers and uses them, instead of values matching the layer's configured bit width (see the sketch below).
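
To make the mismatch concrete, here is a minimal self-contained PyTorch sketch. The helpers `compute_act_qparams` and `fake_quant` and the asymmetric min/max scheme are just my illustration, not the actual `rtn.py` code; the point is only that qparams computed once at the default bit width (say 4 bits) are not the ones a layer set to 8 bits in `mix_bits` should be using.

```python
import torch


def compute_act_qparams(x: torch.Tensor, n_bits: int):
    # Illustrative asymmetric per-tensor qparams (not the llmc implementation).
    qmin, qmax = 0, 2 ** n_bits - 1
    xmin, xmax = x.min(), x.max()
    scale = (xmax - xmin).clamp(min=1e-8) / (qmax - qmin)
    zero = torch.round(-xmin / scale) + qmin
    return qmin, qmax, scale, zero


def fake_quant(x: torch.Tensor, qmin, qmax, scale, zero):
    # Quantize-dequantize with the given parameters.
    q = torch.clamp(torch.round(x / scale) + zero, qmin, qmax)
    return (q - zero) * scale


torch.manual_seed(0)
x = torch.randn(4, 16)

# Analogous to the buf_act_<xxx> buffers: computed once with the default
# quantizer's bit width (assumed 4 bits here) and reused for every layer.
buffered = compute_act_qparams(x, n_bits=4)

# What a layer configured for 8 bits via mix_bits should actually use.
recomputed = compute_act_qparams(x, n_bits=8)

print("error with buffered 4-bit qparams:  ",
      (x - fake_quant(x, *buffered)).abs().mean().item())
print("error with recomputed 8-bit qparams:",
      (x - fake_quant(x, *recomputed)).abs().mean().item())
```

If my reading is right, I'd expect either `get_act_qparams()` to register per-layer buffers computed with each layer's own `aquantizer`, or `a_qdq()` to recompute/override the qparams for layers covered by `mix_bits`.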
What do you think?
Thanks in advance!