Tags: basetenlabs/TensorRT-LLM
Mistral kv quant calibration fixes (#2)
`examples/llama/hf_llama_convert.py -i cache/baseten_v0.6.1_20231206/repos/mistralai/Mistral-7B-v0.1/1a2e76/dst -o /tmp/tmp9frqc6v6/c5740d/dst --calibrate-kv-cache -t fp16` fails with errors similar to the smooth quant errors (#1), because NumPy does not support bfloat16.
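
As a minimal illustration of the root cause (not taken from the convert script itself): calling `.numpy()` on a bfloat16 torch tensor raises a `TypeError`, which is what surfaces during KV-cache calibration when the checkpoint weights are bfloat16.

```python
import torch

# Mistral-7B weights are stored as bfloat16; NumPy has no bfloat16 dtype,
# so converting such a tensor to a NumPy array fails.
t = torch.zeros(4, dtype=torch.bfloat16)
t.numpy()  # raises TypeError (unsupported ScalarType BFloat16)
```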
Mistral smooth quant fixes (#1)
* `hf_llama_convert.py` converts torch tensors to NumPy in many places, and NumPy does not support bfloat16. Add explicit conversions to get around that (see the sketch below).
* Fix an issue with loading quantized weights where the calculation looked off.
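
A sketch of the kind of explicit conversion the fix refers to; `cast_to_numpy` and the float16 target dtype are illustrative choices, not the actual code in `hf_llama_convert.py`.

```python
import numpy as np
import torch

def cast_to_numpy(t: torch.Tensor, dtype: torch.dtype = torch.float16) -> np.ndarray:
    """Convert a (possibly bfloat16) torch tensor to a NumPy array.

    NumPy cannot represent bfloat16, so cast to a supported dtype first.
    """
    if t.dtype == torch.bfloat16:
        t = t.to(dtype)
    return t.detach().cpu().numpy()

# Example: a bfloat16 weight tensor round-trips through NumPy as float16.
w = torch.randn(8, 8, dtype=torch.bfloat16)
print(cast_to_numpy(w).dtype)  # float16
```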