GPTQModel v0.9.9
What's Changed
Added Llama-3.1 support, Gemma2 27B quantized-inference support via vLLM, and automatic pad_token normalization; fixed auto-round quant compatibility with vLLM/SGLang.
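Quantized Gemma2 27B (and Llama-3.1) checkpoints can now be served through vLLM. A minimal sketch using vLLM's offline `LLM` API; the quantized repo id below is a hypothetical placeholder, not a checkpoint shipped with this release:

```python
from vllm import LLM, SamplingParams

# Load a GPTQ-quantized checkpoint with vLLM's offline inference engine.
llm = LLM(
    model="ModelCloud/gemma-2-27b-it-gptq-4bit",  # hypothetical repo id
    quantization="gptq",
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```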
- [CI] by @CSY-ModelCloud in #238, #236, #237, #241, #242, #243, #246, #247, #250
- [FIX] explicitly call torch.no_grad() by @LRL-ModelCloud in #239
- Bitblas update by @Qubitium in #249
- [FIX] calibration average calculation when the calibration dataset arg is passed as tensors by @Qubitium, @LRL-ModelCloud in #254, #258
- [MODEL] Gemma2 27B can now load with vLLM by @LRL-ModelCloud in #257
- [OPTIMIZE] optimize vLLM inference by setting environment variable 'VLLM_ATTENTI… by @LRL-ModelCloud in #260
- [FIX] hard-set batch_size to 1 for transformers 4.43.0 due to a compat regression by @LRL-ModelCloud in #279
- [FIX] vLLM Llama-3.1 support by @Qubitium in #280
- Use better default values for quantization config by @Qubitium in #281 (sketch after this list)
- [REFACTOR] Clean up backend and model_type usage by @LRL-ModelCloud in #276
- [FIX] allow auto_round lm_head quantization by @LRL-ModelCloud in #282
- [FIX] [MODEL] Llama-3.1-8B-Instruct's eos_token_id is a list by @CSY-ModelCloud in #284 (pad_token sketch after this list)
- [FIX] add release_vllm_model and import destroy_model_parallel inside it by @LRL-ModelCloud in #288 (sketch after this list)
- [FIX] auto-round quants compat with vLLM/SGLang by @Qubitium in #287
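A few illustrative sketches for the items above. First, the quantization-config defaults (#281): a hedged sketch of a minimal quantization run with the config values spelled out explicitly. The API shape follows GPTQModel's v0.9.x README; the model id is a placeholder, the single calibration row only keeps the sketch short, and the tensor format follows the calibration fixes in #254/#258:

```python
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A real run needs a few hundred calibration rows; one keeps the sketch short.
calibration = [tokenizer("GPTQModel quantizes LLMs with GPTQ.", return_tensors="pt")]

quant_config = QuantizeConfig(bits=4, group_size=128)  # spell values out explicitly
model = GPTQModel.from_pretrained(model_id, quant_config)
model.quantize(calibration)
model.save_quantized("Meta-Llama-3.1-8B-Instruct-gptq-4bit")
```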
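Second, the auto pad_token normalization and the list-valued eos_token_id fix (#284) amount to the pattern below; a hedged sketch of the idea, not GPTQModel's exact code:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

if tokenizer.pad_token_id is None:
    eos = tokenizer.eos_token_id or config.eos_token_id
    # Llama-3.1's config stores eos_token_id as a list; normalize to a single id.
    tokenizer.pad_token_id = eos[0] if isinstance(eos, (list, tuple)) else eos
```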
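Third, the shape of release_vllm_model from #288, assuming vLLM's destroy_model_parallel import path; a sketch of the idea, not the shipped implementation:

```python
import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

def release_vllm_model(llm):
    """Tear down a vLLM engine so its GPU memory can be reclaimed."""
    destroy_model_parallel()  # release tensor/pipeline-parallel process groups
    del llm
    gc.collect()
    torch.cuda.empty_cache()
```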
Full Changelog: v0.9.8...v0.9.9