GPTQModel v0.9.9
What's Changed
Added Llama-3.1 support, Gemma2 27B quantized-inference support via vLLM, and automatic pad_token normalization; fixed auto-round quant compatibility with vLLM/SGLang.
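Quantized Gemma2 27B (and Llama-3.1) checkpoints can now be served through vLLM. A minimal sketch using vLLM's offline `LLM` API; the quantized repo id below is a hypothetical placeholder, not a checkpoint shipped with this release:

```python
from vllm import LLM, SamplingParams

# Load a GPTQ-quantized checkpoint with vLLM's offline inference engine.
llm = LLM(
    model="ModelCloud/gemma-2-27b-it-gptq-4bit",  # hypothetical repo id
    quantization="gptq",
)
params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Why is the sky blue?"], params)
print(outputs[0].outputs[0].text)
```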
- [CI] by @CSY-ModelCloud in #238, #236, #237, #241, #242, #243, #246, #247, #250
- [FIX] explicitly call torch.no_grad() by @LRL-ModelCloud in #239
- Bitblas update by @Qubitium in #249
- [FIX] calibration average calculation when the calibration dataset arg is passed as tensors by @Qubitium, @LRL-ModelCloud in #254, #258
- [MODEL] Gemma2 27B can now load with vLLM by @LRL-ModelCloud in #257
- [OPTIMIZE] optimize vLLM inference by setting environment variable 'VLLM_ATTENTI… by @LRL-ModelCloud in #260
- [FIX] hard-set batch_size to 1 for transformers 4.43.0 due to a compat regression by @LRL-ModelCloud in #279
- [FIX] vLLM Llama-3.1 support by @Qubitium in #280
- Use better default values for quantization config by @Qubitium in #281 (sketch after this list)
- [REFACTOR] Clean up backend and model_type usage by @LRL-ModelCloud in #276
- [FIX] allow auto_round lm_head quantization by @LRL-ModelCloud in #282
- [FIX] [MODEL] Llama-3.1-8B-Instruct's eos_token_id is a list by @CSY-ModelCloud in #284 (pad_token sketch after this list)
- [FIX] add release_vllm_model and import destroy_model_parallel inside it by @LRL-ModelCloud in #288 (sketch after this list)
- [FIX] auto-round quants compat with vLLM/SGLang by @Qubitium in #287
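A few illustrative sketches for the items above. First, the quantization-config defaults (#281): a hedged sketch of a minimal quantization run with the config values spelled out explicitly. The API shape follows GPTQModel's v0.9.x README; the model id is a placeholder, the single calibration row only keeps the sketch short, and the tensor format follows the calibration fixes in #254/#258:

```python
from gptqmodel import GPTQModel, QuantizeConfig
from transformers import AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
# A real run needs a few hundred calibration rows; one keeps the sketch short.
calibration = [tokenizer("GPTQModel quantizes LLMs with GPTQ.", return_tensors="pt")]

quant_config = QuantizeConfig(bits=4, group_size=128)  # spell values out explicitly
model = GPTQModel.from_pretrained(model_id, quant_config)
model.quantize(calibration)
model.save_quantized("Meta-Llama-3.1-8B-Instruct-gptq-4bit")
```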
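Second, the auto pad_token normalization and the list-valued eos_token_id fix (#284) amount to the pattern below; a hedged sketch of the idea, not GPTQModel's exact code:

```python
from transformers import AutoConfig, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
config = AutoConfig.from_pretrained(model_id)

if tokenizer.pad_token_id is None:
    eos = tokenizer.eos_token_id or config.eos_token_id
    # Llama-3.1's config stores eos_token_id as a list; normalize to a single id.
    tokenizer.pad_token_id = eos[0] if isinstance(eos, (list, tuple)) else eos
```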
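Third, the shape of release_vllm_model from #288, assuming vLLM's destroy_model_parallel import path; a sketch of the idea, not the shipped implementation:

```python
import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

def release_vllm_model(llm):
    """Tear down a vLLM engine so its GPU memory can be reclaimed."""
    destroy_model_parallel()  # release tensor/pipeline-parallel process groups
    del llm
    gc.collect()
    torch.cuda.empty_cache()
```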
Full Changelog: v0.9.8...v0.9.9