Highlights
- support automatic mixed bits assignment by @wenhuach21 in #851 (see the sketch after this list)
- optimize rtn for int woq by @wenhuach21 in #924
- support for ModelScope by @n1ck-guo in #957
- enhance auto device map and support XPU by @xin3he in #961
- support for immediate saving to reduce RAM usage by @Kaihui-intel in #965
- update gguf alg ext by @n1ck-guo in #1026
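
The headline feature is automatic mixed bits assignment (#851), exposed through the new AutoScheme API. Below is a minimal sketch of how it might be driven; it assumes the `AutoScheme`/`AutoRound` entry points with `avg_bits` and `options` arguments as documented in the project README, and the model id, candidate schemes, and output directory are placeholders:

```python
# Sketch of automatic mixed-bits assignment (#851). Assumes the AutoScheme/
# AutoRound API from the project README; argument names may differ between
# releases, so treat this as illustrative rather than canonical.
from auto_round import AutoRound, AutoScheme

model_name = "Qwen/Qwen3-8B"  # placeholder model id

# Let AutoScheme search per-layer bit widths so the average lands near
# 3 bits, picking each layer's config from the candidate schemes below.
scheme = AutoScheme(avg_bits=3.0, options=("W2A16", "W4A16", "W8A16"))

ar = AutoRound(model=model_name, scheme=scheme)
# Export; per highlight #965, immediate saving aims to cut peak RAM here.
ar.quantize_and_save(output_dir="./Qwen3-8B-mixed", format="auto_round")
```

Exact argument names are release-dependent; the AutoScheme readme refined in #958 is the authoritative reference.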
What's Changed
- Fix rtn tuning_device issue by @Kaihui-intel in #893
- fix vlm gguf ut by @n1ck-guo in #895
- update alg_ext.abi3.so with python compatible version by @chensuyue in #894
- move ste from quant to round for nvfp4 by @xin3he in #889
- Add GPT-OSS quant support by @yiliu30 in #887
- improve help message printing by @n1ck-guo in #883
- speedup quant and evaluation, fix recompile issue by @xin3he in #897
- fix nvfp act quantization bug by @WeiweiZhang1 in #891
- support automatic mixed bits assignment by @wenhuach21 in #851
- try to fix gguf vram issue on windows by @wenhuach21 in #886
- remove numba from requirements by @yiliu30 in #905
- Extend mxfp loading dtypes by @yiliu30 in #907
- block dataset logger info by @n1ck-guo in #908
- fix torch compile issue in AutoScheme by @wenhuach21 in #909
- Revert "Extend mxfp loading dtypes" by @wenhuach21 in #915
- support disable_opt_rtn in auto-scheme by @wenhuach21 in #913
- fix llama 4 ut by @n1ck-guo in #896
- Add numba for cpu lib by @yiliu30 in #919
- Loosen the packing restrictions for mxfp&nvfp by @WeiweiZhang1 in #911
- Extend mxfp loading dtypes by @yiliu30 in #916
- Fix act config exporting for mixed schemes by @WeiweiZhang1 in #903
- optimize rtn for int woq by @wenhuach21 in #924
- fix gguf bug and add support for LiquidAI/LFM2-1.2B by @n1ck-guo in #927
- remove numpy<2.0 limitation by @xin3he in #921
- enable regex quantization config saving for mixed bits by @WeiweiZhang1 in #825
- Fix Flux tuning issue by @mengniwang95 in #936
- gguf support for inclusionAI/Ling-flash-2.0 by @n1ck-guo in #940
- remove low_cpu_mem by @n1ck-guo in #934
- Add compatibility test by @XuehaoSun in #918
- Add commit hash to version by @XuehaoSun in #941
- align gguf weight types (output.weight, token_embed) with the original model by @n1ck-guo in #900
- support attention mask in user's dataset by @wenhuach21 in #930
- Add diffusion README by @mengniwang95 in #923
- update readme by @wenhuach21 in #949
- refactor utils file by @n1ck-guo in #943
- update readme for sglang support by @WeiweiZhang1 in #953
- update gguf and support for CompressedLinear by @n1ck-guo in #950
- Reduce AutoScheme VRAM usage by up to 10X by @wenhuach21 in #944
- add self attribution and fix avg_bits error by @xin3he in #956
- add logo by @wenhuach21 in #960
- refine AutoScheme readme/code by @wenhuach21 in #958
- update readme by @wenhuach21 in #962
- fix critical disable_opt_rtn regression by @wenhuach21 in #963
- [1/N] Initial vllm-ext evaluation support (MXFP4 MOE) by @yiliu30 in #935
- fix bug where imatrix contains 0 by @n1ck-guo in #955
- fix rtn bug by @mengniwang95 in #966
- enhance flux doc by @mengniwang95 in #967
- clean code by @wenhuach21 in #968
- support for ModelScope by @n1ck-guo in #957
- merge main branch to alg_ext by @wenhuach21 in #970
- fix cuda CI backend issue, fix typo by @WeiweiZhang1 in #974
- disable compile packing by default by @yiliu30 in #975
- enhance auto device map and support XPU by @xin3he in #961
- refine readme by @wenhuach21 in #978
- CLI support for model as a positional argument by @n1ck-guo in #979
- update bits in UT by @xin3he in #986
- fix gguf scheme and device_map bug by @n1ck-guo in #969
- add support for Magistral-Small by @n1ck-guo in #980
- support model_dtype and fix bugs with schemes containing quotes and mllm eval by @n1ck-guo in #985
- fix bug where adam compressor cannot be created by @n1ck-guo in #992
- [CI] Update python to 3.12 and torch to 2.8.0 by @XuehaoSun in #741
- fix lm head bug and rm clear_mem_reach_threhold by @wenhuach21 in #997
- Reduce peak gpu memory usage and support moe estimation by @xin3he in #981
- fix cuda ut bug by @n1ck-guo in #999
- fix mllm device_map ut by @Kaihui-intel in #1000
- refine md tables by @WeiweiZhang1 in #994
- Refine exllamav2 ut by @WeiweiZhang1 in #1001
- Support for immediate saving to reduce RAM usage by @Kaihui-intel in #965
- Fix diffusion multi-device ut issue by @mengniwang95 in #1002
- fix multiple devices map issue in calibration by @wenhuach21 in #1003
- Fix non auto device map by @WeiweiZhang1 in #1005
- fix multiple devices issue in Compressor and AutoScheme by @wenhuach21 in #1007
- fix cuda low_cpu_mem_usage ut by @Kaihui-intel in #1010
- Fix param missing bug by @mengniwang95 in #1008
- add device list to clear memory by @wenhuach21 in #1009
- Minor refactor for LLMC by @yiliu30 in #993
- fix one clear memory issue by @wenhuach21 in #1011
- add ut for gguf alg_ext and update so file by @n1ck-guo in #1012
- fix multi cuda ut bug by @n1ck-guo in #1014
- Include auto_scheme.default_alg in the PyPI package by @chensuyue in #1018
- add num_device check for set_auto_device_map_for_block_with_tuning by @xin3he in #1021
- dispatch model with real max memory by @xin3he in #1022
- fix cuda ut by @n1ck-guo in #1020
- disable itrex format first by @WeiweiZhang1 in #998
- fix bugs in lm_head, dispatch model, and gguf eval by @n1ck-guo in #1025
- Fix the missing temporary name by @yiliu30 in #1029
- Reduce mem usage of GPT-OSS by @yiliu30 in #1013
- update gguf alg ext by @n1ck-guo in #1026
- optimize vram for gguf and add momentum by @wenhuach21 in #1031
- fix incorrect model name in readme by @wenhuach21 in #1035
- Bump version to v0.9.0 by @XuehaoSun in #1024
Full Changelog: v0.8.0...v0.9.0