
Conversation

@mutichung

Summary

Originally opened as llm-compressor#2112.

This PR introduces a new script to convert AutoAWQ checkpoints into a compressed-tensors-compatible format under modifiers/awq. Resolves llm-compressor#2087.

Usage

  • Via CLI:

    python -m compressed_tensors.converters.autoawq \
      --model-name-or-path /path/to/model \
      --output-dir /path/to/compressed/model \
      --quantization-format naive-quantized
  • Via Python:

    from compressed_tensors.converters.autoawq import load_and_convert_from_autoawq
    
    awq_model_path = "/path/to/model"  # can also be a model ID on the Hugging Face Hub
    model = load_and_convert_from_autoawq(awq_model_path)
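
After conversion, the returned model can presumably be persisted with the usual transformers API; the snippet below is a hypothetical follow-up (the CLI path with --output-dir is the documented way to write the converted checkpoint):

    # Hypothetical: assumes the converted model behaves like a standard
    # transformers PreTrainedModel, so the usual save API applies.
    model.save_pretrained("/path/to/compressed/model")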

Known Issue

Asymmetric Support in compressed-tensors

  • AutoAWQ's GEMM version only supports asymmetric quantization [1] (a toy sketch of the scheme follows this list).
    • An AssertionError is raised even when zero_point=False is set.
  • Support for zero-point decompression in PackedQuantizationCompressor is a work in progress [2].
  • 2025/12/15 update: zero-point decompression was merged in [3] but reverted shortly afterwards [4].
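
A minimal sketch of the asymmetric (zero-point) scheme these checkpoints use, in plain PyTorch; everything below is illustrative and not a compressed-tensors API:

    import torch

    # Toy 4-bit asymmetric quantization of a single group of weights.
    bits = 4
    w = torch.randn(128)

    qmin, qmax = 0, 2**bits - 1
    scale = (w.max() - w.min()) / (qmax - qmin)
    zero_point = torch.round(-w.min() / scale).clamp(qmin, qmax)

    q = torch.clamp(torch.round(w / scale) + zero_point, qmin, qmax)
    # Decompression must subtract the zero point before rescaling; this is the
    # step PackedQuantizationCompressor still needs to support.
    w_hat = (q - zero_point) * scale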

Test Plan

  • Created tests comparing output logits between AutoAWQ and compressed-tensors (see the sketch after this list).
    • The logits do not satisfy torch.testing.assert_close at default tolerances, possibly due to the GEMM kernel's internal precision.
  • Ran benchmarks comparing AutoAWQForCausalLM and vLLM.
    • Using a compressed-tensors build based on [3].
  • Created tests comparing benchmark results between AutoAWQ and compressed-tensors checkpoints.
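
A rough sketch of the logits comparison; it assumes a CUDA device, that the AutoAWQ wrapper exposes its underlying transformers model as .model, and that the tolerances are arbitrary placeholders:

    import torch
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer
    from compressed_tensors.converters.autoawq import load_and_convert_from_autoawq

    path = "/path/to/awq/model"
    device = "cuda"  # the AWQ GEMM kernels require a GPU

    tokenizer = AutoTokenizer.from_pretrained(path)
    inputs = tokenizer("The quick brown fox", return_tensors="pt").to(device)

    ref = AutoAWQForCausalLM.from_quantized(path, fuse_layers=False)
    converted = load_and_convert_from_autoawq(path).to(device)

    with torch.no_grad():
        ref_logits = ref.model(**inputs).logits   # .model: the wrapped transformers model
        conv_logits = converted(**inputs).logits

    # Strict assert_close fails, so loosen the tolerances for a sanity check.
    torch.testing.assert_close(ref_logits, conv_logits, atol=1e-1, rtol=1e-2)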

ruikangliu/DeepSeek-R1-Distill-Qwen-1.5B-quantized.awq-autoawq-w4g128

Format            Inference Backend   ARC-Easy   ARC-Challenge
AutoAWQ           hf                  0.6435     0.3584
naive-quantized   hf                  0.6431     0.3584
packed-quantized  hf                  0.6431     0.3584
packed-quantized  vllm                0.6427     0.3592

AMead10/Llama-3.2-3B-Instruct-AWQ

Format            Inference Backend   ARC-Easy   ARC-Challenge
AutoAWQ           hf                  0.7976     0.5017
naive-quantized   hf                  0.7971     0.5026
packed-quantized  hf                  0.7971     0.5026
packed-quantized  vllm                0.7976     0.5043

fbaldassarri/mistralai_Mistral-7B-Instruct-v0.3-autoawq-int4-gs128-asym

Format            Inference Backend   ARC-Easy   ARC-Challenge
AutoAWQ           hf                  0.8641     0.6280
naive-quantized   hf                  0.8645     0.6280
packed-quantized  hf                  0.8645     0.6280
packed-quantized  vllm                0.8649     0.6280

Future Work

  • Support other AutoAWQ versions, e.g., GEMV.
  • Set default quantization format to packed-quantized once asymmetric decompression is finalized.
  • Replace AutoModelForCausalLM with a more generalized autoclass.

Footnotes

  1. awq/modules/linear/gemm.py#L187

  2. llm-compressor#1704

  3. fix qparams decompression #514

  4. Revert "fix qparams decompression (#514)" #527

Signed-off-by: GitHub <noreply@github.com>
@mutichung force-pushed the feature/convert-autoawq branch from b5efafa to 8590695 on December 19, 2025 at 02:35.
Comment on lines +110 to +113
# Unpack the qweight and qzeros tensors
iweight, izeros = unpack_awq(qweight, qzeros, bits)
# Reverse the order of the iweight and izeros tensors
iweight, izeros = reverse_awq_order(iweight, izeros, bits)

@mutichung (Author) commented on Dec 19, 2025:

@brian-dellabetta Since auto-round is not a dependency of compressed-tensors, should we keep a copy of the utilities for unpacking and dequantization instead of importing from auto-round?

@brian-dellabetta (Collaborator) replied:

@mutichung yes, I think that's a good idea. compressed-tensors is our library for storing, loading, and converting compressed model checkpoint formats, so this seems like a reasonable set of helpers to add. It's basically the same reason we want to move it out of llm-compressor, which is scoped to just the code and flows needed to run model compression.
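
For context, a rough sketch of the kind of unpacking helper under discussion; it is illustrative only (the real unpack_awq also has to undo AWQ's interleaved packing order, which reverse_awq_order handles):

    import torch

    def unpack_int32_to_nibbles(packed: torch.Tensor, bits: int = 4) -> torch.Tensor:
        """Extract the 32 // bits low-to-high bit fields packed into each int32."""
        pack_factor = 32 // bits
        shifts = torch.arange(0, 32, bits, dtype=torch.int32, device=packed.device)
        # (rows, cols) -> (rows, cols, pack_factor): shift each packed word by
        # 0, bits, 2*bits, ... and mask off everything above the low `bits` bits.
        fields = (packed.unsqueeze(-1) >> shifts) & ((1 << bits) - 1)
        return fields.reshape(packed.shape[0], packed.shape[1] * pack_factor)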

@brian-dellabetta (Collaborator)

Thanks @mutichung for moving this over! I replied to your message; your suggestion to move the logic here and avoid a compressed-tensors dependency on auto-round makes sense to me.


Development

Successfully merging this pull request may close these issues.

[Feature Request][Help Wanted] Convert AutoAWQ checkpoints to compressed-tensors
