Describe the bug
When quantizing a Qwen3-32B model with AWQ, the run requires more than 1.2 TB of CPU memory during the step: "_calibrate | INFO - Running AWQModifier calibration with 250 samples..."
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
55135 root 20 0 1205.8g 1.1t 1.1g R 99.7 53.7 9:35.79 python3
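The resident-memory figure above was read from top. A minimal sketch for logging the same number from inside the process (assuming the psutil package is installed; log_rss is a hypothetical helper, not part of llm-compressor):
import os
import psutil

def log_rss(tag=""):
    # Resident set size of the current process in GiB (the RES column in top).
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    print(f"[mem] {tag} RSS = {rss_gib:.1f} GiB")
Calling log_rss() before and after oneshot() shows where the growth happens.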
Expected behavior
AWQ calibration of a 32B model should complete within a far smaller CPU memory budget; resident memory should not exceed 1.2 TB.
Environment
Include all relevant environment information:
- OS: Ubuntu 22.04
- Python version: 3.10.12
- LLM Compressor version or commit hash: 0.5.1
- ML framework version(s) [e.g. torch 2.3.1]:
- Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
- Other relevant environment information [e.g. hardware, CUDA version]:
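The framework and other package versions were not filled in above; a quick way to collect them (a sketch using importlib.metadata, with an assumed list of package names):
from importlib.metadata import version, PackageNotFoundError

# Hypothetical set of packages worth reporting for this issue.
for pkg in ("torch", "transformers", "llmcompressor", "compressed-tensors", "numpy"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")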
To Reproduce
Exact steps to reproduce the behavior:
My code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from datasets import load_dataset
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
# Select model and load it.
MODEL_ID = "/path/to/qwen3-32b/model"
# Select calibration dataset - using custom dataset
DATASET_FILE = "calib.jsonl"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 250
MAX_SEQUENCE_LENGTH = 1200
# Function to format the data consistently
def format_data(example):
    # Convert the conversation format to text
    if example.get('messages'):
        # Format conversation messages into text
        text_parts = []
        for msg in example['messages']:
            role = msg['role']
            content = msg['content'].strip()
            text_parts.append(f"<|role|>{role}<|says|>{content}<|end|>")
        text = '\n'.join(text_parts)
    elif example.get('ctx'):
        # Context + generation format
        text = example['ctx'] + example['gen']
    elif example.get('txt'):
        # Simple text format
        text = example['txt']
    else:
        assert False, "Unknown format in example: {}".format(example)
    print("TEXT:", text)
    return {"text": text}
# Load dataset using datasets library
ds = load_dataset('json', data_files=DATASET_FILE, split='train')
# Take only the number of samples we need and shuffle
ds = ds.shuffle(seed=42).select(range(min(NUM_CALIBRATION_SAMPLES, len(ds))))
# Apply formatting function
ds = ds.map(format_data)
# Configure the quantization algorithm to run.
# NOTE: vllm currently does not support asym MoE, using symmetric here
recipe = [
    AWQModifier(bits=4, symmetric=False),
    QuantizationModifier(
        ignore=["lm_head", "norm", "embed_tokens"],
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.GROUP,
                    group_size=128,
                ),
            )
        },
    ),
]
SAVE_DIR = MODEL_ID + "-awq2"
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir=SAVE_DIR,
)
tokenizer.save_pretrained(SAVE_DIR)
print('Done!')
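For scale, a rough estimate under the unverified assumption that calibration caches per-sample hidden states in fp32 on CPU, and assuming a hidden size of 5120 for Qwen3-32B; this only illustrates the order of magnitude, not how AWQModifier actually stores activations:
samples, seq_len, hidden = 250, 1200, 5120  # hidden size assumed for Qwen3-32B
cache_gib = samples * seq_len * hidden * 4 / 1024**3  # fp32 bytes -> GiB
print(f"~{cache_gib:.1f} GiB per cached hidden-state tensor")  # ~5.7 GiB
Holding many such tensors at once (one or more per decoder layer) would plausibly reach the TB range observed.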
Errors
If applicable, add a full print-out of any errors or exceptions that are raised or include screenshots to help explain your problem.
Additional context
Add any other context about the problem here. Also include any relevant files.
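For reference, format_data above accepts three record shapes in calib.jsonl; illustrative (not actual) records can be generated like this:
import json

# Hypothetical records matching the three branches of format_data.
samples = [
    {"messages": [{"role": "user", "content": "Hello"},
                  {"role": "assistant", "content": "Hi there!"}]},
    {"ctx": "The capital of France is", "gen": " Paris."},
    {"txt": "A plain text calibration sample."},
]

with open("calib.jsonl", "w", encoding="utf-8") as f:
    for record in samples:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")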