Describe the bug
When quantizing a Qwen3-32B model with AWQ, the run requires more than 1.2 TB of CPU memory during the step: "_calibrate | INFO - Running AWQModifier calibration with 250 samples..."
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
55135 root 20 0 1205.8g 1.1t 1.1g R 99.7 53.7 9:35.79 python3
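The resident-memory figure above was read from top. A minimal sketch for logging the same number from inside the process (assuming the psutil package is installed; log_rss is a hypothetical helper, not part of llm-compressor):
import os
import psutil

def log_rss(tag=""):
    # Resident set size of the current process in GiB (the RES column in top).
    rss_gib = psutil.Process(os.getpid()).memory_info().rss / 1024**3
    print(f"[mem] {tag} RSS = {rss_gib:.1f} GiB")
Calling log_rss() before and after oneshot() shows where the growth happens.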
Expected behavior
AWQ calibration of a 32B model should complete within a far smaller CPU memory budget; resident memory should not exceed 1.2 TB.
Environment
Include all relevant environment information:
- OS: Ubuntu 22.04
- Python version: 3.10.12
- LLM Compressor version or commit hash: 0.5.1
- ML framework version(s) [e.g. torch 2.3.1]:
- Other Python package versions [e.g. vLLM, compressed-tensors, numpy, ONNX]:
- Other relevant environment information [e.g. hardware, CUDA version]:
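The framework and other package versions were not filled in above; a quick way to collect them (a sketch using importlib.metadata, with an assumed list of package names):
from importlib.metadata import version, PackageNotFoundError

# Hypothetical set of packages worth reporting for this issue.
for pkg in ("torch", "transformers", "llmcompressor", "compressed-tensors", "numpy"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")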
To Reproduce
Exact steps to reproduce the behavior:
My code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from datasets import load_dataset
from llmcompressor.modifiers.quantization import QuantizationModifier
from compressed_tensors.quantization import (
    QuantizationArgs,
    QuantizationScheme,
    QuantizationStrategy,
    QuantizationType,
)
# Select model and load it.
MODEL_ID = "/path/to/qwen3-32b/model"
# Select calibration dataset - using custom dataset
DATASET_FILE = "calib.jsonl"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
# Select number of samples. 256 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 250
MAX_SEQUENCE_LENGTH = 1200
# Function to format the data consistently
def format_data(example):
    # Convert the conversation format to text
    if example.get('messages'):
        # Format conversation messages into text
        text_parts = []
        for msg in example['messages']:
            role = msg['role']
            content = msg['content'].strip()
            text_parts.append(f"<|role|>{role}<|says|>{content}<|end|>")
        text = '\n'.join(text_parts)
    elif example.get('ctx'):
        # Context + generation format
        text = example['ctx'] + example['gen']
    elif example.get('txt'):
        # Simple text format
        text = example['txt']
    else:
        assert False, "Unknown format in example: {}".format(example)
    print("TEXT:", text)
    return {"text": text}
# Load dataset using datasets library
ds = load_dataset('json', data_files=DATASET_FILE, split='train')
# Take only the number of samples we need and shuffle
ds = ds.shuffle(seed=42).select(range(min(NUM_CALIBRATION_SAMPLES, len(ds))))
# Apply formatting function
ds = ds.map(format_data)
# Configure the quantization algorithm to run.
# NOTE: vllm currently does not support asym MoE, using symmetric here
recipe = [
    AWQModifier(bits=4, symmetric=False),
    QuantizationModifier(
        ignore=["lm_head", "norm", "embed_tokens"],
        config_groups={
            "group_0": QuantizationScheme(
                targets=["Linear"],
                weights=QuantizationArgs(
                    num_bits=4,
                    type=QuantizationType.INT,
                    dynamic=False,
                    symmetric=False,
                    strategy=QuantizationStrategy.GROUP,
                    group_size=128,
                ),
            )
        },
    ),
]
SAVE_DIR = MODEL_ID + "-awq2"
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    output_dir=SAVE_DIR,
)
tokenizer.save_pretrained(SAVE_DIR)
print('Done!')
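For scale, a rough estimate under the unverified assumption that calibration caches per-sample hidden states in fp32 on CPU, and assuming a hidden size of 5120 for Qwen3-32B; this only illustrates the order of magnitude, not how AWQModifier actually stores activations:
samples, seq_len, hidden = 250, 1200, 5120  # hidden size assumed for Qwen3-32B
cache_gib = samples * seq_len * hidden * 4 / 1024**3  # fp32 bytes -> GiB
print(f"~{cache_gib:.1f} GiB per cached hidden-state tensor")  # ~5.7 GiB
Holding many such tensors at once (one or more per decoder layer) would plausibly reach the TB range observed.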
Errors
If applicable, add a full print-out of any errors or exceptions that are raised or include screenshots to help explain your problem.
Additional context
Add any other context about the problem here. Also include any relevant files.
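For reference, format_data above accepts three record shapes in calib.jsonl; illustrative (not actual) records can be generated like this:
import json

# Hypothetical records matching the three branches of format_data.
samples = [
    {"messages": [{"role": "user", "content": "Hello"},
                  {"role": "assistant", "content": "Hi there!"}]},
    {"ctx": "The capital of France is", "gen": " Paris."},
    {"txt": "A plain text calibration sample."},
]

with open("calib.jsonl", "w", encoding="utf-8") as f:
    for record in samples:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")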