BUG: llmcompressor oneshot function infinitely hangs with awq & cognitivecomputations/dolphin-2.8-mistral-7b-v02

**Describe the bug**
After finetuning a model, I try to quantize it with AWQ. After `Running AWQModifier calibration` is complete, the process infinitely hangs, and never completes.

**Expected behavior**
The model should be quantized with awq.

**Environment**
using 2 A100s.
Reusing the vllm standard to collect environment information:
```text
==============================
        System Info
==============================
OS                           : Ubuntu 20.04.6 LTS (x86_64)
GCC version                  : (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version                : Could not collect
CMake version                : version 3.26.4
Libc version                 : glibc-2.31

==============================
       PyTorch Info
==============================
PyTorch version              : 2.4.0+cu121
Is debug build               : False
CUDA used to build PyTorch   : 12.1
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform              : Linux-5.4.143.bsk.5-oci-amd64-x86_64-with-glibc2.31

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.1.105
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration : 
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB

Nvidia driver version        : 470.103.01
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   48 bits physical, 48 bits virtual
CPU(s):                          256
On-line CPU(s) list:             0-254
Off-line CPU(s) list:            255
Thread(s) per core:              1
Core(s) per socket:              64
Socket(s):                       2
NUMA node(s):                    8
Vendor ID:                       AuthenticAMD
CPU family:                      25
Model:                           1
Model name:                      AMD EPYC 7J13 64-Core Processor
Stepping:                        1
Frequency boost:                 enabled
CPU MHz:                         3235.477
CPU max MHz:                     2550.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        4899.56
Virtualization:                  AMD-V
L1d cache:                       2 MiB
L1i cache:                       2 MiB
L2 cache:                        32 MiB
L3 cache:                        256 MiB
NUMA node0 CPU(s):               0-15,128-143
NUMA node1 CPU(s):               16-31,144-159
NUMA node2 CPU(s):               32-47,160-175
NUMA node3 CPU(s):               48-63,176-191
NUMA node4 CPU(s):               64-79,192-207
NUMA node5 CPU(s):               80-95,208-223
NUMA node6 CPU(s):               96-111,224-239
NUMA node7 CPU(s):               112-127,240-254
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable, IBPB: conditional, IBRS_FW, STIBP: always-on, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca

==============================
Versions of relevant libraries
==============================
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchaudio==2.4.0
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.19.0
[pip3] transformers==4.51.3
[pip3] triton==3.0.0
[conda] blas                      1.0                         mkl  
[conda] cuda-cudart               12.1.105                      0    nvidia
[conda] cuda-cupti                12.1.105                      0    nvidia
[conda] cuda-libraries            12.1.0                        0    nvidia
[conda] cuda-nvrtc                12.1.105                      0    nvidia
[conda] cuda-nvtx                 12.1.105                      0    nvidia
[conda] cuda-opencl               12.2.140                      0    nvidia
[conda] cuda-runtime              12.1.0                        0    nvidia
[conda] ffmpeg                    4.3                  hf484d3e_0    pytorch
[conda] libcublas                 12.1.0.26                     0    nvidia
[conda] libcufft                  11.0.2.4                      0    nvidia
[conda] libcufile                 1.7.2.10                      0    nvidia
[conda] libcurand                 10.3.3.141                    0    nvidia
[conda] libcusolver               11.4.4.55                     0    nvidia
[conda] libcusparse               12.0.2.55                     0    nvidia
[conda] libjpeg-turbo             2.0.0                h9bf148f_0    pytorch
[conda] libnpp                    12.0.2.50                     0    nvidia
[conda] libnvjitlink              12.1.105                      0    nvidia
[conda] libnvjpeg                 12.1.1.14                     0    nvidia
[conda] mkl                       2023.1.0         h213fc3f_46343  
[conda] mkl-service               2.4.0           py310h5eee18b_1  
[conda] mkl_fft                   1.3.8           py310h5eee18b_0  
[conda] mkl_random                1.2.4           py310hdb19cb5_0  
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] nvidia-cublas-cu12        12.1.3.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.1.105                 pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.1.105                 pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.1.0.70                 pypi_0    pypi
[conda] nvidia-cufft-cu12         11.0.2.54                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.2.106               pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.4.5.107               pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.1.0.106               pypi_0    pypi
[conda] nvidia-ml-py              12.570.86                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.20.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.8.61                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.1.105                 pypi_0    pypi
[conda] pytorch-cuda              12.1                 ha16c6d3_5    pytorch
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] pyzmq                     26.2.0                   pypi_0    pypi
[conda] torch                     2.4.0                    pypi_0    pypi
[conda] torchaudio                2.4.0                    pypi_0    pypi
[conda] torchelastic              0.2.2                    pypi_0    pypi
[conda] torchvision               0.19.0                   pypi_0    pypi
[conda] transformers              4.51.3                   pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
Neuron SDK Version           : N/A
vLLM Version                 : 0.6.3.post1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
  	GPU0	GPU1	mlx5_0	mlx5_1	mlx5_2	mlx5_3	mlx5_4	mlx5_5	mlx5_6	mlx5_7	mlx5_8	mlx5_9	mlx5_10	mlx5_11	mlx5_12	mlx5_13	mlx5_14	mlx5_15	mlx5_16	mlx5_17	CPU Affinity	NUMA Affinity
GPU0	 X 	NV12	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	SYS	SYS	16-31,144-159	1
GPU1	NV12	 X 	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	SYS	SYS	16-31,144-159	1
mlx5_0	SYS	SYS	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_1	PXB	PXB	SYS	 X 	PIX	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_2	PXB	PXB	SYS	PIX	 X 	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_3	PXB	PXB	SYS	PXB	PXB	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_4	PXB	PXB	SYS	PXB	PXB	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_5	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	PXB	PXB		
mlx5_6	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	PXB	PXB		
mlx5_7	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	PXB	PXB	SYS	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_8	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	PXB	PXB	SYS	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_9	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	 X 	PIX	SYS	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_10	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PIX	 X 	SYS	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_11	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	SYS	SYS	SYSSYS	SYS	SYS		
mlx5_12	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	 X 	PIX	PXBPXB	SYS	SYS		
mlx5_13	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	 X 	PXBPXB	SYS	SYS		
mlx5_14	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	 X PIX	SYS	SYS		
mlx5_15	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PIX X 	SYS	SYS		
mlx5_16	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	 X 	PIX		
mlx5_17	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYSSYS	PIX	 X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=GPU-88ab310d-c026-8fc5-a2b7-a332b3d7ee6b,GPU-a053f2e6-934a-eb2f-c029-3953c7983dcc
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NCCL_IB_PCI_RELAXED_ORDERING=1
NCCL_VERSION=2.17.1-1
NCCL_SOCKET_IFNAME=eth0
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
NCCL_IB_HCA=
NVIDIA_PRODUCT_NAME=CUDA
NCCL_IB_GID_INDEX=
CUDA_VERSION=12.1.1
PYTORCH_VERSION=2.1.0
CUDA_MPS_PIPE_DIRECTORY=/dev/shm/pipe
NCCL_IB_TIMEOUT=23
CUDA_MPS_LOG_DIRECTORY=/dev/shm/nvidia-log
LD_LIBRARY_PATH=/opt/conda/lib/python3.10/site-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_IB_DISABLE=0
NCCL_IB_RETRY_CNT=7
CUDA_MODULE_LOADING=LAZY
```

**To Reproduce**
1. Download [cognitivecomputations/dolphin-2.8-mistral-7b-v02](https://huggingface.co/cognitivecomputations/dolphin-2.8-mistral-7b-v02) locallly
2. Download [mit-han-lab/pile-val-backup](https://huggingface.co/datasets/mit-han-lab/pile-val-backup/tree/main) dataset locally
3. Run the script below. make sure the model path and dataset path are correct to your filesystem.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "/home/ryanr/cognitivecomputations/dolphin-2.8-mistral-7b-v02"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

DATASET_ID = "/home/ryanr/mit-han-lab/pile-val-backup/val.jsonl"


NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512

ds = load_dataset("json", data_files=DATASET_ID, split="train")
ds = ds.shuffle(seed=42)

def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["text"]}],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)	

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]


oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)


# print("\n\n")
# print("========== SAMPLE GENERATION ==============")
# input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
# output = model.generate(input_ids, max_new_tokens=100)
# print(tokenizer.decode(output[0]))
# print("==========================================\n\n")



SAVE_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"
print(f'save model to {SAVE_DIR}')
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

**Errors**
There is no error, the process is stuck forever.

**Additional Context**
[cognitivecomputations/dolphin-2.8-mistral-7b-v02](https://huggingface.co/cognitivecomputations/dolphin-2.8-mistral-7b-v02)
[mit-han-lab/pile-val-backup](https://huggingface.co/datasets/mit-han-lab/pile-val-backup/tree/main)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

BUG: llmcompressor oneshot function infinitely hangs with awq & cognitivecomputations/dolphin-2.8-mistral-7b-v02 #1520

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

BUG: llmcompressor oneshot function infinitely hangs with awq & cognitivecomputations/dolphin-2.8-mistral-7b-v02 #1520

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions