Description
Describe the bug
After fine-tuning a model, I try to quantize it with AWQ. Once the AWQModifier calibration is complete, the process hangs indefinitely and never finishes.
Expected behavior
The model should be quantized with AWQ.
Environment
Using 2 A100s. Environment information was collected with the standard vLLM collect_env script:
==============================
System Info
==============================
OS : Ubuntu 20.04.6 LTS (x86_64)
GCC version : (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version : Could not collect
CMake version : version 3.26.4
Libc version : glibc-2.31
==============================
PyTorch Info
==============================
PyTorch version : 2.4.0+cu121
Is debug build : False
CUDA used to build PyTorch : 12.1
ROCM used to build PyTorch : N/A
==============================
Python Environment
==============================
Python version : 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform : Linux-5.4.143.bsk.5-oci-amd64-x86_64-with-glibc2.31
==============================
CUDA / GPU Info
==============================
Is CUDA available : True
CUDA runtime version : 12.1.105
CUDA_MODULE_LOADING set to : LAZY
GPU models and configuration :
GPU 0: NVIDIA A100-SXM4-40GB
GPU 1: NVIDIA A100-SXM4-40GB
Nvidia driver version : 470.103.01
cuDNN version : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True
==============================
CPU Info
==============================
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 48 bits physical, 48 bits virtual
CPU(s): 256
On-line CPU(s) list: 0-254
Off-line CPU(s) list: 255
Thread(s) per core: 1
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC 7J13 64-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 3235.477
CPU max MHz: 2550.0000
CPU min MHz: 1500.0000
BogoMIPS: 4899.56
Virtualization: AMD-V
L1d cache: 2 MiB
L1i cache: 2 MiB
L2 cache: 32 MiB
L3 cache: 256 MiB
NUMA node0 CPU(s): 0-15,128-143
NUMA node1 CPU(s): 16-31,144-159
NUMA node2 CPU(s): 32-47,160-175
NUMA node3 CPU(s): 48-63,176-191
NUMA node4 CPU(s): 64-79,192-207
NUMA node5 CPU(s): 80-95,208-223
NUMA node6 CPU(s): 96-111,224-239
NUMA node7 CPU(s): 112-127,240-254
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Vulnerable, IBPB: conditional, IBRS_FW, STIBP: always-on, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca
==============================
Versions of relevant libraries
==============================
[pip3] mypy==1.10.0
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.570.86
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.8.61
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchaudio==2.4.0
[pip3] torchelastic==0.2.2
[pip3] torchvision==0.19.0
[pip3] transformers==4.51.3
[pip3] triton==3.0.0
[conda] blas 1.0 mkl
[conda] cuda-cudart 12.1.105 0 nvidia
[conda] cuda-cupti 12.1.105 0 nvidia
[conda] cuda-libraries 12.1.0 0 nvidia
[conda] cuda-nvrtc 12.1.105 0 nvidia
[conda] cuda-nvtx 12.1.105 0 nvidia
[conda] cuda-opencl 12.2.140 0 nvidia
[conda] cuda-runtime 12.1.0 0 nvidia
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libcublas 12.1.0.26 0 nvidia
[conda] libcufft 11.0.2.4 0 nvidia
[conda] libcufile 1.7.2.10 0 nvidia
[conda] libcurand 10.3.3.141 0 nvidia
[conda] libcusolver 11.4.4.55 0 nvidia
[conda] libcusparse 12.0.2.55 0 nvidia
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] libnpp 12.0.2.50 0 nvidia
[conda] libnvjitlink 12.1.105 0 nvidia
[conda] libnvjpeg 12.1.1.14 0 nvidia
[conda] mkl 2023.1.0 h213fc3f_46343
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.8 py310h5eee18b_0
[conda] mkl_random 1.2.4 py310hdb19cb5_0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.570.86 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.8.61 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pytorch-cuda 12.1 ha16c6d3_5 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchaudio 2.4.0 pypi_0 pypi
[conda] torchelastic 0.2.2 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.51.3 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
==============================
vLLM Info
==============================
ROCM Version : Could not collect
Neuron SDK Version : N/A
vLLM Version : 0.6.3.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 GPU1 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 mlx5_10 mlx5_11 mlx5_12 mlx5_13 mlx5_14 mlx5_15 mlx5_16 mlx5_17 CPU Affinity NUMA Affinity
GPU0 X NV12 SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 16-31,144-159 1
GPU1 NV12 X SYS PXB PXB PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 16-31,144-159 1
mlx5_0 SYS SYS X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_1 PXB PXB SYS X PIX PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_2 PXB PXB SYS PIX X PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_3 PXB PXB SYS PXB PXB X PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_4 PXB PXB SYS PXB PXB PIX X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS
mlx5_5 SYS SYS SYS SYS SYS SYS SYS X PIX SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB
mlx5_6 SYS SYS SYS SYS SYS SYS SYS PIX X SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB
mlx5_7 SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX PXB PXB SYS SYS SYS SYS SYS SYS SYS
mlx5_8 SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X PXB PXB SYS SYS SYS SYS SYS SYS SYS
mlx5_9 SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB X PIX SYS SYS SYS SYS SYS SYS SYS
mlx5_10 SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX X SYS SYS SYS SYS SYS SYS SYS
mlx5_11 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X SYS SYS SYS SYS SYS SYS
mlx5_12 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX PXB PXB SYS SYS
mlx5_13 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X PXB PXB SYS SYS
mlx5_14 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB X PIX SYS SYS
mlx5_15 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS PXB PXB PIX X SYS SYS
mlx5_16 SYS SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS X PIX
mlx5_17 SYS SYS SYS SYS SYS SYS SYS PXB PXB SYS SYS SYS SYS SYS SYS SYS SYS SYS PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
==============================
Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=GPU-88ab310d-c026-8fc5-a2b7-a332b3d7ee6b,GPU-a053f2e6-934a-eb2f-c029-3953c7983dcc
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=450,driver<451 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=510,driver<511 brand=unknown,driver>=510,driver<511 brand=nvidia,driver>=510,driver<511 brand=nvidiartx,driver>=510,driver<511 brand=geforce,driver>=510,driver<511 brand=geforcertx,driver>=510,driver<511 brand=quadro,driver>=510,driver<511 brand=quadrortx,driver>=510,driver<511 brand=titan,driver>=510,driver<511 brand=titanrtx,driver>=510,driver<511 brand=tesla,driver>=515,driver<516 brand=unknown,driver>=515,driver<516 brand=nvidia,driver>=515,driver<516 brand=nvidiartx,driver>=515,driver<516 brand=geforce,driver>=515,driver<516 brand=geforcertx,driver>=515,driver<516 brand=quadro,driver>=515,driver<516 brand=quadrortx,driver>=515,driver<516 brand=titan,driver>=515,driver<516 brand=titanrtx,driver>=515,driver<516 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NCCL_IB_PCI_RELAXED_ORDERING=1
NCCL_VERSION=2.17.1-1
NCCL_SOCKET_IFNAME=eth0
NVIDIA_DRIVER_CAPABILITIES=compute,utility
NCCL_DEBUG=INFO
NCCL_IB_HCA=
NVIDIA_PRODUCT_NAME=CUDA
NCCL_IB_GID_INDEX=
CUDA_VERSION=12.1.1
PYTORCH_VERSION=2.1.0
CUDA_MPS_PIPE_DIRECTORY=/dev/shm/pipe
NCCL_IB_TIMEOUT=23
CUDA_MPS_LOG_DIRECTORY=/dev/shm/nvidia-log
LD_LIBRARY_PATH=/opt/conda/lib/python3.10/site-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NCCL_IB_DISABLE=0
NCCL_IB_RETRY_CNT=7
CUDA_MODULE_LOADING=LAZY
To Reproduce
- Download cognitivecomputations/dolphin-2.8-mistral-7b-v02 locally
- Download mit-han-lab/pile-val-backup dataset locally (see the download sketch after this list for one way to fetch both)
- Run the script below, making sure the model path and dataset path match your filesystem
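For reference, a minimal sketch of fetching both repos with huggingface_hub's snapshot_download; the local_dir paths are just the ones my repro script expects and are otherwise arbitrary:

from huggingface_hub import snapshot_download

# Fetch the model weights and the calibration dataset to local disk.
# These local_dir paths mirror the ones used in the script below;
# point them anywhere on your filesystem.
snapshot_download(
    repo_id="cognitivecomputations/dolphin-2.8-mistral-7b-v02",
    local_dir="/home/ryanr/cognitivecomputations/dolphin-2.8-mistral-7b-v02",
)
snapshot_download(
    repo_id="mit-han-lab/pile-val-backup",
    repo_type="dataset",
    local_dir="/home/ryanr/mit-han-lab/pile-val-backup",
)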
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "/home/ryanr/cognitivecomputations/dolphin-2.8-mistral-7b-v02"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

DATASET_ID = "/home/ryanr/mit-han-lab/pile-val-backup/val.jsonl"
NUM_CALIBRATION_SAMPLES = 256
MAX_SEQUENCE_LENGTH = 512

ds = load_dataset("json", data_files=DATASET_ID, split="train")
ds = ds.shuffle(seed=42)

# Wrap each raw text sample in the model's chat template.
def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            [{"role": "user", "content": example["text"]}],
            tokenize=False,
        )
    }

ds = ds.map(preprocess)

# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

# Apply tokenization; without this map call, tokenize() above is never used.
ds = ds.map(tokenize, remove_columns=ds.column_names)
# Configure the quantization algorithm to run.
recipe = [
    AWQModifier(ignore=["lm_head"], scheme="W4A16_ASYM", targets=["Linear"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)
# print("\n\n")
# print("========== SAMPLE GENERATION ==============")
# input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
# output = model.generate(input_ids, max_new_tokens=100)
# print(tokenizer.decode(output[0]))
# print("==========================================\n\n")
SAVE_DIR = MODEL_ID.split("/")[-1] + "-awq-asym"
print(f'save model to {SAVE_DIR}')
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
Errors
There is no error output; the process is simply stuck forever.
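To help localize the hang, here is a minimal sketch (an assumption on my part, not something the original run included): registering faulthandler at the top of the repro script lets you dump every thread's stack while the process is stuck, by sending kill -USR1 <pid> from another shell.

import faulthandler
import signal

# Not part of the original repro: dump a traceback for every thread to
# stderr when the process receives SIGUSR1 (`kill -USR1 <pid>`), which
# shows exactly which call the hung process is blocked in.
faulthandler.register(signal.SIGUSR1)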
Additional Context
cognitivecomputations/dolphin-2.8-mistral-7b-v02
mit-han-lab/pile-val-backup