
AutoAWQForCausalLM requires the download of pile-val-backup #2458

Closed
@e576082c

Description


I installed vllm to automatically run some tests on a bunch of Mistral-7B models (which I cooked up locally and do NOT want to upload to Hugging Face before properly testing them). The plan is to:

  1. Convert the fp16 safetensors model to AWQ with AutoAWQ, but don't save it; keep it in memory. I don't have any free space left on my HDD, and I don't want to litter it with a bunch of AWQ-quantized experimental models anyway, failed or not. Each of the fp16 safetensors models on disk is around 14.5 GB, and the quantized models should be much smaller (judging by TheBloke's uploads on HF, around 4.15 GB in safetensors format). Both the quantized and the unquantized model would fit into my 64 GB of system RAM, even at the same time.
  2. Run the tests (inference, if you like) on the quantized model (taken from RAM, loaded into VRAM) with vllm. vllm is GPU-only, and one AWQ-quantized 7B model would easily fit into my 12 GB of VRAM (NVIDIA card with working CUDA); after all, I have no problem running Q8 GGUF 7B Mistral models fully offloaded to the GPU in kobold.cpp. The point is that vllm should be even faster with a smaller 4-bit AWQ 7B model, and unlike kobold.cpp, I can write a Python program with vllm that iterates over every model I want to test and runs all the tests quickly in one batch.
  3. Before over-complicating things with testing logic and my spaghetti code looping over multiple model folders, the first logical step is to check whether the AutoAWQ+vllm idea works at all (can it spit out some text from a known-good model or not?).
  4. Here's the problem: AutoAWQForCausalLM wants to download mit-han-lab/pile-val-backup for no obvious reason, without any explanation or warning. My disk is full, so it wouldn't fit there anyway. I could mount a tmpfs (RAM disk), put the "pile-val-backup" files in it, and bind-mount it wherever AutoAWQForCausalLM expects them (or feed it local calibration text instead; see the sketch after this list), but this download looks suspicious. Why is this dataset required, and why is there no statement anywhere of what its license might be? The dataset card says: "Please respect the original license of the dataset." But the license is not stated anywhere! And the "backup" in the name is another warning sign that it might be something rogue that was taken down and then re-uploaded by someone random on the net.
  5. So of course I looked around, and I found only this dismissed, ignored issue in the AutoAWQ repo. The "solution" says: "Perhaps your network is blocking Huggingface?" You don't say. Why would I want to download this dataset in the first place? Issue closed and ignored. Everybody is happy, just download it without thinking, right? Well, no. I don't think so.
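
The sketch I mentioned in point 4: if I read the AutoAWQ API right, quantize() accepts a calib_data argument, so in principle local text could be used for calibration instead of the remote dataset. This is an untested sketch, assuming my installed AutoAWQ version really supports passing a list of strings as calib_data; the calibration file path is made up:

# Untested sketch: pass local calibration text to quantize() instead of letting
# it download mit-han-lab/pile-val-backup. Assumes this AutoAWQ version accepts
# a list of strings via the calib_data keyword; the file path is hypothetical.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)

# Hypothetical local text file, one calibration sample per line.
with open("/mnt/AI/calibration/local_samples.txt") as f:
    calib_texts = [line.strip() for line in f if line.strip()]

model.quantize(tokenizer, quant_config=quant_config, calib_data=calib_texts)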

So, back to the vllm issue and how all of this relates to vllm:

For quick testing, I copy-pasted and modified some code from docs.vllm.ai/en/latest/quantization/auto_awq.html. My code isn't much different from the one in the official vllm docs, and this particular code triggers the download of "pile-val-backup".

Perhaps I messed something up in the code, but I honestly don't think so. Please have a look at it:

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7"
quant_config = { "zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM" }

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, **{"low_cpu_mem_usage": True}, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model=model, quantization="AWQ")

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

I suppose my code is quite trivial, almost the same as in the docs.
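
The main thing I changed compared to the docs is that I skip the save-to-disk step and pass the in-memory model object straight to LLM(), because there is no space to save. From memory (so this may not match the page word for word), the docs' flow is roughly: quantize, save_quantized(), then point LLM() at the saved folder; quant_path below is a made-up output directory:

# Roughly the flow from docs.vllm.ai (from memory, may not match exactly):
# quantize with AutoAWQ, save the result to disk, then load that folder with
# vLLM. quant_path is a hypothetical output directory; this save step is what
# I'm trying to avoid, since my HDD is full.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/mnt/AI/models/safetensors/loyal-piano-m7"
quant_path = "/mnt/AI/models/awq/loyal-piano-m7-awq"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, safetensors=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=False)
model.quantize(tokenizer, quant_config=quant_config)

# Save the quantized model, then let vLLM load it from the path.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

llm = LLM(model=quant_path, quantization="AWQ")
outputs = llm.generate(["Hello, my name is"], SamplingParams(temperature=0.8, top_p=0.95))
print(outputs[0].outputs[0].text)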
Ah, and before I forget it:

$ cd "/mnt/AI/models/safetensors/loyal-piano-m7"
$ ls
.
..
config.json
model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
model.safetensors.index.json
README.md
special_tokens_map.json
tokenizer_config.json
tokenizer.json
tokenizer.model
$ python3 --version
Python 3.11.2
$ pip show vllm
Name: vllm
Version: 0.2.7
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page: https://github.com/vllm-project/vllm
Author: vLLM Team
Author-email: 
License: Apache 2.0
Location: /mnt/AI/runner/venv/lib/python3.11/site-packages
Requires: aioprometheus, fastapi, ninja, numpy, psutil, pydantic, ray, sentencepiece, torch, transformers, uvicorn, xformers
Required-by: 
$ pip show torch
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /mnt/AI/runner/venv/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: accelerate, autoawq, lm_eval, peft, torchaudio, torchvision, vllm, xformers

The error I get is:

Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 13.46it/s]
Traceback (most recent call last):
  File "/mnt/AI/runner/vllm_autoAWQ_runner.py", line 12, in <module>
    model.quantize(tokenizer, quant_config=quant_config)
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/awq/models/base.py", line 89, in quantize
    quantizer = AwqQuantizer(
                ^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 36, in __init__
    self.modules, self.module_kwargs, self.inps = self.init_quant()
                                                  ^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/awq/quantize/quantizer.py", line 320, in init_quant
    samples = get_calib_dataset(
              ^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/awq/utils/calib_data.py", line 11, in get_calib_dataset
    dataset = load_dataset("mit-han-lab/pile-val-backup", split="validation")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/datasets/load.py", line 2523, in load_dataset
    builder_instance = load_dataset_builder(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/datasets/load.py", line 2195, in load_dataset_builder
    dataset_module = dataset_module_factory(
                     ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/AI/runner/venv/lib/python3.11/site-packages/datasets/load.py", line 1838, in dataset_module_factory
    raise ConnectionError(f"Couldn't reach the Hugging Face Hub for dataset '{path}': {e1}") from None
ConnectionError: Couldn't reach the Hugging Face Hub for dataset 'mit-han-lab/pile-val-backup': Offline mode is enabled.
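
The "Offline mode is enabled" part is expected: I run the runner with Hugging Face offline mode turned on, exactly so that nothing gets downloaded without me asking for it. The same ConnectionError can be reproduced without AutoAWQ at all, which I think shows the failing piece is just the hard-coded load_dataset() call. A minimal sketch, assuming HF_DATASETS_OFFLINE is the switch that controls offline mode in the datasets library:

# Minimal reproduction of the same failure, outside AutoAWQ. Assumes
# HF_DATASETS_OFFLINE is the environment switch for the datasets library's
# offline mode; it has to be set before `datasets` is imported.
import os
os.environ["HF_DATASETS_OFFLINE"] = "1"

from datasets import load_dataset

# The exact call from awq/utils/calib_data.py in the traceback above; with
# offline mode on it raises the same ConnectionError instead of downloading.
load_dataset("mit-han-lab/pile-val-backup", split="validation")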

And finally, here is some generic code that currently works, but it's slow, so I'm not happy with it and would like to use vllm instead:

import torch
from transformers import LlamaTokenizer, MistralForCausalLM, BitsAndBytesConfig, pipeline
from transformers import set_seed

set_seed(1)

# bitsandbytes 4-bit (NF4) quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=False,
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "/mnt/AI/models/safetensors/loyal-piano-m7"

# Load the tokenizer and the model with bitsandbytes 4-bit quantization
tokenizer = LlamaTokenizer.from_pretrained(model_name)

model = MistralForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=False,
    low_cpu_mem_usage=True,
)

# Text-generation pipeline on top of the quantized model
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)


prompt = "\nWikipedia is "

sequences = pipe(
    prompt,
    do_sample=True,
    max_new_tokens=512,
    temperature=1.0,
    top_k=1,
    top_p=1.0,
    typical_p=1.0,
    repetition_penalty=1.0,
    epsilon_cutoff=0.0,
    eta_cutoff=0.0,
    diversity_penalty=0.0,
    length_penalty=1.0,
    return_full_text=True,
    use_cache=False,
    num_return_sequences=1,
)

print(sequences[0]['generated_text'])

So... vllm doesn't work, while the generic code I put together from the Hugging Face docs does work, but it's too slow.

I would really like to try out vllm, but I won't download a random shady dataset (pile-val-backup) that AutoAWQ requires for whatever reason.

Please remove the dependency on "pile-val-backup".
