[Bug]: model loading failed for some Llama3.1 GGUF model #7268

Closed · Isotr0py opened this issue Aug 7, 2024 · 0 comments · Fixed by #7269

Labels
bug Something isn't working

Comments

Isotr0py (Collaborator) commented Aug 7, 2024

Your current environment

The output of `python collect_env.py`

🐛 Describe the bug

Bug description

  • Some Llama3.1 GGUF models converted with the updated llama.cpp conversion script from llama.cpp PR-8676 contain a rope_freqs.weight tensor.
  • This extra rope_freqs.weight tensor causes a weight-loading error, as reported in [Core] Support loading GGUF model #5191 (a sketch for inspecting the file follows this list).
  • Note that rope_freqs.weight is only used to speed up llama.cpp inference; it is not an actual weight of the source model.
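
For context, the extra tensor is visible directly in the GGUF file. Below is a minimal sketch for inspecting it, assuming the `gguf` Python package (from llama.cpp's gguf-py) is installed and the file has already been downloaded under the name used in the reproduction below:

```python
# Minimal sketch (assumption: the `gguf` package shipped with llama.cpp is installed).
# Lists the tensor names stored in the affected GGUF file.
import gguf

reader = gguf.GGUFReader("Meta-Llama-3.1-8B-Instruct-Q2_K.gguf")
tensor_names = [tensor.name for tensor in reader.tensors]
print("rope_freqs.weight" in tensor_names)  # prints True for the affected files
```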

Reproduction code

from huggingface_hub import hf_hub_download

from vllm import LLM, SamplingParams


def run_gguf_inference(model_path):
    PROMPT_TEMPLATE = "<|start_header_id|>system<|end_header_id|>\n" \
        "{system_message}<|eot_id|>\n" \
        "<|start_header_id|>user<|end_header_id|>\n" \
        "{prompt}<|eot_id|>\n" \
        "<|start_header_id|>assistant<|end_header_id|>"

    # PROMPT_TEMPLATE = "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"  # noqa: E501
    system_message = "You are a helpful assistant."  # noqa: E501
    # Sample prompts.
    prompts = [
        "How many helicopters can a human eat in one sitting?",
        "What's the future of AI?",
    ]
    prompts = [
        PROMPT_TEMPLATE.format(system_message=system_message, prompt=prompt)
        for prompt in prompts
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0, max_tokens=128)

    # Create an LLM.
    llm = LLM(model=model_path, max_model_len=4096)

    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    repo_id = "bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF"
    filename = "Meta-Llama-3.1-8B-Instruct-Q2_K.gguf"
    model = hf_hub_download(repo_id, filename=filename)
    run_gguf_inference(model)

Log

$ python examples/gguf_inference.py 
Meta-Llama-3.1-8B-Instruct-Q2_K.gguf: 100%|███████████████████████████████████████████████████████████████| 3.18G/3.18G [01:44<00:00, 30.5MB/s]
INFO 08-07 14:20:20 config.py:1484] Downcasting torch.float32 to torch.float16.
WARNING 08-07 14:20:20 config.py:286] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 08-07 14:20:20 arg_utils.py:769] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-07 14:20:20 config.py:853] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-07 14:20:20 llm_engine.py:176] Initializing an LLM engine (v0.5.4) with config: model='/root/.cache/huggingface/hub/models--bullerwins--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/7701ec9902c654dfb7a573622c6b0e32276a07e4/Meta-Llama-3.1-8B-Instruct-Q2_K.gguf', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--bullerwins--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/7701ec9902c654dfb7a573622c6b0e32276a07e4/Meta-Llama-3.1-8B-Instruct-Q2_K.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/root/.cache/huggingface/hub/models--bullerwins--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/7701ec9902c654dfb7a573622c6b0e32276a07e4/Meta-Llama-3.1-8B-Instruct-Q2_K.gguf, use_v2_block_manager=False, enable_prefix_caching=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 08-07 14:21:36 selector.py:218] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-07 14:21:36 selector.py:117] Using XFormers backend.
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-07 14:21:37 model_runner.py:721] Starting to load model /root/.cache/huggingface/hub/models--bullerwins--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/7701ec9902c654dfb7a573622c6b0e32276a07e4/Meta-Llama-3.1-8B-Instruct-Q2_K.gguf...
INFO 08-07 14:21:58 selector.py:218] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-07 14:21:58 selector.py:117] Using XFormers backend.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/kaggle/working/vllm/examples/gguf_inference.py", line 43, in <module>
[rank0]:     run_gguf_inference(model)
[rank0]:   File "/kaggle/working/vllm/examples/gguf_inference.py", line 29, in run_gguf_inference
[rank0]:     llm = LLM(model=model_path)
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 167, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 447, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 150, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 723, in load_model
[rank0]:     self.model = get_model(model_config=self.model_config,
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 1051, in load_model
[rank0]:     model.load_weights(
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 471, in load_weights
[rank0]:     for name, loaded_weight in weights:
[rank0]:   File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 439, in gguf_quant_weights_iterator
[rank0]:     name = gguf_to_hf_name_map[tensor.name]
[rank0]: KeyError: 'rope_freqs.weight'
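
The KeyError comes from the direct lookup `gguf_to_hf_name_map[tensor.name]` in `gguf_quant_weights_iterator`. As an illustration only (not necessarily the fix applied in #7269), a more tolerant iterator could skip tensors that have no HF counterpart:

```python
# Hypothetical variant of the failing lookup, for illustration only.
# Tensors without an HF counterpart (such as rope_freqs.weight) are skipped
# instead of raising KeyError.
def tolerant_gguf_weights_iterator(reader, gguf_to_hf_name_map):
    for tensor in reader.tensors:
        hf_name = gguf_to_hf_name_map.get(tensor.name)
        if hf_name is None:
            continue  # llama.cpp-only helper tensor, not part of the HF model
        yield hf_name, tensor
```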
Isotr0py added the bug label on Aug 7, 2024