Bug description
GGUF files produced by the conversion script updated in llama.cpp PR-8676 contain a rope_freqs.weight tensor. rope_freqs.weight is only used to speed up llama.cpp inference; it is not an exact weight of the source model, so vLLM's GGUF-to-HF name map has no entry for it and loading fails with the KeyError shown in the log below.
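As a quick sanity check, the extra tensor can be confirmed by listing the tensor names stored in the downloaded GGUF file. The sketch below is illustrative only; it assumes the standalone gguf Python package (the reader used by llama.cpp tooling) is installed, and it reuses the same repo and filename as the reproduce code.

from huggingface_hub import hf_hub_download
from gguf import GGUFReader

# Same checkpoint as in the reproduce code below.
path = hf_hub_download("bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF",
                       filename="Meta-Llama-3.1-8B-Instruct-Q2_K.gguf")

reader = GGUFReader(path)
# Print every tensor name in the GGUF file; 'rope_freqs.weight' shows up
# here even though it has no Hugging Face counterpart.
for tensor in reader.tensors:
    print(tensor.name)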
Reproduce code
from huggingface_hub import hf_hub_download

from vllm import LLM, SamplingParams


def run_gguf_inference(model_path):
    PROMPT_TEMPLATE = "<|start_header_id|>system<|end_header_id|>\n" \
                      "{system_message}<|eot_id|>\n" \
                      "<|start_header_id|>user<|end_header_id|>\n" \
                      "{prompt}<|eot_id|>\n" \
                      "<|start_header_id|>assistant<|end_header_id|>"
    # PROMPT_TEMPLATE = "<|system|>\n{system_message}</s>\n<|user|>\n{prompt}</s>\n<|assistant|>\n"  # noqa: E501
    system_message = "You are a helpful assistant."  # noqa: E501
    # Sample prompts.
    prompts = [
        "How many helicopters can a human eat in one sitting?",
        "What's the future of AI?",
    ]
    prompts = [
        PROMPT_TEMPLATE.format(system_message=system_message, prompt=prompt)
        for prompt in prompts
    ]
    # Create a sampling params object.
    sampling_params = SamplingParams(temperature=0, max_tokens=128)
    # Create an LLM.
    llm = LLM(model=model_path, max_model_len=4096)

    outputs = llm.generate(prompts, sampling_params)
    # Print the outputs.
    for output in outputs:
        prompt = output.prompt
        generated_text = output.outputs[0].text
        print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")


if __name__ == "__main__":
    repo_id = "bullerwins/Meta-Llama-3.1-8B-Instruct-GGUF"
    filename = "Meta-Llama-3.1-8B-Instruct-Q2_K.gguf"
    model = hf_hub_download(repo_id, filename=filename)
    run_gguf_inference(model)
Log
$ python examples/gguf_inference.py
Meta-Llama-3.1-8B-Instruct-Q2_K.gguf: 100%|███████████████████████████████████████████████████████████████| 3.18G/3.18G [01:44<00:00, 30.5MB/s]
INFO 08-07 14:20:20 config.py:1484] Downcasting torch.float32 to torch.float16.
WARNING 08-07 14:20:20 config.py:286] gguf quantization is not fully optimized yet. The speed can be slower than non-quantized models.
WARNING 08-07 14:20:20 arg_utils.py:769] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.
INFO 08-07 14:20:20 config.py:853] Chunked prefill is enabled with max_num_batched_tokens=512.
INFO 08-07 14:20:20 llm_engine.py:176] Initializing an LLM engine (v0.5.4) with config: model='/root/.cache/huggingface/hub/models--bullerwins--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/7701ec9902c654dfb7a573622c6b0e32276a07e4/Meta-Llama-3.1-8B-Instruct-Q2_K.gguf', speculative_config=None, tokenizer='/root/.cache/huggingface/hub/models--bullerwins--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/7701ec9902c654dfb7a573622c6b0e32276a07e4/Meta-Llama-3.1-8B-Instruct-Q2_K.gguf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.GGUF, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=gguf, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/root/.cache/huggingface/hub/models--bullerwins--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/7701ec9902c654dfb7a573622c6b0e32276a07e4/Meta-Llama-3.1-8B-Instruct-Q2_K.gguf, use_v2_block_manager=False, enable_prefix_caching=False)
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
INFO 08-07 14:21:36 selector.py:218] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-07 14:21:36 selector.py:117] Using XFormers backend.
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/envs/vllm/lib/python3.10/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 08-07 14:21:37 model_runner.py:721] Starting to load model /root/.cache/huggingface/hub/models--bullerwins--Meta-Llama-3.1-8B-Instruct-GGUF/snapshots/7701ec9902c654dfb7a573622c6b0e32276a07e4/Meta-Llama-3.1-8B-Instruct-Q2_K.gguf...
INFO 08-07 14:21:58 selector.py:218] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 08-07 14:21:58 selector.py:117] Using XFormers backend.
[rank0]: Traceback (most recent call last):
[rank0]: File "/kaggle/working/vllm/examples/gguf_inference.py", line 43, in<module>
[rank0]: run_gguf_inference(model)
[rank0]: File "/kaggle/working/vllm/examples/gguf_inference.py", line 29, in run_gguf_inference
[rank0]: llm = LLM(model=model_path)
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 167, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 447, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 251, in __init__
[rank0]: self.model_executor = executor_class(
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 47, in __init__
[rank0]: self._init_executor()
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 36, in _init_executor
[rank0]: self.driver_worker.load_model()
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 150, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 723, in load_model
[rank0]: self.model = get_model(model_config=self.model_config,
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]: return loader.load_model(model_config=model_config,
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 1051, in load_model
[rank0]: model.load_weights(
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/llama.py", line 471, in load_weights
[rank0]: for name, loaded_weight in weights:
[rank0]: File "/opt/conda/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/weight_utils.py", line 439, in gguf_quant_weights_iterator
[rank0]: name = gguf_to_hf_name_map[tensor.name]
[rank0]: KeyError: 'rope_freqs.weight'
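For reference, the traceback dies at the direct lookup name = gguf_to_hf_name_map[tensor.name] inside gguf_quant_weights_iterator. A guard of roughly the shape below would tolerate llama.cpp-only tensors such as rope_freqs.weight; this is a hypothetical sketch, not vLLM's actual loader code or the upstream fix, and iter_mapped_tensors is an invented name used only for illustration.

from gguf import GGUFReader


def iter_mapped_tensors(gguf_path, gguf_to_hf_name_map):
    """Yield (hf_name, tensor) pairs, skipping tensors with no HF mapping."""
    reader = GGUFReader(gguf_path)
    for tensor in reader.tensors:
        hf_name = gguf_to_hf_name_map.get(tensor.name)
        if hf_name is None:
            # llama.cpp-only helpers such as 'rope_freqs.weight' land here;
            # they are not weights of the source model and can be skipped.
            continue
        yield hf_name, tensor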