Initial vLLM setup fails due to missing HuggingFace permissions #37

Open
@milank94

Description

When following the initial setup steps from https://github.com/tenstorrent/tt-inference-server/tree/main/vllm-tt-metal-llama3-70b#vllm-tt-metalium-llama-31-70b-inference-api, the server fails to start because the HF token does not have permission to download the config for meta-llama/Meta-Llama-3.1-70B.
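
For reference, the failing request is the fetch of config.json from the gated repo. A quick pre-flight check (a minimal sketch, assuming the token is exported as HF_TOKEN; the env var name used by the setup scripts may differ) that reproduces the same fetch vLLM performs at startup:

import os
from huggingface_hub import hf_hub_download
from huggingface_hub.errors import GatedRepoError

# Try to fetch the same config.json that vLLM requests when building the model config.
token = os.environ.get("HF_TOKEN")  # assumption: token is exported as HF_TOKEN
try:
    path = hf_hub_download(
        repo_id="meta-llama/Meta-Llama-3.1-70B",
        filename="config.json",
        token=token,
    )
    print(f"Token has access; config cached at {path}")
except GatedRepoError:
    print("Token is valid for the Hub, but access to the gated repo has not been granted.")

If this raises GatedRepoError, the container run below fails in the same way: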

docker run \
  --rm \
  -it \
  --env-file .env \
  --cap-add ALL \
  --device /dev/tenstorrent:/dev/tenstorrent \
  --volume /dev/hugepages-1G:/dev/hugepages-1G:rw \
  --volume ${PERSISTENT_VOLUME?ERROR env var PERSISTENT_VOLUME must be set}:/home/user/cache_root:rw \
  --shm-size 32G \
  --publish 7000:7000 \
  ghcr.io/tenstorrent/tt-inference-server/tt-metal-llama3-70b-src-base-vllm:v0.0.1-tt-metal-385904186f81-384f1790c3be
2024-11-15 09:07:01.845 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2024-11-15 09:07:01.847 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.Conv1d be migrated to C++?
2024-11-15 09:07:01.848 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.conv2d be migrated to C++?
2024-11-15 09:07:01.851 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.from_torch be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.to_torch be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.to_device be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.from_device be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.deallocate be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.reallocate be migrated to C++?
2024-11-15 09:07:01.852 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.load_tensor be migrated to C++?
2024-11-15 09:07:01.853 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.dump_tensor be migrated to C++?
2024-11-15 09:07:01.853 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.as_tensor be migrated to C++?
2024-11-15 09:07:01.864 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.avg_pool2d be migrated to C++?
2024-11-15 09:07:01.881 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.conv2d be migrated to C++?
2024-11-15 09:07:01.881 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.avg_pool2d be migrated to C++?
2024-11-15 09:07:01.882 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.Conv1d be migrated to C++?
INFO 11-15 09:07:02 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
INFO 11-15 09:07:04 api_server.py:528] vLLM API server version 0.1.dev3062+g384f179
INFO 11-15 09:07:04 api_server.py:529] args: Namespace(allow_credentials=False, allowed_headers=['*'], allowed_methods=['*'], allowed_origins=['*'], api_key='eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ0ZWFtX2lkIjoidGVuc3RvcnJlbnQiLCJ0b2tlbl9pZCI6ImRlYnVnLXRlc3QifQ._1fGZrJLARFZgqe-aZNr5dO_gb1gtzFrqm-aWcNvGOo', block_size=64, chat_template=None, code_revision=None, collect_detailed_traces=None, config_format=<ConfigFormat.AUTO: 'auto'>, cpu_offload_gb=0, device='auto', disable_async_output_proc=False, disable_custom_all_reduce=False, disable_fastapi_docs=False, disable_frontend_multiprocessing=False, disable_log_requests=False, disable_log_stats=False, disable_logprobs_during_spec_decoding=None, disable_sliding_window=False, distributed_executor_backend=None, download_dir=None, dtype='auto', enable_auto_tool_choice=False, enable_chunked_prefill=None, enable_lora=False, enable_prefix_caching=False, enable_prompt_adapter=False, enforce_eager=False, fully_sharded_loras=False, gpu_memory_utilization=0.9, guided_decoding_backend='outlines', host=None, ignore_patterns=[], kv_cache_dtype='auto', limit_mm_per_prompt=None, load_format='auto', log_global_stats=False, long_lora_scaling_factors=None, lora_dtype='auto', lora_extra_vocab_size=256, lora_modules=None, max_context_len_to_capture=None, max_cpu_loras=None, max_log_len=None, max_logprobs=20, max_lora_rank=16, max_loras=1, max_model_len=131072, max_num_batched_tokens=131072, max_num_seqs=32, max_parallel_loading_workers=None, max_prompt_adapter_token=0, max_prompt_adapters=1, max_seq_len_to_capture=8192, middleware=[], mm_processor_kwargs=None, model='meta-llama/Meta-Llama-3.1-70B', model_loader_extra_config=None, multi_step_stream_outputs=True, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, num_gpu_blocks_override=None, num_lookahead_slots=0, num_scheduler_steps=10, num_speculative_tokens=None, otlp_traces_endpoint=None, override_neuron_config=None, pipeline_parallel_size=1, port=7000, preemption_mode=None, prompt_adapters=None, qlora_adapter_name_or_path=None, quantization=None, quantization_param_path=None, ray_workers_use_nsight=False, response_role='assistant', return_tokens_as_token_ids=False, revision=None, root_path=None, rope_scaling=None, rope_theta=None, scheduler_delay_factor=0.0, scheduling_policy='fcfs', seed=0, served_model_name=None, skip_tokenizer_init=False, spec_decoding_acceptance_method='rejection_sampler', speculative_disable_by_batch_size=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_model=None, speculative_model_quantization=None, ssl_ca_certs=None, ssl_cert_reqs=0, ssl_certfile=None, ssl_keyfile=None, swap_space=4, tensor_parallel_size=1, tokenizer=None, tokenizer_mode='auto', tokenizer_pool_extra_config=None, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_revision=None, tool_call_parser=None, tool_parser_plugin='', trust_remote_code=False, typical_acceptance_sampler_posterior_alpha=None, typical_acceptance_sampler_posterior_threshold=None, use_v2_block_manager=False, uvicorn_log_level='info', worker_use_ray=False)
INFO 11-15 09:07:04 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/652c9ad3-a61d-4e6d-ad7f-b9aa98c42c0d for IPC Path.
INFO 11-15 09:07:04 api_server.py:179] Started engine process with PID 41
Traceback (most recent call last):
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/tt-metal/python_env/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_vllm_api_server.py", line 50, in <module>
    main()
  File "run_vllm_api_server.py", line 46, in main
    runpy.run_module("vllm.entrypoints.openai.api_server", run_name="__main__")
  File "/usr/lib/python3.8/runpy.py", line 210, in run_module
    return _run_code(code, {}, init_globals, run_name, mod_spec)
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/user/vllm/vllm/entrypoints/openai/api_server.py", line 585, in <module>
    uvloop.run(run_server(args))
  File "/tt-metal/python_env/lib/python3.8/site-packages/uvloop/__init__.py", line 82, in run
    return loop.run_until_complete(wrapper())
  File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
  File "/tt-metal/python_env/lib/python3.8/site-packages/uvloop/__init__.py", line 61, in wrapper
    return await main
  File "/home/user/vllm/vllm/entrypoints/openai/api_server.py", line 552, in run_server
    async with build_async_engine_client(args) as engine_client:
  File "/usr/lib/python3.8/contextlib.py", line 171, in __aenter__
    return await self.gen.__anext__()
  File "/home/user/vllm/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
    async with build_async_engine_client_from_engine_args(
  File "/usr/lib/python3.8/contextlib.py", line 171, in __aenter__
    return await self.gen.__anext__()
  File "/home/user/vllm/vllm/entrypoints/openai/api_server.py", line 184, in build_async_engine_client_from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/user/vllm/vllm/engine/arg_utils.py", line 907, in create_engine_config
    model_config = self.create_model_config()
  File "/home/user/vllm/vllm/engine/arg_utils.py", line 843, in create_model_config
    return ModelConfig(
  File "/home/user/vllm/vllm/config.py", line 162, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/home/user/vllm/vllm/transformers_utils/config.py", line 148, in get_config
    if is_gguf or file_or_path_exists(model,
  File "/home/user/vllm/vllm/transformers_utils/config.py", line 86, in file_or_path_exists
    return file_exists(model, config_name, revision=revision, token=token)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 2833, in file_exists
    get_hf_file_metadata(url, token=token)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1666, in get_hf_file_metadata
    r = _request_wrapper(
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 364, in _request_wrapper
    response = _request_wrapper(
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 388, in _request_wrapper
    hf_raise_for_status(response)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_http.py", line 423, in hf_raise_for_status
    raise _format(GatedRepoError, message, response) from e
huggingface_hub.errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-67377fb8-30d086136440fa987c2a0f8c;382a4350-1869-4ac9-92f0-9a75a7b2e168)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3.1-70B/resolve/main/config.json.
Access to model meta-llama/Llama-3.1-70B is restricted. You must have access to it and be authenticated to access it. Please log in.
2024-11-15 09:07:05.966 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.pearson_correlation_coefficient be migrated to C++?
2024-11-15 09:07:05.966 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.Conv1d be migrated to C++?
2024-11-15 09:07:05.967 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.conv2d be migrated to C++?
2024-11-15 09:07:05.967 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.unsqueeze_to_4D be migrated to C++?
2024-11-15 09:07:05.967 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.from_torch be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.to_torch be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.to_device be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.from_device be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.allocate_tensor_on_device be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.copy_host_to_device_tensor be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.deallocate be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.reallocate be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.load_tensor be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.dump_tensor be migrated to C++?
2024-11-15 09:07:05.968 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.as_tensor be migrated to C++?
2024-11-15 09:07:05.970 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.avg_pool2d be migrated to C++?
2024-11-15 09:07:05.973 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.conv2d be migrated to C++?
2024-11-15 09:07:05.973 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.avg_pool2d be migrated to C++?
2024-11-15 09:07:05.973 | WARNING  | ttnn.decorators:operation_decorator:801 - Should ttnn.Conv1d be migrated to C++?
INFO 11-15 09:07:06 importing.py:10] Triton not installed; certain GPU-related functions will not be available.
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_http.py", line 406, in hf_raise_for_status
    response.raise_for_status()
  File "/tt-metal/python_env/lib/python3.8/site-packages/requests/models.py", line 1024, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 401 Client Error: Unauthorized for url: https://huggingface.co/meta-llama/Meta-Llama-3.1-70B/resolve/main/config.json

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/user/vllm/vllm/engine/multiprocessing/engine.py", line 391, in run_mp_engine
    engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
  File "/home/user/vllm/vllm/engine/multiprocessing/engine.py", line 135, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/home/user/vllm/vllm/engine/arg_utils.py", line 907, in create_engine_config
    model_config = self.create_model_config()
  File "/home/user/vllm/vllm/engine/arg_utils.py", line 843, in create_model_config
    return ModelConfig(
  File "/home/user/vllm/vllm/config.py", line 162, in __init__
    self.hf_config = get_config(self.model, trust_remote_code, revision,
  File "/home/user/vllm/vllm/transformers_utils/config.py", line 148, in get_config
    if is_gguf or file_or_path_exists(model,
  File "/home/user/vllm/vllm/transformers_utils/config.py", line 86, in file_or_path_exists
    return file_exists(model, config_name, revision=revision, token=token)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/hf_api.py", line 2833, in file_exists
    get_hf_file_metadata(url, token=token)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_validators.py", line 114, in _inner_fn
    return fn(*args, **kwargs)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 1666, in get_hf_file_metadata
    r = _request_wrapper(
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 364, in _request_wrapper
    response = _request_wrapper(
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/file_download.py", line 388, in _request_wrapper
    hf_raise_for_status(response)
  File "/tt-metal/python_env/lib/python3.8/site-packages/huggingface_hub/utils/_http.py", line 423, in hf_raise_for_status
    raise _format(GatedRepoError, message, response) from e
huggingface_hub.errors.GatedRepoError: 401 Client Error. (Request ID: Root=1-67377fbb-5fc35dbf6d3882202048c17d;da98cf33-f7e1-4af9-bcf8-5b925ac47dbf)

Cannot access gated repo for url https://huggingface.co/meta-llama/Meta-Llama-3.1-70B/resolve/main/config.json.
Access to model meta-llama/Llama-3.1-70B is restricted. You must have access to it and be authenticated to access it. Please log in.
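
The root cause is that meta-llama/Meta-Llama-3.1-70B is a gated repo: access has to be requested and approved on the Hugging Face model page, and the token passed into the container needs read scope. A minimal sketch for verifying this inside the container's Python environment (assuming huggingface_hub is importable there and the token is exposed as HF_TOKEN):

import os
from huggingface_hub import HfApi

api = HfApi(token=os.environ.get("HF_TOKEN"))  # assumption: token exposed as HF_TOKEN
print(api.whoami())  # confirms the token is valid and which account it belongs to
# file_exists() is the same call made in vllm/transformers_utils/config.py before loading the model;
# without an approved access request it raises the GatedRepoError shown above.
print(api.file_exists("meta-llama/Meta-Llama-3.1-70B", "config.json"))

Once whoami() succeeds and file_exists() returns True, the docker run above should get past the config download step.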

Labels

bug