Your current environment
Docker image vllm/vllm-openai:latest (vLLM 0.4.0.post1)
🐛 Describe the bug
docker run -d --runtime=nvidia --gpus '"device=1"' --shm-size=10.24gb \
    -p 5001:5001 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.config:$HOME/.config/ \
    -v "${HOME}"/.triton:$HOME/.triton/ \
    --network host \
    vllm/vllm-openai:latest \
    --port=5001 --host=0.0.0.0 \
    --model=microsoft/Phi-3-mini-128k-instruct \
    --seed 1234 --trust-remote-code \
    --tensor-parallel-size=1 \
    --max-num-batched-tokens=131072 \
    --max-log-len=100 \
    --download-dir=$HOME/.cache/huggingface/hub \
    &>> logs.vllm_server.phi3.txt
gives:
(h2ogpt) fsuser@e2e-77-235:~/h2ogpt_ops$ docker logs d7b0c7e07f4d6055cce27ba8e7244860463d89ad36aa7bf2e0e9e13ea7941843
INFO 04-24 06:29:08 api_server.py:149] vLLM API server version 0.4.0.post1
INFO 04-24 06:29:08 api_server.py:150] args: Namespace(host='0.0.0.0', port=5001, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], model='microsoft/Phi-3-mini-128k-instruct', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir='/home/fsuser/.cache/huggingface/hub', load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=1234, swap_space=4, gpu_memory_utilization=0.9, forced_num_gpu_blocks=None, max_num_batched_tokens=131072, max_num_seqs=256, max_logprobs=5, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='auto', image_input_type=None, image_token_id=None, image_input_shape=None, image_feature_size=None, scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, engine_use_ray=False, disable_log_requests=False, max_log_len=100)
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 157, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/workspace/vllm/engine/async_llm_engine.py", line 331, in from_engine_args
    engine_config = engine_args.create_engine_config()
  File "/workspace/vllm/engine/arg_utils.py", line 406, in create_engine_config
    model_config = ModelConfig(
  File "/workspace/vllm/config.py", line 125, in __init__
    self.max_model_len = _get_and_verify_max_len(self.hf_text_config,
  File "/workspace/vllm/config.py", line 969, in _get_and_verify_max_len
    assert "factor" in rope_scaling
AssertionError
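
For context on why the assertion trips: the Phi-3-mini-128k-instruct config ships a rope_scaling block that (as far as I can tell) uses "long_factor"/"short_factor" keys (the "su"/longrope variant) rather than the single "factor" key that _get_and_verify_max_len in vLLM 0.4.0.post1 expects. A minimal check that should reproduce the mismatch, assuming transformers is installed and the Hugging Face config is reachable:

from transformers import AutoConfig

# Pull only the model config (no weights) and inspect its rope_scaling block.
cfg = AutoConfig.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    trust_remote_code=True,
)
print(cfg.rope_scaling.keys())
# Expected: keys such as 'type', 'long_factor', 'short_factor' -- no 'factor',
# which is why `assert "factor" in rope_scaling` in vllm/config.py fails.

So this looks like the longrope/su scaling variant simply not being handled in 0.4.0.post1 rather than a problem with the docker invocation itself.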