Server:
vllm serve mistralai/Mistral-7B-Instruct-v0.3
Client:
guidellm --target "http://localhost:8000/v1" --model mistralai/Mistral-7B-Instruct-v0.3
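For comparison, it can help to send one request to the same endpoint by hand and check what prompt shows up in the server log. Below is a minimal sketch using the `openai` Python client against vLLM's OpenAI-compatible API; the message content is just an illustrative placeholder, and `api_key="EMPTY"` assumes the server was started without `--api-key`:

```python
# Sanity check: send one chat request directly to the vLLM server and
# compare the resulting "Received request ... prompt:" log line against
# the ones produced by guidellm.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # vLLM does not require a real key unless --api-key is set
)

response = client.chat.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    messages=[{"role": "user", "content": "Write one sentence about benchmarking."}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```

A request like this shows up in the log with a full chat-templated prompt, whereas the guidellm requests below are logged with `prompt: '<s>'` only.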
These are some of the logs from my vLLM server:
INFO 08-27 16:00:20 logger.py:36] Received request chat-e82ea5381130441088e4bee0ed30630e: prompt: '<s>', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1], lora_request: None, prompt_adapter_request: None.
INFO: 127.0.0.1:36200 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-27 16:00:20 async_llm_engine.py:208] Added request chat-e82ea5381130441088e4bee0ed30630e.
INFO 08-27 16:00:20 logger.py:36] Received request chat-33ae83c188894acb930b69c7cb1fd56d: prompt: '<s>', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1], lora_request: None, prompt_adapter_request: None.
INFO: 127.0.0.1:36208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-27 16:00:20 async_llm_engine.py:208] Added request chat-33ae83c188894acb930b69c7cb1fd56d.
INFO 08-27 16:00:20 async_llm_engine.py:176] Finished request chat-2368c52c95d84e238b97f17e28341af2.
INFO 08-27 16:00:20 logger.py:36] Received request chat-5575071f897f4ec9be95754d63fc937b: prompt: '<s>', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1], lora_request: None, prompt_adapter_request: None.
INFO: 127.0.0.1:36210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
You can see from the prompt: '<s>' entries that every prompt consists of just the BOS token, which was presumably added by the tokenizer, so the requests carry no actual prompt text. Is this intentional? Are we expected to define some range of prompt lengths ourselves?
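For reference, here is a small sketch (using the `transformers` tokenizer for this model, which may require being logged in to Hugging Face since the repo is gated) confirming that the logged prompt_token_ids: [1] corresponds to the BOS token <s> alone:

```python
# Check what prompt_token_ids: [1] corresponds to for this model's tokenizer.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")
print(tok.bos_token, tok.bos_token_id)              # expected: <s> 1
print(tok.decode([1]))                              # expected: <s>
print(tok("", add_special_tokens=True).input_ids)   # expected: [1], i.e. BOS only
```

In other words, the benchmark requests appear to tokenize to an empty prompt plus the automatically prepended BOS token.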