Default guidellm command only sends empty messages? #34

Closed
@mgoin

Description

Server:

```
vllm serve mistralai/Mistral-7B-Instruct-v0.3
```

Client:

```
guidellm --target "http://localhost:8000/v1" --model mistralai/Mistral-7B-Instruct-v0.3
```

These are some of the logs from my vLLM server:

```
INFO 08-27 16:00:20 logger.py:36] Received request chat-e82ea5381130441088e4bee0ed30630e: prompt: '<s>', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1], lora_request: None, prompt_adapter_request: None.
INFO:     127.0.0.1:36200 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-27 16:00:20 async_llm_engine.py:208] Added request chat-e82ea5381130441088e4bee0ed30630e.
INFO 08-27 16:00:20 logger.py:36] Received request chat-33ae83c188894acb930b69c7cb1fd56d: prompt: '<s>', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1], lora_request: None, prompt_adapter_request: None.
INFO:     127.0.0.1:36208 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 08-27 16:00:20 async_llm_engine.py:208] Added request chat-33ae83c188894acb930b69c7cb1fd56d.
INFO 08-27 16:00:20 async_llm_engine.py:176] Finished request chat-2368c52c95d84e238b97f17e28341af2.
INFO 08-27 16:00:20 logger.py:36] Received request chat-5575071f897f4ec9be95754d63fc937b: prompt: '<s>', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=256, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [1], lora_request: None, prompt_adapter_request: None.
INFO:     127.0.0.1:36210 - "POST /v1/chat/completions HTTP/1.1" 200 OK
```

You can see from the prompt: '<s>' entries that every prompt is just the BOS token, presumably added by the tokenizer; the actual message content is empty (prompt_token_ids: [1]).
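For reference, this is exactly what tokenizing an empty string produces, since the tokenizer prepends BOS by default. A minimal repro sketch with transformers (my own illustration, not guidellm code):

```python
from transformers import AutoTokenizer

# An empty message tokenizes to just the BOS token.
# Any Llama-style tokenizer that prepends BOS behaves the same way.
tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.3")

ids = tok("").input_ids
print(ids)                              # [1]
print(tok.convert_ids_to_tokens(ids))  # ['<s>']
```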

Is this intentional? Are users expected to define a range of prompt lengths themselves before any real prompt content is sent?
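For what it's worth, passing an explicit emulated data config looks like it might be the intended way to get non-empty prompts of a given length. The `--data-type`/`--data` flags below are taken from my reading of the README and may not match every version, so please verify against `guidellm --help`:

```
guidellm --target "http://localhost:8000/v1" \
  --model mistralai/Mistral-7B-Instruct-v0.3 \
  --data-type emulated \
  --data "prompt_tokens=512,generated_tokens=128"
```

If that is the expected usage, it would help if the default either did something similar out of the box or failed loudly instead of silently sending BOS-only prompts.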
