
Introduce return_prompt request option to api_server entrypoint #1232

Closed

Conversation

@danilopeixoto
Contributor

danilopeixoto commented Sep 30, 2023

Introduce a return_prompt request option to the api_server entrypoint. The default value for return_prompt is False.

Issue: #1043
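
(For illustration, a minimal sketch of the intended behaviour, not the PR diff: it assumes the demo server currently builds its response as prompt + generated text, and that the proposed return_prompt flag, defaulting to False, would make prepending the prompt opt-in.)

# A minimal, hypothetical sketch of the proposed behaviour (not the PR diff).
# Assumes the demo server builds its response as prompt + generated text, and that
# return_prompt (default False) makes prepending the prompt opt-in.
from typing import List


def build_text_outputs(prompt: str, generated: List[str], return_prompt: bool = False) -> List[str]:
    if return_prompt:
        return [prompt + text for text in generated]
    return generated


# With the default (False), only the completions are returned:
print(build_text_outputs("Hello", [" world"]))                      # [' world']
print(build_text_outputs("Hello", [" world"], return_prompt=True))  # ['Hello world']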

danilopeixoto changed the title from "Introduce return_prompt option to api_server entrypoint" to "Introduce return_prompt request option to api_server entrypoint" on Sep 30, 2023
@viktor-ferenczi
Contributor

Please consider implementing the echo option in the OpenAI compatible API server (endpoints/openai/api_server.py):

See: https://platform.openai.com/docs/api-reference/completions/create
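
(For reference, echo is a documented request-body field of the completions endpoint linked above; a minimal illustrative payload, with placeholder model and prompt values:)

# Illustrative request body for POST /v1/completions with echo enabled.
# Field names follow the linked OpenAI documentation; the values are placeholders.
payload = {
    "model": "facebook/opt-125m",  # placeholder model name
    "prompt": "Say hello",
    "max_tokens": 16,
    "echo": True,  # ask the server to return the prompt along with the completion
}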

@viktor-ferenczi
Contributor

Add simple test cases to cover the flags.

@danilopeixoto
Contributor Author

Add simple test cases to cover the flags.

I will add it.

@danilopeixoto
Contributor Author

Please consider implementing the echo option in the OpenAI compatible API server (endpoints/openai/api_server.py):

See: https://platform.openai.com/docs/api-reference/completions/create

I'm considering it. Should I add it in the same PR or a new one?

@viktor-ferenczi
Contributor

Same PR. They belong together, same feature.

@zhuohan123
Member

zhuohan123 commented Oct 8, 2023

Thank you for your contribution! Our goal for api_server.py is to provide a minimal example of an API server, and adding this parameter would add unnecessary complexity. For your use case, you can make a copy of api_server.py and modify the server to your needs accordingly. Feel free to submit a new PR if you believe this is needed. In addition, we will support echo (#959) in our OpenAI-compatible endpoint.

zhuohan123 closed this Oct 8, 2023
@danilopeixoto
Contributor Author

danilopeixoto commented Oct 21, 2023

@zhuohan123 The addition does not look complex at all, at least not as complex as the stream option. We were using this endpoint to run benchmarks against another server, which has an option to return only the generated text.

Please let me know if I can open a new PR. Thanks anyway!

@PeterXiaTian

Has this issue been resolved? Why did it still show up when I just tested it?

@PeterXiaTian

Can we use res["text"][0].replace(prompt, "") to solve this problem?
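
(Illustratively, that client-side workaround assumes the returned text begins with the prompt verbatim; a minimal sketch with a made-up response shape:)

# Hypothetical sketch of the suggested client-side workaround (not vLLM code).
# It only works if the server returns exactly prompt + completion, unmodified.
prompt = "Say hello"
res = {"text": ["Say hello, world!"]}  # example shape of the demo /generate response

generated_only = res["text"][0].replace(prompt, "", 1)  # strip the first occurrence of the prompt
print(generated_only)  # ", world!"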

@Kaotic3

Kaotic3 commented Mar 15, 2024

"can use res["text"][0].replace(promopt,"") to solve this problem??"

You can in simple examples, but for a RAG workflow this is not going to work, as you can be sending complex data that is formatted when sent and therefore won't be identical when it is returned.

The only way to do that would be to change every document in your RAG workflow to mirror the prompt.

I see this is closed, but I don't see "return_prompt" in the usage for the vLLM api server:

usage: api_server.py [-h] [--host HOST] [--port PORT] [--allow-credentials] [--allowed-origins ALLOWED_ORIGINS] [--allowed-methods ALLOWED_METHODS]
                     [--allowed-headers ALLOWED_HEADERS] [--api-key API_KEY] [--served-model-name SERVED_MODEL_NAME]
                     [--lora-modules LORA_MODULES [LORA_MODULES ...]] [--chat-template CHAT_TEMPLATE] [--response-role RESPONSE_ROLE] [--ssl-keyfile SSL_KEYFILE]
                     [--ssl-certfile SSL_CERTFILE] [--root-path ROOT_PATH] [--middleware MIDDLEWARE] [--model MODEL] [--tokenizer TOKENIZER] [--revision REVISION]
                     [--code-revision CODE_REVISION] [--tokenizer-revision TOKENIZER_REVISION] [--tokenizer-mode {auto,slow}] [--trust-remote-code]
                     [--download-dir DOWNLOAD_DIR] [--load-format {auto,pt,safetensors,npcache,dummy}] [--dtype {auto,half,float16,bfloat16,float,float32}]
                     [--kv-cache-dtype {auto,fp8_e5m2}] [--max-model-len MAX_MODEL_LEN] [--worker-use-ray] [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE]
                     [--tensor-parallel-size TENSOR_PARALLEL_SIZE] [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS] [--block-size {8,16,32}]
                     [--seed SEED] [--swap-space SWAP_SPACE] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION] [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS]
                     [--max-num-seqs MAX_NUM_SEQS] [--max-paddings MAX_PADDINGS] [--disable-log-stats] [--quantization {awq,gptq,squeezellm,None}]
                     [--enforce-eager] [--max-context-len-to-capture MAX_CONTEXT_LEN_TO_CAPTURE] [--disable-custom-all-reduce] [--enable-lora]
                     [--max-loras MAX_LORAS] [--max-lora-rank MAX_LORA_RANK] [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE]
                     [--lora-dtype {auto,float16,bfloat16,float32}] [--max-cpu-loras MAX_CPU_LORAS] [--device {cuda}] [--engine-use-ray] [--disable-log-requests]
                     [--max-log-len MAX_LOG_LEN]

@ywang96
Member

ywang96 commented Mar 15, 2024

Hello @Kaotic3! This issue is closed because api_server.py will no longer be maintained and the OpenAI compatible server is recommended instead. See the comments:

NOTE: This API server is used only for demonstrating usage of AsyncEngine
and simple performance benchmarks. It is not intended for production use.
For production use, we recommend using our OpenAI compatible server.
We are also not going to accept PRs modifying this file, please
change `vllm/entrypoints/openai/api_server.py` instead.

@Kaotic3

Kaotic3 commented Mar 15, 2024

Does this mean that we don't get the echo command then?

It isn't really even a problem with vLLM; it is a problem with outlines served via vLLM. But I figured the flags are the same on both, so if it existed here, then I could use it in outlines.

@ywang96
Member

ywang96 commented Mar 15, 2024

The echo option is only available in the openai compatible server afaik.

@Kaotic3

Kaotic3 commented Mar 15, 2024

It isn't, as that list is from the OpenAI compatible server.

--echo isn't in the list.

Just to add because I know it might be confusing, the command for the openai compatible is:

python -m vllm.entrypoints.openai.api_server \
--model facebook/opt-125m

So it is still api_server.py - even when using the openai version.

@ywang96
Member

ywang96 commented Mar 15, 2024

You don't pass it as an argument when launching the server. It should be specified in your payload to the v1/completions endpoint.

if request.echo and request.max_tokens == 0:
    # only return the prompt
    delta_text = res.prompt
    delta_token_ids = res.prompt_token_ids
    top_logprobs = res.prompt_logprobs
    has_echoed[i] = True
elif (request.echo and request.max_tokens > 0
      and not has_echoed[i]):
    # echo the prompt and first token
    delta_text = res.prompt + output.text
    delta_token_ids = (res.prompt_token_ids +
                       output.token_ids)
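
(As an illustration, such a request might look like the sketch below; the URL and model name are placeholders for a locally running OpenAI-compatible server.)

# Hypothetical client-side sketch: echo is passed in the request body to /v1/completions.
# Assumes an OpenAI-compatible vLLM server running locally; URL and model are placeholders.
import requests

response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "facebook/opt-125m",
        "prompt": "Say hello",
        "max_tokens": 16,
        "echo": True,  # return the prompt together with the generated text
    },
)
print(response.json()["choices"][0]["text"])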

@Kaotic3

Kaotic3 commented Mar 15, 2024

I see, it is false by default, so I hadn't really utilised it before.

I will try it with outlines. Thanks man, appreciate the assistance!
