
Added echo function to OpenAI API server. #1504

Merged
merged 9 commits into vllm-project:main on Nov 27, 2023

Conversation

@wanmok (Contributor) commented Oct 30, 2023

This PR implements the echo feature for the OpenAI API server. It supersedes the previous PR #959 and is built on top of #1328. It also addresses issue #201.

For the OpenAI API server, the PR makes the following modifications:

  • Removed the error message returned when echo was requested.
  • Modified the create_logprobs function to reflect engine output changes.
  • Changed the streaming behavior to match OpenAI's implementation: the number of streamed payloads now equals the number of generated tokens times the number of choices, whereas the previous implementation yielded an additional empty end-of-stream token.
  • Edge case: added support for echo=True with max_tokens=0, which is a valid case in the OpenAI API. Here, SamplingParams.max_tokens is set to 1 and an echo_self flag is enabled; the flag is then used in post-processing to ensure the additionally generated token is removed from the final output (see the sketch after this list).
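
A minimal sketch of that max_tokens=0 handling, assuming a hypothetical helper; the helper name and wiring are illustrative, and only SamplingParams.max_tokens and the echo_self flag come from the description above:

def resolve_echo_only_request(echo: bool, requested_max_tokens: int):
    """Illustrative: handle echo=True with max_tokens=0 (valid in the OpenAI API)."""
    echo_self = False
    max_tokens = requested_max_tokens
    if echo and requested_max_tokens == 0:
        # The engine must generate at least one token, so request one and
        # remember to strip it from the final output during post-processing.
        max_tokens = 1
        echo_self = True
    # max_tokens is what would be passed to SamplingParams.max_tokens.
    return max_tokens, echo_self

In post-processing, when echo_self is set, the single extra generated token is dropped so the response contains only the echoed prompt.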

@wanmok (Contributor, Author) commented Oct 30, 2023

@zhuohan123 Requesting your review.

@zhuohan123 zhuohan123 self-requested a review October 31, 2023 13:55
@zhuohan123 (Member) left a comment

Thank you for your contribution! Please check the comments.

(Four review comments on vllm/entrypoints/openai/api_server.py were marked outdated and resolved.)
Comment on lines 175 to 211 (diff of create_logprobs; "-" marks removed lines, "+" marks added lines):

 def create_logprobs(
     token_ids: List[int],
+    top_logprobs: Optional[PromptLogprobs] = None,
+    num_output_top_logprobs: Optional[int] = None,
     initial_text_offset: int = 0,
 ) -> LogProbs:
     """Create OpenAI-style logprobs."""
     logprobs = LogProbs()
     last_token_len = 0
-    for token_id, id_logprob in zip(token_ids, id_logprobs):
+    if num_output_top_logprobs:
+        logprobs.top_logprobs = []
+    for i, token_id in enumerate(token_ids):
+        step_top_logprobs = top_logprobs[i]
+        if step_top_logprobs is not None:
+            token_logprob = step_top_logprobs[token_id]
+        else:
+            token_logprob = None
         token = tokenizer.convert_ids_to_tokens(token_id)
         logprobs.tokens.append(token)
-        logprobs.token_logprobs.append(id_logprob[token_id])
+        logprobs.token_logprobs.append(token_logprob)
         if len(logprobs.text_offset) == 0:
             logprobs.text_offset.append(initial_text_offset)
         else:
             logprobs.text_offset.append(logprobs.text_offset[-1] +
                                          last_token_len)
         last_token_len = len(token)

-        logprobs.top_logprobs.append({
-            tokenizer.convert_ids_to_tokens(i): p
-            for i, p in id_logprob.items()
-        })
+        if num_output_top_logprobs:
+            logprobs.top_logprobs.append({
+                tokenizer.convert_ids_to_tokens(i): p
+                for i, p in step_top_logprobs.items()
+                # Filter out additional logprobs for the chosen token
+                # This ensures the same number of top logprobs requested
+                if not (len(step_top_logprobs) > num_output_top_logprobs
+                        and i == token_id)
+            } if step_top_logprobs else None)
     return logprobs
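
For illustration only, a hypothetical call to the revised function for an echoed prompt; the token ids and logprob values below are invented, and create_logprobs (with the tokenizer it uses) comes from the surrounding api_server module. The point to note is that the first prompt position carries no logprob, so its entry is None:

prompt_token_ids = [32, 12423, 1122]        # invented ids for the echoed prompt
prompt_top_logprobs = [
    None,                                   # the first prompt token has no logprob
    {12423: -2.1, 198: -1.3},               # token_id -> logprob at position 2
    {1122: -0.4, 257: -1.9},                # token_id -> logprob at position 3
]
logprobs = create_logprobs(
    token_ids=prompt_token_ids,
    top_logprobs=prompt_top_logprobs,
    num_output_top_logprobs=2,
)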
zhuohan123 (Member):

Why do we need to change this function? The original function follows OpenAI's specification:

> Include the log probabilities on the logprobs most likely tokens, as well the chosen tokens. For example, if logprobs is 5, the API will return a list of the 5 most likely tokens. The API will always return the logprob of the sampled token, so there may be up to logprobs+1 elements in the response.

wanmok (Contributor, Author):

The actual service behavior differs from the documented specification. Run:

completion = openai.Completion.create(
    model="text-davinci-003", prompt="A robot may not injure a human being", n=1, echo=True, best_of=1,
    logprobs=2, max_tokens=5, temperature=2)

Response:

/Users/yunmochen/anaconda3/envs/vllm/bin/python /Users/yunmochen/research/vllm/examples/openai_completion_client.py 
Completion results:
{
  "warning": "This model version is deprecated. Migrate before January 4, 2024 to avoid disruption of service. Learn more https://platform.openai.com/docs/deprecations",
  "id": "cmpl-8Ft6vvaOBBcbTtdxF3P4pLX1NrP58",
  "object": "text_completion",
  "created": 1698797457,
  "model": "text-davinci-003",
  "choices": [
    {
      "text": "A robot may not injure a human being or muddy pave collectorCy",
      "index": 0,
      "logprobs": {
        "tokens": [
          "A",
          " robot",
          " may",
          " not",
          " injure",
          " a",
          " human",
          " being",
          " or",
          " muddy",
          " pave",
          " collector",
          "Cy"
        ],
        "token_logprobs": [
          null,
          -10.330451,
          -0.605383,
          -3.694869,
          -0.12196065,
          -0.007913817,
          -0.0015808053,
          -0.0075414344,
          -0.049919397,
          -19.736057,
          -11.607917,
          -18.587305,
          -14.40027
        ],
        "top_logprobs": [
          null,
          {
            "pi": -3.0112953,
            ".": -3.0572982
          },
          {
            " may": -0.605383,
            " must": -2.4434366
          },
          {
            " be": -0.22049057,
            " perform": -3.1058216
          },
          {
            " injure": -0.12196065,
            " harm": -2.4350166
          },
          {
            " a": -0.007913817,
            " or": -6.1911945
          },
          {
            " human": -0.0015808053,
            "\n": -7.691319
          },
          {
            " being": -0.0075414344,
            " or": -6.095392
          },
          {
            " or": -0.049919397,
            ",": -3.4204276
          },
          {
            ",": -0.013517476,
            " through": -4.427107
          },
          {
            " through": -1.5732663,
            " a": -1.6822541
          },
          {
            "ments": -0.031058038,
            " the": -5.4342465
          },
          {
            ",": -0.95207053,
            " through": -1.8842442
          }
        ],
        "text_offset": [
          0,
          1,
          7,
          11,
          15,
          22,
          24,
          30,
          36,
          39,
          45,
          50,
          60
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 5,
    "total_tokens": 13
  }
}

The API guarantees that the log probabilities of the chosen tokens are present in token_logprobs, but not that they appear in top_logprobs.
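
To make the two behaviors concrete, here is a small illustrative sketch (not code from this PR) of how a single top_logprobs entry could be built under either interpretation:

def top_logprobs_entry(step_logprobs, chosen_token, num_requested,
                       follow_documented_spec):
    """Illustrative only: build one token -> logprob entry for top_logprobs."""
    # Keep the num_requested most likely tokens at this step.
    entry = dict(sorted(step_logprobs.items(), key=lambda kv: kv[1],
                        reverse=True)[:num_requested])
    if follow_documented_spec:
        # Documented spec: the chosen token is always included, so the entry
        # may contain up to num_requested + 1 items.
        entry[chosen_token] = step_logprobs[chosen_token]
    # Otherwise (observed service behavior): exactly num_requested items; the
    # chosen token appears only if it is among the most likely ones.
    return entry

# Example: top_logprobs_entry({"a": -0.1, "b": -2.0, "c": -3.5}, "c", 2, True)
# -> {"a": -0.1, "b": -2.0, "c": -3.5}; with follow_documented_spec=False
# -> {"a": -0.1, "b": -2.0} (the chosen token "c" is dropped).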

Another commenter:

As another data point, the llama-cpp-python OpenAI-compatible server also includes the log probabilities of the chosen tokens in top_logprobs.

wanmok (Contributor, Author):

It comes down to whether we want to follow the service behavior OpenAI actually implements or the documented specification.

zhuohan123 (Member):

Let's follow OpenAI's API description and revert the changes here?

@wanmok (Contributor, Author) commented Nov 1, 2023

@zhuohan123 I've addressed the comments, except for the unresolved conversation above.

@wanmok wanmok requested review from zhuohan123 and tmostak November 7, 2023 06:35
@wanmok (Contributor, Author) commented Nov 11, 2023

@zhuohan123 Besides the design decision on the top-logprobs behavior, is there anything else you would like me to address?

@lifengjin commented:

any updates?

@wanmok (Contributor, Author) commented Nov 18, 2023

> any updates?

I believe we're still waiting for the team to complete another round of review. @zhuohan123 are there any remaining concerns with merging this PR?

@tmostak commented Nov 19, 2023

@zhuohan123 would it be possible to get this PR re-reviewed and merged? We're heavily using this branch, but it would be much better if the code were in main.

I haven't reviewed the code, but everything seems to work well (aside from the open question of whether the logprobs of the chosen tokens should be returned in top_logprobs).

@zhuohan123 (Member) left a comment

Hi, thanks again for your contribution! I believe that after these changes the PR should be ready to merge. Could you remove the other unnecessary style changes in this PR? Thanks!

Comment on lines 175 to 211 — the same create_logprobs hunk and unresolved top_logprobs discussion as above.

(Additional review comments on vllm/entrypoints/openai/api_server.py and vllm/entrypoints/openai/protocol.py were marked outdated and resolved.)
@wanmok wanmok requested a review from zhuohan123 November 25, 2023 15:42
@wanmok (Contributor, Author) commented Nov 25, 2023

@zhuohan123 I've addressed the comments. Please let me know if you find anything else to address.

@zhuohan123 (Member) commented Nov 26, 2023

Hi, thanks for fixing the issues! I tried running the following script to test:

import openai

# Modify OpenAI's API key and API base to use vLLM's API server.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

# List models API
models = openai.Model.list()
print("Models:", models)

model = models["data"][0]["id"]

# Completion API
stream = False
completion = openai.Completion.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=True,
    n=2,
    stream=stream,
    logprobs=3)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)

And got the following error on the server side:

Traceback (most recent call last):
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/zhuohan/vllm/vllm/vllm/entrypoints/openai/api_server.py", line 594, in create_completion
    logprobs = create_logprobs(
  File "/home/zhuohan/vllm/vllm/vllm/entrypoints/openai/api_server.py", line 193, in create_logprobs
    for i, p in step_top_logprobs.items()
AttributeError: 'NoneType' object has no attribute 'items'

@wanmok Can you double-check on your side? I believe the bug is caused by the fact that the first token of the prompt does not have a log probability, so its step_top_logprobs will be None.
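
For reference, a guard for that None case might look like the sketch below; the helper name and parameters are hypothetical and not necessarily the exact fix committed in the PR:

def append_step_top_logprobs(top_logprobs_out, step_top_logprobs,
                             convert_ids_to_tokens):
    """Illustrative guard: the first prompt token has no logprobs."""
    if step_top_logprobs is None:
        # Match OpenAI's output, which uses null for that position.
        top_logprobs_out.append(None)
    else:
        top_logprobs_out.append({
            convert_ids_to_tokens(i): p
            for i, p in step_top_logprobs.items()
        })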

@wanmok (Contributor, Author) commented Nov 26, 2023

> (quoting the test script and server-side traceback from the previous comment)

I've fixed and tested it. Please let me know if you find any other issues.

@zhuohan123 (Member) left a comment

LGTM! Thank you for your contribution!

@zhuohan123 zhuohan123 merged commit 665cbce into vllm-project:main Nov 27, 2023
2 checks passed
@wheel-is commented Dec 1, 2023

does this work? I'm getting "echo is not currently supported" @zhuohan123 @wanmok

@zhuohan123 (Member) commented:

> does this work? I'm getting "echo is not currently supported" @zhuohan123 @wanmok

Are you using the latest main branch? This PR hasn't been included in the latest release yet.

@lifengjin commented:

When will this be released? I work in a CUDA 11.8 environment and haven't figured out a way to compile the repo.

@wheel-is commented Dec 2, 2023

I'm having the same problem as @lifengjin. How do you get a pull request branch to compile? I'm running into an ABI compiler incompatibility error. @zhuohan123

File "/root/miniconda3/envs/aphroditee/lib/python3.10/runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/root/miniconda3/envs/aphroditee/lib/python3.10/runpy.py", line 110, in _get_module_details
import(pkg_name)
File "/vllm/vllm/init.py", line 3, in
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
File "/vllm/vllm/engine/arg_utils.py", line 6, in
from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
File "/vllm/vllm/config.py", line 9, in
from vllm.utils import get_cpu_memory
File "/vllm/vllm/utils.py", line 8, in
from vllm._C import cuda_utils
ModuleNotFoundError: No module named 'vllm._C'
