
Added echo function to OpenAI API server. #1504

Merged
merged 9 commits into vllm-project:main on Nov 27, 2023

Conversation

@wanmok (Contributor) commented Oct 30, 2023

This PR implements the echo feature for the OpenAI API server. It supersedes the previous PR #959 and is built on top of #1328. It also addresses issue #201.

For the OpenAI API server, the PR makes the following modifications:

  • Removed the error message returned when echo was requested.
  • Modified the create_logprobs function to reflect engine output changes.
  • Changed the streaming behavior to match OpenAI's implementation: the number of streamed payloads now equals the number of generated tokens times the number of choices, whereas the previous implementation yielded an additional empty end-of-stream token.
  • Edge case: added support for echo=True with max_tokens=0, which is a valid case in the OpenAI API. Here, SamplingParams.max_tokens is set to 1 and an echo_self flag is enabled; the flag is then used in post-processing to ensure the additionally generated token is removed from the final output (see the sketch after this list).
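
A minimal sketch of that max_tokens=0 handling, assuming a hypothetical helper; the helper name and wiring are illustrative, and only SamplingParams.max_tokens and the echo_self flag come from the description above:

def resolve_echo_only_request(echo: bool, requested_max_tokens: int):
    """Illustrative: handle echo=True with max_tokens=0 (valid in the OpenAI API)."""
    echo_self = False
    max_tokens = requested_max_tokens
    if echo and requested_max_tokens == 0:
        # The engine must generate at least one token, so request one and
        # remember to strip it from the final output during post-processing.
        max_tokens = 1
        echo_self = True
    # max_tokens is what would be passed to SamplingParams.max_tokens.
    return max_tokens, echo_self

In post-processing, when echo_self is set, the single extra generated token is dropped so the response contains only the echoed prompt.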

@wanmok (Contributor, Author) commented Oct 30, 2023

@zhuohan123 Requesting your review.

@zhuohan123 zhuohan123 self-requested a review October 31, 2023 13:55
@zhuohan123 (Member) left a comment

Thank you for your contribution! Please check the comments.

(Four review comments on vllm/entrypoints/openai/api_server.py were marked outdated and resolved.)
Comment on lines 175 to 211 (diff of create_logprobs; "-" marks removed lines, "+" marks added lines):

 def create_logprobs(
     token_ids: List[int],
+    top_logprobs: Optional[PromptLogprobs] = None,
+    num_output_top_logprobs: Optional[int] = None,
     initial_text_offset: int = 0,
 ) -> LogProbs:
     """Create OpenAI-style logprobs."""
     logprobs = LogProbs()
     last_token_len = 0
-    for token_id, id_logprob in zip(token_ids, id_logprobs):
+    if num_output_top_logprobs:
+        logprobs.top_logprobs = []
+    for i, token_id in enumerate(token_ids):
+        step_top_logprobs = top_logprobs[i]
+        if step_top_logprobs is not None:
+            token_logprob = step_top_logprobs[token_id]
+        else:
+            token_logprob = None
         token = tokenizer.convert_ids_to_tokens(token_id)
         logprobs.tokens.append(token)
-        logprobs.token_logprobs.append(id_logprob[token_id])
+        logprobs.token_logprobs.append(token_logprob)
         if len(logprobs.text_offset) == 0:
             logprobs.text_offset.append(initial_text_offset)
         else:
             logprobs.text_offset.append(logprobs.text_offset[-1] +
                                          last_token_len)
         last_token_len = len(token)

-        logprobs.top_logprobs.append({
-            tokenizer.convert_ids_to_tokens(i): p
-            for i, p in id_logprob.items()
-        })
+        if num_output_top_logprobs:
+            logprobs.top_logprobs.append({
+                tokenizer.convert_ids_to_tokens(i): p
+                for i, p in step_top_logprobs.items()
+                # Filter out additional logprobs for the chosen token
+                # This ensures the same number of top logprobs requested
+                if not (len(step_top_logprobs) > num_output_top_logprobs
+                        and i == token_id)
+            } if step_top_logprobs else None)
     return logprobs
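
For illustration only, a hypothetical call to the revised function for an echoed prompt; the token ids and logprob values below are invented, and create_logprobs (with the tokenizer it uses) comes from the surrounding api_server module. The point to note is that the first prompt position carries no logprob, so its entry is None:

prompt_token_ids = [32, 12423, 1122]        # invented ids for the echoed prompt
prompt_top_logprobs = [
    None,                                   # the first prompt token has no logprob
    {12423: -2.1, 198: -1.3},               # token_id -> logprob at position 2
    {1122: -0.4, 257: -1.9},                # token_id -> logprob at position 3
]
logprobs = create_logprobs(
    token_ids=prompt_token_ids,
    top_logprobs=prompt_top_logprobs,
    num_output_top_logprobs=2,
)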
zhuohan123 (Member):

Why do we need to change this function? The original function follows OpenAI's specification:

> Include the log probabilities on the logprobs most likely tokens, as well the chosen tokens. For example, if logprobs is 5, the API will return a list of the 5 most likely tokens. The API will always return the logprob of the sampled token, so there may be up to logprobs+1 elements in the response.

wanmok (Contributor, Author):

The actual service behavior differs from the documented specification. Run:

completion = openai.Completion.create(
    model="text-davinci-003", prompt="A robot may not injure a human being", n=1, echo=True, best_of=1,
    logprobs=2, max_tokens=5, temperature=2)

Response:

/Users/yunmochen/anaconda3/envs/vllm/bin/python /Users/yunmochen/research/vllm/examples/openai_completion_client.py 
Completion results:
{
  "warning": "This model version is deprecated. Migrate before January 4, 2024 to avoid disruption of service. Learn more https://platform.openai.com/docs/deprecations",
  "id": "cmpl-8Ft6vvaOBBcbTtdxF3P4pLX1NrP58",
  "object": "text_completion",
  "created": 1698797457,
  "model": "text-davinci-003",
  "choices": [
    {
      "text": "A robot may not injure a human being or muddy pave collectorCy",
      "index": 0,
      "logprobs": {
        "tokens": [
          "A",
          " robot",
          " may",
          " not",
          " injure",
          " a",
          " human",
          " being",
          " or",
          " muddy",
          " pave",
          " collector",
          "Cy"
        ],
        "token_logprobs": [
          null,
          -10.330451,
          -0.605383,
          -3.694869,
          -0.12196065,
          -0.007913817,
          -0.0015808053,
          -0.0075414344,
          -0.049919397,
          -19.736057,
          -11.607917,
          -18.587305,
          -14.40027
        ],
        "top_logprobs": [
          null,
          {
            "pi": -3.0112953,
            ".": -3.0572982
          },
          {
            " may": -0.605383,
            " must": -2.4434366
          },
          {
            " be": -0.22049057,
            " perform": -3.1058216
          },
          {
            " injure": -0.12196065,
            " harm": -2.4350166
          },
          {
            " a": -0.007913817,
            " or": -6.1911945
          },
          {
            " human": -0.0015808053,
            "\n": -7.691319
          },
          {
            " being": -0.0075414344,
            " or": -6.095392
          },
          {
            " or": -0.049919397,
            ",": -3.4204276
          },
          {
            ",": -0.013517476,
            " through": -4.427107
          },
          {
            " through": -1.5732663,
            " a": -1.6822541
          },
          {
            "ments": -0.031058038,
            " the": -5.4342465
          },
          {
            ",": -0.95207053,
            " through": -1.8842442
          }
        ],
        "text_offset": [
          0,
          1,
          7,
          11,
          15,
          22,
          24,
          30,
          36,
          39,
          45,
          50,
          60
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 8,
    "completion_tokens": 5,
    "total_tokens": 13
  }
}

The API guarantees that the log probabilities of the chosen tokens are present in token_logprobs, but not that they appear in top_logprobs.
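
To make the two behaviors concrete, here is a small illustrative sketch (not code from this PR) of how a single top_logprobs entry could be built under either interpretation:

def top_logprobs_entry(step_logprobs, chosen_token, num_requested,
                       follow_documented_spec):
    """Illustrative only: build one token -> logprob entry for top_logprobs."""
    # Keep the num_requested most likely tokens at this step.
    entry = dict(sorted(step_logprobs.items(), key=lambda kv: kv[1],
                        reverse=True)[:num_requested])
    if follow_documented_spec:
        # Documented spec: the chosen token is always included, so the entry
        # may contain up to num_requested + 1 items.
        entry[chosen_token] = step_logprobs[chosen_token]
    # Otherwise (observed service behavior): exactly num_requested items; the
    # chosen token appears only if it is among the most likely ones.
    return entry

# Example: top_logprobs_entry({"a": -0.1, "b": -2.0, "c": -3.5}, "c", 2, True)
# -> {"a": -0.1, "b": -2.0, "c": -3.5}; with follow_documented_spec=False
# -> {"a": -0.1, "b": -2.0} (the chosen token "c" is dropped).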

Another commenter:

As another data point, the llama-cpp-python OpenAI-compatible server also includes the log probabilities of the chosen tokens in top_logprobs.

wanmok (Contributor, Author):

It comes down to whether we want to follow the service behavior OpenAI actually implements or the documented specification.

zhuohan123 (Member):

Let's follow OpenAI's API description and revert the changes here?

@wanmok (Contributor, Author) commented Nov 1, 2023

@zhuohan123 I've addressed the comments, except for the unresolved conversation above.

@wanmok wanmok requested review from zhuohan123 and tmostak November 7, 2023 06:35
@wanmok (Contributor, Author) commented Nov 11, 2023

@zhuohan123 Besides the design decision on the top-logprobs behavior, is there anything else you would like me to address?

@lifengjin commented:

any updates?

@wanmok (Contributor, Author) commented Nov 18, 2023

> any updates?

I believe we're still waiting for the team to complete another round of review. @zhuohan123 are there any remaining concerns with merging this PR?

@tmostak commented Nov 19, 2023

@zhuohan123 would it be possible to get this PR re-reviewed and merged? We're heavily using this branch, but it would be much better if the code were in main.

I haven't reviewed the code, but everything seems to work well (aside from the open question of whether the logprobs of the chosen tokens should be returned in top_logprobs).

@zhuohan123 (Member) left a comment

Hi, thanks again for your contribution! I believe that after these changes the PR should be ready to merge. Could you remove the other unnecessary style changes in this PR? Thanks!

Comment on lines 175 to 211 — the same create_logprobs hunk and unresolved top_logprobs discussion as above.

(Additional review comments on vllm/entrypoints/openai/api_server.py and vllm/entrypoints/openai/protocol.py were marked outdated and resolved.)
@wanmok wanmok requested a review from zhuohan123 November 25, 2023 15:42
@wanmok (Contributor, Author) commented Nov 25, 2023

@zhuohan123 I've addressed the comments. Please let me know if you find anything else to address.

@zhuohan123 (Member) commented Nov 26, 2023

Hi, thanks for fixing the issues! I tried running the following script to test:

import openai

# Modify OpenAI's API key and API base to use vLLM's API server.
openai.api_key = "EMPTY"
openai.api_base = "http://localhost:8000/v1"

# List models API
models = openai.Model.list()
print("Models:", models)

model = models["data"][0]["id"]

# Completion API
stream = False
completion = openai.Completion.create(
    model=model,
    prompt="A robot may not injure a human being",
    echo=True,
    n=2,
    stream=stream,
    logprobs=3)

print("Completion results:")
if stream:
    for c in completion:
        print(c)
else:
    print(completion)

And got the following error on the server side:

Traceback (most recent call last):
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/protocols/http/httptools_impl.py", line 426, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 84, in __call__
    return await self.app(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/applications.py", line 289, in __call__
    await super().__call__(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/applications.py", line 122, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 184, in __call__
    raise exc
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/errors.py", line 162, in __call__
    await self.app(scope, receive, _send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/cors.py", line 83, in __call__
    await self.app(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 79, in __call__
    raise exc
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 68, in __call__
    await self.app(scope, receive, sender)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 20, in __call__
    raise e
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/middleware/asyncexitstack.py", line 17, in __call__
    await self.app(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 718, in __call__
    await route.handle(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 276, in handle
    await self.app(scope, receive, send)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/starlette/routing.py", line 66, in app
    response = await func(request)
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 273, in app
    raw_response = await run_endpoint_function(
  File "/home/zhuohan/anaconda3/envs/vllm/lib/python3.10/site-packages/fastapi/routing.py", line 190, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/zhuohan/vllm/vllm/vllm/entrypoints/openai/api_server.py", line 594, in create_completion
    logprobs = create_logprobs(
  File "/home/zhuohan/vllm/vllm/vllm/entrypoints/openai/api_server.py", line 193, in create_logprobs
    for i, p in step_top_logprobs.items()
AttributeError: 'NoneType' object has no attribute 'items'

@wanmok Can you double-check on your side? I believe the bug is caused by the fact that the first token of the prompt does not have a log probability, so its step_top_logprobs will be None.
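
For reference, a guard for that None case might look like the sketch below; the helper name and parameters are hypothetical and not necessarily the exact fix committed in the PR:

def append_step_top_logprobs(top_logprobs_out, step_top_logprobs,
                             convert_ids_to_tokens):
    """Illustrative guard: the first prompt token has no logprobs."""
    if step_top_logprobs is None:
        # Match OpenAI's output, which uses null for that position.
        top_logprobs_out.append(None)
    else:
        top_logprobs_out.append({
            convert_ids_to_tokens(i): p
            for i, p in step_top_logprobs.items()
        })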

@wanmok (Contributor, Author) commented Nov 26, 2023

> (quoting the test script and server-side traceback from the previous comment)

I've fixed and tested it. Please let me know if you find any other issues.

@zhuohan123 (Member) left a comment

LGTM! Thank you for your contribution!

@zhuohan123 zhuohan123 merged commit 665cbce into vllm-project:main Nov 27, 2023
2 checks passed
@wheel-is commented Dec 1, 2023

does this work? I'm getting "echo is not currently supported" @zhuohan123 @wanmok

@zhuohan123 (Member) commented:

> does this work? I'm getting "echo is not currently supported" @zhuohan123 @wanmok

Are you using the latest main branch? This PR hasn't been included in the latest release yet.

@lifengjin commented:

When will this be released? I work in a CUDA 11.8 environment and haven't figured out a way to compile the repo.

@wheel-is commented Dec 2, 2023

I'm having the same problem as @lifengjin. How do you get a pull request branch to compile? I'm running into an ABI compiler incompatibility error. @zhuohan123

File "/root/miniconda3/envs/aphroditee/lib/python3.10/runpy.py", line 187, in _run_module_as_main
mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
File "/root/miniconda3/envs/aphroditee/lib/python3.10/runpy.py", line 110, in _get_module_details
import(pkg_name)
File "/vllm/vllm/init.py", line 3, in
from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
File "/vllm/vllm/engine/arg_utils.py", line 6, in
from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
File "/vllm/vllm/config.py", line 9, in
from vllm.utils import get_cpu_memory
File "/vllm/vllm/utils.py", line 8, in
from vllm._C import cuda_utils
ModuleNotFoundError: No module named 'vllm._C'
