
Fix/async chat serving #2727

Merged
merged 13 commits on May 3, 2024
Fixed chat serving init in async case
schoennenbeck committed Feb 2, 2024
commit 64060f3899714d25c6a4bd9dda6fdb3e7659fe10
24 changes: 24 additions & 0 deletions tests/entrypoints/openai/test_serving_chat.py
@@ -0,0 +1,24 @@
import asyncio

import pytest

from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
from vllm import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"  # any model with a chat template should work here
CHAT_TEMPLATE = "Dummy chat template for testing"


async def _async_serving_chat_init():
    engine_args = AsyncEngineArgs(model=MODEL_NAME)
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    serving_completion = OpenAIServingChat(engine,
                                           served_model=MODEL_NAME,
                                           response_role="assistant",
                                           chat_template=CHAT_TEMPLATE)
    return serving_completion


def test_async_serving_chat_init():
    serving_completion = asyncio.run(_async_serving_chat_init())
    assert serving_completion.tokenizer is not None
    assert serving_completion.tokenizer.chat_template == CHAT_TEMPLATE
4 changes: 2 additions & 2 deletions tests/entrypoints/test_openai_server.py
@@ -53,7 +53,7 @@ def __del__(self):
self.proc.terminate()


@pytest.fixture(scope="session")
@pytest.fixture(scope="module")
def server():
ray.init()
server_runner = ServerRunner.remote([
@@ -70,7 +70,7 @@ def server():
ray.shutdown()


@pytest.fixture(scope="session")
@pytest.fixture(scope="module")
def client():
client = openai.AsyncOpenAI(
base_url="http://localhost:8000/v1",
17 changes: 15 additions & 2 deletions vllm/entrypoints/openai/serving_chat.py
@@ -1,3 +1,4 @@
import asyncio
import time
import codecs
from fastapi import Request
@@ -25,7 +26,16 @@ def __init__(self,
chat_template=None):
super().__init__(engine=engine, served_model=served_model)
self.response_role = response_role
self._load_chat_template(chat_template)
try:
event_loop = asyncio.get_running_loop()
except RuntimeError:
event_loop = None

if event_loop is not None and event_loop.is_running():  # If this is instantiated by Ray Serve, there is already a running event loop
event_loop.create_task(self._load_chat_template(chat_template))
else: # When using single vLLM without engine_use_ray
asyncio.run(self._load_chat_template(chat_template))

async def create_chat_completion(
self, request: ChatCompletionRequest, raw_request: Request
@@ -242,7 +252,10 @@ async def chat_completion_full_generator(

return response

def _load_chat_template(self, chat_template):
async def _load_chat_template(self, chat_template):
while self.tokenizer is None:
Member

I think this line is faulty. There is no code that actually initializes self.tokenizer = None before self._load_chat_template is called. You might want to instead use hasattr to check the existence of the attribute.

Contributor Author
@schoennenbeck schoennenbeck May 7, 2024

@DarkLight1337 You are right. Until two weeks ago ServingEngine set self.tokenizer = None in its __init__ but that changed in this commit.
The tests still pass because by the time _load_chat_template is awaited the tokenizer is now already there (which was the idea behind this in the first place). How do you want to handle this?

Contributor Author

I could open another PR simply replacing `while self.tokenizer is None` with `while getattr(self, "tokenizer", None) is None`.
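
For illustration, a minimal, self-contained sketch of that getattr-based wait; the class and helper names below are stand-ins for the real OpenAIServingChat code, not the actual vLLM implementation:

import asyncio


class ServingStub:
    """Stand-in for the serving object; only the waiting logic is sketched."""

    async def _wait_for_tokenizer(self):
        # `self.tokenizer` may not have been assigned yet, so check with getattr
        # instead of `self.tokenizer is None`, which would raise AttributeError
        # before the parent class sets the attribute.
        while getattr(self, "tokenizer", None) is None:
            await asyncio.sleep(0.1)
        return self.tokenizer


async def main():
    serving = ServingStub()
    waiter = asyncio.create_task(serving._wait_for_tokenizer())
    await asyncio.sleep(0.3)       # simulate the engine taking a while to load
    serving.tokenizer = object()   # parent class finally provides the tokenizer
    print(await waiter)


asyncio.run(main())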

Member

I guess inside tests/async_engine/test_chat_template.py, you can use another MockServingChat that doesn't have the chat_template attribute. However, I am doubtful about the value of such a test since it does not depend on OpenAIServing.__init__, making it useless if another commit changes the logic.

IMO, it would be more useful to test a situation where the tokenizer takes considerably longer to load, making it more likely that the chat template will be accessed before the engine has fully loaded.

Member
@DarkLight1337 DarkLight1337 May 7, 2024

Honestly, I think the current design is not that great as it puts async logic into __init__. It would be better if __init__ requires tokenizer and chat_template upfront so that developers are encouraged to place the async logic outside of the constructor.

To maintain the same functionality as the current __init__, we can have a separate async staticmethod factory that handles both the async work and the initialization.
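
A rough sketch of that factory idea; the names here (ServingChat, load_tokenizer, _FakeEngine) are invented for illustration and are not the actual vLLM interfaces:

import asyncio


class _FakeEngine:
    """Stand-in engine whose only async call mimics tokenizer/config loading."""

    async def load_tokenizer(self):
        await asyncio.sleep(0)  # pretend this takes a while (e.g. a Ray RPC)
        return object()


class ServingChat:
    """Illustrative only: the constructor receives fully prepared dependencies."""

    def __init__(self, tokenizer, chat_template):
        self.tokenizer = tokenizer
        self.chat_template = chat_template

    @staticmethod
    async def create(engine, chat_template):
        # All async work happens in the factory, keeping __init__ synchronous.
        tokenizer = await engine.load_tokenizer()
        return ServingChat(tokenizer, chat_template)


async def main():
    serving = await ServingChat.create(_FakeEngine(), chat_template="dummy")
    assert serving.tokenizer is not None


asyncio.run(main())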

Contributor Author

I agree with that sentiment. The only reason _post_init is async in the first place is that engine.get_model_config is async, and that in turn is only async in order to let the engine workers use Ray. So 95% of the code is already synchronous and the remaining 5% is only artificially asynchronous to enable Ray workers.

Contributor Author

Loading the chat_template used to be synchronous (before this PR), but that didn't mesh with the async code in the ServingEngine's __init__.

Member

> Honestly, I think the current design is not that great as it puts async logic into __init__. It would be better if __init__ requires tokenizer and chat_template upfront so that developers are encouraged to place the async logic outside of the constructor.
>
> To maintain the same functionality as the current __init__, we can have a separate async staticmethod factory that does both async and initialization.

If you don't mind, I'll work on a PR regarding this issue. This should make it no longer necessary to test the async behaviour of _load_chat_template, so the relevant tests will be removed.

Contributor Author

Feel free to open a PR, but I'm not sure what you mean regarding the tests. As long as OpenAIServingChat can still be instantiated in an async function, I'm fine with everything ;)

Member
@DarkLight1337 DarkLight1337 May 8, 2024

I mean that the chat template tests will be reverted to sync functions as opposed to async.

Edit: I have opened #4674, feel free to take a look.

# Give the parent class time to load the tokenizer
await asyncio.sleep(0.1)
if chat_template is not None:
try:
with open(chat_template, "r") as f:
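
For context, here is a self-contained sketch of the loop-detection pattern the new __init__ above relies on (schedule on the already-running loop when called from async code, otherwise drive a fresh loop with asyncio.run); the helper names are illustrative, not vLLM's:

import asyncio


async def _load_template(name):
    await asyncio.sleep(0)  # stand-in for the real async loading work
    return f"loaded {name}"


def schedule_load(name):
    """Run the coroutine on the current loop if one is running, else block on it."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        loop = None

    if loop is not None and loop.is_running():
        # Called from an async context (e.g. Ray Serve): schedule as a task.
        return loop.create_task(_load_template(name))
    # Called from a plain sync context: safe to start a fresh event loop.
    return asyncio.run(_load_template(name))


print(schedule_load("sync caller"))  # no running loop, so the asyncio.run path is taken


async def main():
    task = schedule_load("async caller")  # running loop, so the create_task path is taken
    print(await task)


asyncio.run(main())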