Description
I am working with the OpenAI serving engines from the current main branch (Python 3.10). When I try to instantiate an `OpenAIServingChat` from a coroutine, I get the error `AttributeError: 'NoneType' object has no attribute 'chat_template'`.
Code Example
Here is some sample code to replicate the problem:
```python
from vllm import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine
from vllm.entrypoints.openai.serving_chat import OpenAIServingChat
import asyncio

async def main():
    model = "microsoft/phi-2"
    engine_args = AsyncEngineArgs(model=model)
    engine = AsyncLLMEngine.from_engine_args(engine_args)
    serving_chat = OpenAIServingChat(
        engine,
        served_model=model,
        response_role="assistant",
        chat_template=None,
    )

if __name__ == "__main__":
    asyncio.run(main())
```
If I turn the `main` coroutine into a regular function (just removing the `async`) and call it directly (without `asyncio.run`), everything works as expected.
Problem Investigation
From what I can tell, the problem is as follows: in the `__init__` of `OpenAIServing` (link), lines 25ff read:
```python
try:
    event_loop = asyncio.get_running_loop()
except RuntimeError:
    event_loop = None

if event_loop is not None and event_loop.is_running(
):  # If the current is instanced by Ray Serve, there is already a running event loop
    event_loop.create_task(self._post_init())
else:  # When using single vLLM without engine_use_ray
    asyncio.run(self._post_init())
```
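The branch condition can be checked in isolation. This standalone sketch (not vLLM code) mirrors the loop-detection logic quoted above:

```python
import asyncio

def detect_loop():
    """Mirror of the loop-detection logic above (standalone sketch)."""
    try:
        return asyncio.get_running_loop()
    except RuntimeError:
        return None

# Outside any coroutine there is no running loop:
assert detect_loop() is None

async def main():
    # Inside a coroutine, the running loop is returned:
    loop = detect_loop()
    assert loop is not None and loop.is_running()

asyncio.run(main())
print("ok")
```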
Synchronous Case
In the case of a synchronous `main` function we enter the `else` branch at the bottom. `asyncio.run` then starts a new event loop, runs `self._post_init()` in it (which loads the tokenizer), and only returns once that has completed. The tokenizer is therefore available when `OpenAIServingChat` calls `self._load_chat_template()` (link) in its `__init__`.
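For contrast, a minimal standalone script (independent of vLLM; the names are stand-ins) shows that `asyncio.run` drives its coroutine to completion before returning, which is why the synchronous path works:

```python
import asyncio

done = False

async def post_init():
    # Stands in for the tokenizer loading done by _post_init().
    global done
    done = True

# asyncio.run starts a fresh event loop, runs the coroutine to
# completion, and only then returns -- so the side effect is visible.
asyncio.run(post_init())
assert done is True
print("post_init finished")
```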
Asynchronous Case
In the case of an asynchronous `main` coroutine there already is a running event loop. Consequently `event_loop.create_task(self._post_init())` is called, which merely schedules the tokenizer loading for some point in the future. However, we do not hit an `await` before `OpenAIServingChat` calls `self._load_chat_template()`, so the loop never gets a chance to actually load the tokenizer, and it is not there when `self._load_chat_template()` tries to access it.
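This scheduling gap can be reproduced without vLLM. The sketch below (hypothetical names) shows that a task created via `create_task` does not run until the creating coroutine reaches an `await`:

```python
import asyncio

ran = False

async def post_init():
    # Stands in for the deferred tokenizer loading.
    global ran
    ran = True

async def main():
    asyncio.get_running_loop().create_task(post_init())
    # The task is only scheduled; without an await it has not run yet.
    assert ran is False
    await asyncio.sleep(0)  # yield once so the loop can run the task
    assert ran is True

asyncio.run(main())
print("task ran only after the await")
```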
Possible solutions
I am not an expert in asyncio programming, so the only solution I have found so far is to make `_load_chat_template` in `OpenAIServingChat` async as well and replicate the whole event-loop/`create_task` logic from `OpenAIServing`'s `__init__` for the chat-template loading in the `__init__` of `OpenAIServingChat`. Experimentally that seems to work; however, it doesn't seem like a good solution, since as far as I know there is no guarantee on the order in which tasks are run by the event loop, so there could still be scenarios in which the error is triggered.
Edit: This does seem to be the only workable solution. To ensure things run in the correct order, `_load_chat_template` has to wait until the tokenizer is available, e.g.

```python
async def _load_chat_template(self, chat_template):
    while self.tokenizer is None:
        await asyncio.sleep(.01)
    ...
```
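Instead of polling, an `asyncio.Event` set at the end of `_post_init` would avoid the fixed sleep interval. The sketch below is a hypothetical stand-in, not vLLM code; `ServingSketch` and `tokenizer_ready` are invented names:

```python
import asyncio

class ServingSketch:
    """Hypothetical stand-in for OpenAIServing (not vLLM code)."""

    def __init__(self):
        self.tokenizer = None
        self.tokenizer_ready = asyncio.Event()  # invented attribute
        asyncio.get_running_loop().create_task(self._post_init())

    async def _post_init(self):
        self.tokenizer = object()   # stands in for loading the tokenizer
        self.tokenizer_ready.set()  # wake up any waiters

    async def _load_chat_template(self, chat_template):
        await self.tokenizer_ready.wait()  # no polling interval needed
        assert self.tokenizer is not None

async def main():
    serving = ServingSketch()
    await serving._load_chat_template(None)
    print("template loaded after tokenizer was ready")

asyncio.run(main())
```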
Additional Observation
Interestingly, the error is not triggered when using `engine_use_ray=True` or `workers_use_ray=True` with a synchronous `main` function. It appears that at the time `__init__` is called there is not yet a running event loop, so we again hit the working `else` branch.