
[Frontend][OpenAI] Add support for OpenAI tools calling #4656

Open · wants to merge 1 commit into base: main

Conversation


@Xwdit commented May 7, 2024

This PR is updated from #3237; it supports the latest version of vLLM (v0.4.2), adds more flexibility, and fixes some issues.

vllm/entrypoints/openai/cli_args.py (outdated review thread, resolved)
vllm/entrypoints/openai/api_server.py (2 outdated review threads, resolved)
@AaronFriel

@Xwdit thanks for taking my feedback!

@Xwdit force-pushed the tool_squashed branch 2 times, most recently from efbcf5f to eb9bc07 on May 8, 2024 at 23:38
Comment on lines +153 to +162
if stream:
    text_message = ""
    for chunk in response:
        if chunk.choices[0].finish_reason is not None:
            if chunk.choices[0].finish_reason == "tool_calls":
                tool_calls += chunk.choices[0].delta.tool_calls
                # print("TEST : %s" % chunk.choices[0].delta.tool_calls)
            break
        if chunk.choices[0].delta.content is not None:
            text_message += chunk.choices[0].delta.content
Contributor

For what it's worth, the better way to handle this is probably to process the response stream one chunk or token at a time. If you get a token indicating a tool call (such as <tool_call>) at the start of the response, then you want to wait for the entire response from the LLM so that you can invoke the tool. If you get a non-meta or non-control token (e.g. a normal streaming text chat response), then you probably want to start showing the streamed tokens to the user immediately, avoiding the latency of waiting for the whole response. But this is also an example, so I'm aware it doesn't need to be optimized.
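
As a rough sketch of that buffering approach (nothing here is from this PR - the <tool_call> tag, the response object, and the handler callbacks are all illustrative assumptions), the client loop might look something like this:

def consume_stream(response, handle_text_token, handle_tool_call):
    """Stream text to the user immediately, but buffer tool calls in full."""
    tag = "<tool_call>"  # hypothetical control token emitted by the model
    buffer = ""
    decided = False        # have we decided between plain text and a tool call?
    tool_call_mode = False
    for chunk in response:
        delta = chunk.choices[0].delta
        if delta.content is None:
            continue
        buffer += delta.content
        if not decided:
            head = buffer.lstrip()
            if head.startswith(tag):
                # Control token seen first: hold everything until the stream ends.
                decided, tool_call_mode = True, True
            elif len(head) >= len(tag) or not tag.startswith(head):
                # Enough text to rule out a tool call: flush and start streaming.
                decided = True
                handle_text_token(buffer)
        elif not tool_call_mode:
            handle_text_token(delta.content)
    if tool_call_mode:
        # The buffered response is the complete tool call payload.
        handle_tool_call(buffer)

This keeps first-token latency low for normal chat while still letting the client invoke the tool as soon as the full call arrives.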

@K-Mistele
Contributor

I have a couple of ideas to add that I can take a crack at this weekend, if you would be open to it!

  • It might be worthwhile to allow using Jinja templating for defining the tool call system prompt, to simplify it and allow more flexibility and customization
  • It might be worthwhile to allow a user to specify somehow whether they want guided generation ("outlines") to be used for tool calls. Some models (e.g. the Hermes 2 Pro models by Nous Research, which this PR seems to default to in terms of the default templates) don't require guided generation for tool output, and guided generation can introduce overhead if the output choices (tool calls) are complex JSON objects. Notes on this from the vLLM Discord here and here
  • The Hermes 2 models also use system prompts for the tool calls - this PR does not allow you to request/enforce that, but simply adds the tool call prompt to the first message regardless of its role. It is probably a good idea to check whether there is a system-role message: if so, add the tool call content to it, and if not, add it as a new system-role message

Comment on lines +121 to +124
elif isinstance(request.messages,
                List) and len(request.messages) >= 1:
    request.messages[0].content = request.messages[
        0].content.strip() + "\n\n" + text_inject
Contributor

This just appends the tool prompt to the first message, whatever its role. In almost every case the tool definitions should go in a system prompt message (except for chat formats that don't use system prompts, but those are a small minority), so instead of unconditionally editing the first message, it might be better to do something like this:

if isinstance(request.messages, str):
    # suggested change: for string prompts, put the tool usage before the prompt 
    request.messages = text_inject + '\n\n' + request.messages.strip() 
elif isinstance(request.messages, List) and len(request.messages) >= 1:
    # suggested change: check to see if the first message is a system message. If so, edit it.
    # otherwise, add a system prompt
    if request.messages[0].role == 'system':
        request.messages[0].content = request.messages[0].content.strip() + '\n\n' + text_inject
    else:
        request.messages = [{'role': 'system', 'content': text_inject}] + request.messages

@Xwdit
Author

Xwdit commented May 11, 2024

> I have a couple of ideas to add that I can take a crack at this weekend, if you would be open to it! […]

Hello, sorry I just saw this message; thank you very much for being willing to contribute to this PR and vLLM. If you are willing, I would greatly appreciate it.

@K-Mistele
Contributor

What is the preferred way to contribute to a PR? Should I create a new one with all the changes in this one and additions, or create a PR into your branch on your fork?

@Xwdit
Author

Xwdit commented May 13, 2024

> What is the preferred way to contribute to a PR? Should I create a new one with all the changes in this one and additions, or create a PR into your branch on your fork?

It's fine with me either way; do as you prefer.

@Xwdit force-pushed the tool_squashed branch 3 times, most recently from 1a5acc4 to 664642a on May 13, 2024 at 01:51
@Xwdit
Author

Xwdit commented May 13, 2024

> What is the preferred way to contribute to a PR? […]

I have rebased the PR onto the latest commit of vLLM; don't forget to sync before you start working ;)

@Xwdit force-pushed the tool_squashed branch 2 times, most recently from 3a2e8ef to 24e4c51 on May 14, 2024 at 15:17
@K-Mistele
Contributor

I have been thinking about this a lot and I have a few more thoughts:

1. Ensuring that tools are in a system prompt

I think that this would be a good change to make. I added it in the branch that I'm working on, and I will PR it into this branch on your fork @Xwdit - then, if/when it's merged, it will be reflected in this PR.

2. Guided Generation / Outlines

After digging through the code some more, I realized that guided output is only forced if you use tool_choice to name a specific tool, and isn't forced if you set it to auto - so my performance concerns shouldn't be an issue.
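
For illustration, this is the request shape that distinction refers to, using the openai Python client against an OpenAI-compatible server (the base URL, model name, and get_weather tool are made-up placeholders, not part of this PR):

from openai import OpenAI

# Placeholder endpoint and model, purely for illustration.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Naming a specific function via tool_choice is the case that forces guided
# generation of the tool call...
response = client.chat.completions.create(
    model="NousResearch/Hermes-2-Pro-Mistral-7B",
    messages=[{"role": "user", "content": "What's the weather in Tokyo?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)

# ...whereas tool_choice="auto" leaves decoding unconstrained.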

3. Tool Choice System Prompt Formatting

I was thinking more about how to handle tool usage and tool calling, and I think it's probably easiest and best to let people provide a Jinja template for the tool usage system prompt, with a tools parameter that receives the list of tools specified in the OpenAI API - their template can specify how to process it (a rough sketch follows below).
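
As a rough sketch of what that could look like (the template text, its wording, and the loading mechanism are all hypothetical - nothing here is taken from this PR):

from jinja2 import Template

# Hypothetical template text; in practice it would be loaded from a file passed
# via a CLI flag, analogous to --chat-template.
TOOL_TEMPLATE = Template(
    "You are a function-calling AI. You may invoke one of the following tools by\n"
    "replying with a <tool_call> JSON object:\n"
    "{% for tool in tools %}{{ tool | tojson }}\n{% endfor %}"
)

# `tools` receives the tool list from the OpenAI-style request body.
tools = [{"type": "function",
          "function": {"name": "get_weather",
                       "parameters": {"type": "object",
                                      "properties": {"city": {"type": "string"}}}}}]

system_prompt = TOOL_TEMPLATE.render(tools=tools)

The rendered string could then be merged into (or prepended as) the system message, as in the suggestion earlier in this thread.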

Familiarity

Users are in many cases already familiar with Jinja, since it is commonly used for chat message formatting, including in this framework (for OpenAI API-compatible chat completions, see the --chat-template CLI argument).

Additionally, many models on Hugging Face specify a Jinja chat template in their tokenizer_config.json - if tool-using open models proliferate, we might expect them to do the same for tool usage, in which case it would be easy to support in vLLM with minimal refactoring.

Flexibility & Future-Proof

Allowing vLLM users to specify their own Jinja template for the tool usage system prompt, similar to allowing them to specify a Jinja template for the chat prompt formatting, is flexible and allows for a wide variety of models and templates.

Users would not be locked into the assumptions that we currently make about the tool usage system prompt instruction, which are apparent in the structure of the tool_params argument that we currently use. It would be easy to add support for new open function-calling models that use a radically different format for the tool calling system prompt by adding a new Jinja template, instead of having to refactor core code.

Version Control

Using a Jinja template for function-calling system prompt formatting makes it easier for people to contribute additional templates for their favorite models and have those tracked in version control for easy use, similar to how popular chat model templates are tracked as Jinja templates (e.g. examples/template_chatml.jinja).

Please let me know what you think @Xwdit and whether you agree or disagree. I haven't finished the implementation yet, since it won't be a small amount of work, and I don't want to waste effort on it if you feel like Jinja templates for tool usage system prompts aren't the right path forward.

@K-Mistele
Contributor

If you agree that it's a good idea, I will be happy to take a crack at doing the jinja implementation myself :)

Co-Authored-By: FlorianJoncour <148003496+florianjoncour@users.noreply.github.com>
@Xwdit
Author

Xwdit commented May 17, 2024

> If you agree that it's a good idea, I will be happy to take a crack at doing the jinja implementation myself :)

Sorry for the late response 😢; all these ideas look great. Thank you very much for taking the time to work on this - I have no problem with any of them 😉

@MaximillionMai

Any updates on this PR?

@K-Mistele
Contributor

Still a work in progress; I will try to knock this out in the next few days - got busy with some life stuff.

@K-Mistele
Contributor

WIP on implementing the changes in #5649

@K-Mistele
Contributor

K-Mistele commented Sep 25, 2024

Hi @Xwdit, I think this is closed by #5649, which has now been merged :)
