
vLLM Model Provider implementation #44


Closed
wants to merge 15 commits

Conversation

@AhilanPonnusamy commented May 19, 2025

Description

Adds support for a vLLM model provider implementation.

Related Issues

#43

Documentation PR

[Link to related associated PR in the agent-docs repo]

Type of Change

  • New feature


Testing

Built a sample client to test the implementation; a rough sketch of such a client is shown after the checklist below.

  • hatch fmt --linter
  • hatch fmt --formatter
  • hatch test --all
  • Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli
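
For reference, here is a minimal sketch of the kind of sample client used for this manual testing. The import path and constructor arguments (host, model_id, sampling options) are assumptions for illustration and may not match the PR's exact API:

from strands import Agent
from strands.models.vllm import VLLMModel  # hypothetical import path for this PR

# Hypothetical constructor arguments; the PR's actual signature may differ.
model = VLLMModel(
    host="http://localhost:8000",
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    max_tokens=512,
    temperature=0.7,
)

agent = Agent(model=model)
agent("Summarize what vLLM does in one sentence.")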

Checklist

  • I have read the CONTRIBUTING document
  • I have added tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

vLLM Model Provider
Added vLLM model provider example
vLLM Model Provider test cases
vLLM Model Provider Integration tests
@AhilanPonnusamy requested a review from a team as a code owner on May 19, 2025 at 04:13
@AhilanPonnusamy (Author)

updated description

fixed vllm version to 0.8.5
class VLLMModel(Model):
    """vLLM model provider implementation.

    Assumes OpenAI-compatible vLLM server at `http://<host>/v1/completions`.
    """
@pgrayy (Member) commented May 19, 2025


Since vLLM assumes OpenAI compatibility (docs), I think it actually makes sense for us to generalize here: instead of creating a VLLMModel provider, we would create an OpenAIModel provider. This is something the Strands Agents team is already discussing, as we have other OpenAI-compatible providers that could all share the format_request and format_chunk logic.

With that said, I would suggest keeping this PR open for the time being as we further work out the details. Thank you for your contribution and patience.

@AhilanPonnusamy (Author)

Got it @pgrayy, I will wait for the updates. The one thing we will miss with this approach is support for the native vLLM APIs via the native vLLM endpoint, which is designed for direct use with the vLLM engine. Do you foresee that aspect becoming a vLLM model provider by itself? I see this VLLMModel provider gradually being built out to cover all aspects of vLLM, including the OpenAI-compatible endpoints.

@pgrayy (Member)

Here is the PR for the OpenAI model provider. This should work to connect to models served with vLLM.

Regarding your question, could you elaborate on what you mean by the native vLLM endpoint? It should still be OpenAI compatible, correct? Based on what I am reading in the docs, you can query vLLM using the openai client, which is what this new Strands OpenAI model provider uses under the hood.

@AhilanPonnusamy (Author) commented May 22, 2025

Thank you @pgrayy. Regarding vLLM endpoints, there are two servers provided by vLLM:

  1. openai.api_server provides an OpenAI-compatible REST API layer, mimicking the behavior and request/response format of OpenAI's API (/v1/chat/completions, etc.). This layer supports both standard and streaming completion modes.

  2. api_server, on the other hand, is vLLM's native API server, offering endpoints like /generate or /completion, designed specifically for internal or custom integrations. While api_server is more flexible and may expose additional low-level features, openai.api_server ensures broader compatibility with the OpenAI ecosystem.

You can run either by launching the corresponding entrypoint, vllm.entrypoints.openai.api_server or vllm.entrypoints.api_server, when starting the server.
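
To make the distinction concrete, here is a rough sketch of hitting each server with a plain HTTP client. The port, model name, and the native server's {"prompt": ...} payload are assumptions for illustration; check the vLLM docs for the exact request schema of the version you are running:

import requests

# OpenAI-compatible server (vllm.entrypoints.openai.api_server)
openai_style = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.2-3B-Instruct",  # example model name
        "messages": [{"role": "user", "content": "What is 2+2?"}],
    },
)
print(openai_style.json()["choices"][0]["message"]["content"])

# Native server (vllm.entrypoints.api_server); payload shape is an assumption
native_style = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "What is 2+2?", "max_tokens": 64},
)
print(native_style.json())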

On another note, I tried to test the OpenAI model provider from the PR against my local vLLM instance. I couldn't get past the API key stage even after setting the environment variable: "openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: empty. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}".

As you suggested, shall we keep this vLLM Model Provider PR open until we get the OpenAI one working, or would it be better to merge it for now and deprecate it later if it's deemed redundant?

@pgrayy (Member) commented May 22, 2025

Could you share more details on your testing? What happens when you try the following:

from strands.models.openai import OpenAIModel

openai_model = OpenAIModel({"api_key": "<YOUR_API_KEY>", "base_url": "<YOUR_MODEL_ENDPOINT>"}, model_id="<YOUR_MODEL_ID>")

api_key will need to be explicitly passed into the model provider unless you set the OPENAI_API_KEY environment variable.
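
For illustration, a small sketch of the environment-variable alternative (endpoint and model id are placeholders; this assumes the provider constructs its OpenAI client after the assignment):

import os

from strands.models.openai import OpenAIModel

# Equivalent to passing api_key explicitly in the client args.
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

openai_model = OpenAIModel(
    {"base_url": "<YOUR_MODEL_ENDPOINT>"},
    model_id="<YOUR_MODEL_ID>",
)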

@AhilanPonnusamy (Author) commented May 22, 2025

No luck @pgrayy, please find the details below.

Call:

OpenAIModel(
    api_key="abc123",
    base_url="http://localhost:8000",
    model_id="Qwen/Qwen3-4B",  # Qwen/Qwen3-8B
)

Error:

openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: abc123. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

Also, I think the way tool_calls are emitted will differ between model providers. I had to convert the tool_call to a function call with a different format to make it work for the vLLM model provider implementation, as you can see in my code.

@pgrayy (Member) commented May 23, 2025

You will need to pass in api_key to the client_args. Here are some steps I took for testing:

Set up a local vLLM server:

$ python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --max-model-len 4096

I set hermes as a tool parser based on instructions here. This seems to be the mechanism to make tool calls OpenAI compatible and so we shouldn't need any special handling in our code.

Next, I set up my agent script, passing api_key in the client_args.

from strands import Agent
from strands.models.openai import OpenAIModel
from strands_tools import calculator

model = OpenAIModel(
    # can also pass dict as first argument
    client_args={"api_key": "abc123", "base_url": "http://localhost:8000/v1"},
    # everything from this point is a kwarg though
    model_id="Qwen/Qwen2.5-7B",
)

agent = Agent(model=model, tools=[calculator])
agent("What is 2+2?")

Result:

$ python test_vllm.py
Tool #1: calculator
The result of 2 + 2 is 4.

For this to work, you'll also need to include changes in #97 (already merged into main).

As an alternative, you should also be able to get this working with the LiteLLMModel provider. The LiteLLM docs have a dedicated page on vllm (see here).
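
For completeness, a rough sketch of the LiteLLM route, assuming the Strands LiteLLMModel follows the same client_args/model_id pattern as OpenAIModel above and that LiteLLM's hosted_vllm/ prefix is used to route to a self-hosted server (see the LiteLLM vLLM page for the exact parameters):

from strands import Agent
from strands.models.litellm import LiteLLMModel  # import path assumed to mirror strands.models.openai
from strands_tools import calculator

model = LiteLLMModel(
    # client args assumed; LiteLLM expects an api_base pointing at the vLLM server
    client_args={"api_key": "abc123", "api_base": "http://localhost:8000/v1"},
    # the hosted_vllm/ prefix tells LiteLLM to treat the endpoint as an OpenAI-compatible vLLM server
    model_id="hosted_vllm/Qwen/Qwen2.5-7B-Instruct",
)

agent = Agent(model=model, tools=[calculator])
agent("What is 2+2?")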

fixed bugs
Adjusted to the newly updated code
Fixed for new code
Fixed tool usage error and updated comments
Updated test cases
Standardized assertion
Fixed the assert to check the correct tool
@AhilanPonnusamy (Author)

@pgrayy thank you for the details. I tested the latest openai.py implementation with #97. It works fine for the hermes template, but it fails for the llama3_json template with Llama models (which is what I have implemented in this PR). Hermes and llama3_json seem to be the most widely used templates, but there are a few others as well. If you still want to have only a shared OpenAI model provider implementation, I suggest integrating the llama3_json tool_call handling from this PR as well. However, if you would like to keep the vLLM model provider separate, the initial implementation could support the hermes and llama3_json templates, and future contributions could add the other templates and the native vLLM endpoint. We could update this PR to use the OpenAI model provider for hermes over the OpenAI-compatible endpoint, and use this implementation for llama3_json over the OpenAI-compatible endpoint. Hope this helps.
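
To illustrate the kind of divergence described above, here is a purely hypothetical sketch (not this PR's actual code) of mapping a llama3_json-style tool call, emitted as a bare JSON object in the text content, into an OpenAI-style tool_calls entry:

import json
import uuid

def extract_llama3_json_tool_call(text: str) -> dict | None:
    """Hypothetical helper: map a {"name": ..., "parameters": ...} blob emitted
    as plain text into an OpenAI-style tool_calls entry. The server-side
    --tool-call-parser llama3_json option does this far more robustly."""
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or "name" not in payload:
        return None
    return {
        "id": f"call_{uuid.uuid4().hex[:8]}",
        "type": "function",
        "function": {
            "name": payload["name"],
            "arguments": json.dumps(payload.get("parameters", {})),
        },
    }

print(extract_llama3_json_tool_call('{"name": "calculator", "parameters": {"expression": "2+2"}}'))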

@pgrayy (Member) commented May 27, 2025


Can you please elaborate on what failures you encountered using Llama? The following example worked for me:

Local server:

$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --max-model-len 4096

Script:

from strands import Agent
from strands.models.openai import OpenAIModel
from strands_tools import calculator

model = OpenAIModel(
    client_args={"api_key": "abc123", "base_url": "http://localhost:8000/v1"},
    model_id="meta-llama/Llama-3.2-3B-Instruct",
)

agent = Agent(model=model, tools=[calculator])
agent("What is 2+2?")

Result:

The calculator function was called with the expression 2+2. The response from the calculator function was a JSON object containing a content array with a text property and a status property. The text property contained the result of the calculation, which is 4, and the status property was success.

@AhilanPonnusamy (Author) commented May 27, 2025 via email

I have since dismantled the environment. However, I saw that the tool call was not invoked when I tried to use the calculate_time tool for Melbourne, Australia. It simply printed the tool as text inside <tool_call></tool_call> tags. Even in your output I do not see the Tool #1 tag that seems to be printed as part of tool handling in the Strands SDK. HTH, try with the calculate_time tool, say, which will clearly show whether the tool is being used or not.

@pgrayy (Member) commented May 27, 2025

The Tool #1 tag was printed to stdout. I only shared the final response. What you described seems to be a situation where the model itself returned the tool call in the text content block instead of in the tool_calls field. I have seen this happen with less powerful models. Sometimes if you repeat the call enough times, the model will generate a correct response. For more consistent results however, I would recommend using a newer/larger version of Llama.

In short, this is a problem with the model, not the model provider or vLLM.
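
A quick way to confirm which situation you are in is to query the vLLM server directly with the openai client and check whether message.tool_calls is populated or the call ended up in message.content. The port, model name, and tool schema below are placeholders for illustration:

from openai import OpenAI

client = OpenAI(api_key="abc123", base_url="http://localhost:8000/v1")

# Minimal tool schema mirroring a calculate_time-style tool.
tools = [{
    "type": "function",
    "function": {
        "name": "calculate_time",
        "description": "Get the current time for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "What time is it in Melbourne, Australia?"}],
    tools=tools,
)

message = response.choices[0].message
# A well-behaved model populates tool_calls; a weaker one may dump the call into content as text.
print("tool_calls:", message.tool_calls)
print("content:", message.content)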

@pgrayy (Member) commented May 29, 2025

After review, we decided that we will continue to encourage customers to use either the OpenAI or LiteLLM model providers to interact with models served by vLLM. If you have any more concerns or questions, please don't hesitate to raise an issue or start a discussion. Thank you for this discussion. We really appreciate your contributions.

@pgrayy closed this on May 29, 2025