
vLLM Model Provider implementation #44


Closed
wants to merge 15 commits

Conversation

@AhilanPonnusamy commented May 19, 2025

Description

Adds support for a vLLM model provider implementation.

Related Issues

#43

Documentation PR

[Link to related associated PR in the agent-docs repo]

Type of Change

  • New feature


Testing

Built a sample client to test the implementation; a rough sketch of such a client is shown after the checklist below.

  • hatch fmt --linter
  • hatch fmt --formatter
  • hatch test --all
  • Verify that the changes do not break functionality or introduce warnings in consuming repositories: agents-docs, agents-tools, agents-cli
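
For reference, here is a minimal sketch of the kind of sample client used for this manual testing. The import path and constructor arguments (host, model_id, sampling options) are assumptions for illustration and may not match the PR's exact API:

from strands import Agent
from strands.models.vllm import VLLMModel  # hypothetical import path for this PR

# Hypothetical constructor arguments; the PR's actual signature may differ.
model = VLLMModel(
    host="http://localhost:8000",
    model_id="meta-llama/Llama-3.2-3B-Instruct",
    max_tokens=512,
    temperature=0.7,
)

agent = Agent(model=model)
agent("Summarize what vLLM does in one sentence.")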

Checklist

  • I have read the CONTRIBUTING document
  • I have added tests that prove my fix is effective or my feature works
  • I have updated the documentation accordingly
  • I have added an appropriate example to the documentation to outline the feature
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

vLLM Model Provider
Added vLLM model provider example
vLLM Model Provider test cases
vLLM Model Provider Integration tests
@AhilanPonnusamy requested a review from a team as a code owner on May 19, 2025 at 04:13
@AhilanPonnusamy (Author)

updated description

fixed vllm version to 0.8.5
class VLLMModel(Model):
    """vLLM model provider implementation.

    Assumes OpenAI-compatible vLLM server at `http://<host>/v1/completions`.
    """
@pgrayy (Member) commented May 19, 2025


Since vLLM assumes OpenAI compatibility (docs), I think it actually makes sense for us to generalize here: instead of creating a VLLMModel provider, we would create an OpenAIModel provider. This is something the Strands Agents team is already discussing, as we have other OpenAI-compatible providers that could all share the format_request and format_chunk logic.

With that said, I would suggest keeping this PR open for the time being as we further work out the details. Thank you for your contribution and patience.

@AhilanPonnusamy (Author)

Got it @pgrayy, I will wait for the updates. The one thing we will miss with this approach is support for the native vLLM APIs via the native vLLM endpoint, which is designed for direct use with the vLLM engine. Do you foresee that aspect becoming a vLLM model provider by itself? I see this VLLMModel provider gradually being built out to cover all aspects of vLLM, including the OpenAI-compatible endpoints.

@pgrayy (Member)

Here is the PR for the OpenAI model provider. This should work to connect to models served with vLLM.

Regarding your question, could you elaborate on what you mean by the native vLLM endpoint? It should still be OpenAI compatible, correct? Based on what I am reading in the docs, you can query vLLM using the openai client, which is what this new Strands OpenAI model provider uses under the hood.

@AhilanPonnusamy (Author) commented May 22, 2025

Thank you @pgrayy. Regarding vLLM endpoints, there are two servers provided by vLLM:

  1. openai.api_server provides an OpenAI-compatible REST API layer, mimicking the behavior and request/response format of OpenAI's API (/v1/chat/completions, etc.). This layer supports both standard and streaming completion modes.

  2. api_server, on the other hand, is vLLM's native API server, offering endpoints like /generate or /completion, designed specifically for internal or custom integrations. While api_server is more flexible and may expose additional low-level features, openai.api_server ensures broader compatibility with the OpenAI ecosystem.

You can run either by launching the corresponding entrypoint, vllm.entrypoints.openai.api_server or vllm.entrypoints.api_server, when starting the server.
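
To make the distinction concrete, here is a rough sketch of hitting each server with a plain HTTP client. The port, model name, and the native server's {"prompt": ...} payload are assumptions for illustration; check the vLLM docs for the exact request schema of the version you are running:

import requests

# OpenAI-compatible server (vllm.entrypoints.openai.api_server)
openai_style = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "meta-llama/Llama-3.2-3B-Instruct",  # example model name
        "messages": [{"role": "user", "content": "What is 2+2?"}],
    },
)
print(openai_style.json()["choices"][0]["message"]["content"])

# Native server (vllm.entrypoints.api_server); payload shape is an assumption
native_style = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "What is 2+2?", "max_tokens": 64},
)
print(native_style.json())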

On another note, I tried to test the OpenAI model provider from the PR against my local vLLM instance. I couldn't get past the API key stage even after setting the environment variable: "openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: empty. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}".

As you suggested, shall we keep this vLLM Model Provider PR open until we get the OpenAI one working, or would it be better to merge it for now and deprecate it later if it's deemed redundant?

@pgrayy (Member) commented May 22, 2025

Could you share more details on your testing? What happens when you try the following:

from strands.models.openai import OpenAIModel

openai_model = OpenAIModel({"api_key": "<YOUR_API_KEY>", "base_url": "<YOUR_MODEL_ENDPOINT>"}, model_id="<YOUR_MODEL_ID>")

api_key will need to be explicitly passed into the model provider unless you set the OPENAI_API_KEY environment variable.
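
For illustration, a small sketch of the environment-variable alternative (endpoint and model id are placeholders; this assumes the provider constructs its OpenAI client after the assignment):

import os

from strands.models.openai import OpenAIModel

# Equivalent to passing api_key explicitly in the client args.
os.environ["OPENAI_API_KEY"] = "<YOUR_API_KEY>"

openai_model = OpenAIModel(
    {"base_url": "<YOUR_MODEL_ENDPOINT>"},
    model_id="<YOUR_MODEL_ID>",
)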

@AhilanPonnusamy (Author) commented May 22, 2025

No luck @pgrayy, please find the details below.

Call:

OpenAIModel(
    api_key="abc123",
    base_url="http://localhost:8000",
    model_id="Qwen/Qwen3-4B",  # Qwen/Qwen3-8B
)

Error:

openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: abc123. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}

Also, I think the way tool_calls are emitted will differ between model providers. I had to convert the tool_call to a function call with a different format to make it work for the vLLM model provider implementation, as you can see in my code.

@pgrayy (Member) commented May 23, 2025

You will need to pass in api_key to the client_args. Here are some steps I took for testing:

Set up a local vLLM server:

$ python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --max-model-len 4096

I set hermes as a tool parser based on instructions here. This seems to be the mechanism to make tool calls OpenAI compatible and so we shouldn't need any special handling in our code.

Next, I set up my agent script, passing api_key in the client_args.

from strands import Agent
from strands.models.openai import OpenAIModel
from strands_tools import calculator

model = OpenAIModel(
    # can also pass dict as first argument
    client_args={"api_key": "abc123", "base_url": "http://localhost:8000/v1"},
    # everything from this point is a kwarg though
    model_id="Qwen/Qwen2.5-7B",
)

agent = Agent(model=model, tools=[calculator])
agent("What is 2+2?")

Result:

$ python test_vllm.py
Tool #1: calculator
The result of 2 + 2 is 4.

For this to work, you'll also need to include changes in #97 (already merged into main).

As an alternative, you should also be able to get this working with the LiteLLMModel provider. The LiteLLM docs have a dedicated page on vllm (see here).
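
For completeness, a rough sketch of the LiteLLM route, assuming the Strands LiteLLMModel follows the same client_args/model_id pattern as OpenAIModel above and that LiteLLM's hosted_vllm/ prefix is used to route to a self-hosted server (see the LiteLLM vLLM page for the exact parameters):

from strands import Agent
from strands.models.litellm import LiteLLMModel  # import path assumed to mirror strands.models.openai
from strands_tools import calculator

model = LiteLLMModel(
    # client args assumed; LiteLLM expects an api_base pointing at the vLLM server
    client_args={"api_key": "abc123", "api_base": "http://localhost:8000/v1"},
    # the hosted_vllm/ prefix tells LiteLLM to treat the endpoint as an OpenAI-compatible vLLM server
    model_id="hosted_vllm/Qwen/Qwen2.5-7B-Instruct",
)

agent = Agent(model=model, tools=[calculator])
agent("What is 2+2?")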

fixed bugs
Adjusted to the newly updated code
Fixed for new code
Fixed tool usage error and updated comments
Updated test cases
Standardized assertion
Fixed the assert to check the correct tool
@AhilanPonnusamy (Author)

@pgrayy thank you for the details. I tested the latest openai.py implementation with #97. It works fine for the hermes template, but it fails for the llama3_json template with Llama models (which is what I have implemented in this PR). Hermes and llama3_json seem to be the most widely used templates, but there are a few others as well. If you still want to have only a shared OpenAI model provider implementation, I suggest integrating the llama3_json tool_call handling from this PR as well. However, if you would like to keep the vLLM model provider separate, the initial implementation could support the hermes and llama3_json templates, and future contributions could add the other templates and the native vLLM endpoint. We could update this PR to use the OpenAI model provider for hermes over the OpenAI-compatible endpoint, and use this implementation for llama3_json over the OpenAI-compatible endpoint. Hope this helps.
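
To illustrate the kind of divergence described above, here is a purely hypothetical sketch (not this PR's actual code) of mapping a llama3_json-style tool call, emitted as a bare JSON object in the text content, into an OpenAI-style tool_calls entry:

import json
import uuid

def extract_llama3_json_tool_call(text: str) -> dict | None:
    """Hypothetical helper: map a {"name": ..., "parameters": ...} blob emitted
    as plain text into an OpenAI-style tool_calls entry. The server-side
    --tool-call-parser llama3_json option does this far more robustly."""
    try:
        payload = json.loads(text.strip())
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or "name" not in payload:
        return None
    return {
        "id": f"call_{uuid.uuid4().hex[:8]}",
        "type": "function",
        "function": {
            "name": payload["name"],
            "arguments": json.dumps(payload.get("parameters", {})),
        },
    }

print(extract_llama3_json_tool_call('{"name": "calculator", "parameters": {"expression": "2+2"}}'))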

@pgrayy (Member) commented May 27, 2025


Can you please elaborate on what failures you encountered using Llama? The following example worked for me:

Local server:

$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --max-model-len 4096

Script:

from strands import Agent
from strands.models.openai import OpenAIModel
from strands_tools import calculator

model = OpenAIModel(
    client_args={"api_key": "abc123", "base_url": "http://localhost:8000/v1"},
    model_id="meta-llama/Llama-3.2-3B-Instruct",
)

agent = Agent(model=model, tools=[calculator])
agent("What is 2+2?")

Result:

The calculator function was called with the expression 2+2. The response from the calculator function was a JSON object containing a content array with a text property and a status property. The text property contained the result of the calculation, which is 4, and the status property was success.

@AhilanPonnusamy (Author) commented May 27, 2025 via email

I have since dismantled the environment. However, I saw that the tool call was not invoked when I tried to use the calculate_time tool for Melbourne, Australia. It simply printed the tool as text inside <tool_call></tool_call> tags. Even in your output I do not see the Tool #1 tag that seems to be printed as part of tool handling in the Strands SDK. HTH, try with the calculate_time tool, say, which will clearly show whether the tool is being used or not.

@pgrayy (Member) commented May 27, 2025

The Tool #1 tag was printed to stdout. I only shared the final response. What you described seems to be a situation where the model itself returned the tool call in the text content block instead of in the tool_calls field. I have seen this happen with less powerful models. Sometimes if you repeat the call enough times, the model will generate a correct response. For more consistent results however, I would recommend using a newer/larger version of Llama.

In short, this is a problem with the model, not the model provider or vLLM.
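
A quick way to confirm which situation you are in is to query the vLLM server directly with the openai client and check whether message.tool_calls is populated or the call ended up in message.content. The port, model name, and tool schema below are placeholders for illustration:

from openai import OpenAI

client = OpenAI(api_key="abc123", base_url="http://localhost:8000/v1")

# Minimal tool schema mirroring a calculate_time-style tool.
tools = [{
    "type": "function",
    "function": {
        "name": "calculate_time",
        "description": "Get the current time for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-3B-Instruct",
    messages=[{"role": "user", "content": "What time is it in Melbourne, Australia?"}],
    tools=tools,
)

message = response.choices[0].message
# A well-behaved model populates tool_calls; a weaker one may dump the call into content as text.
print("tool_calls:", message.tool_calls)
print("content:", message.content)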

@pgrayy (Member) commented May 29, 2025

After review, we decided that we will continue to encourage customers to use either the OpenAI or LiteLLM model providers to interact with models served by vLLM. If you have any more concerns or questions, please don't hesitate to raise an issue or start a discussion. Thank you for this discussion. We really appreciate your contributions.

@pgrayy closed this on May 29, 2025