vLLM Model Provider implementation #44
Conversation
vLLM Model Provider
Added vLLM model provider example
vLLM Model Provider test cases
vLLM Model Provider Integration tests
updated description
fixed vllm version to 0.8.5
src/strands/models/vllm.py (outdated)
```python
class VLLMModel(Model):
    """vLLM model provider implementation.

    Assumes OpenAI-compatible vLLM server at `http://<host>/v1/completions`.
```
Since vLLM assumes OpenAI compatibility (docs), I think it actually makes sense for us to generalize here: instead of creating a `VLLMModel` provider, we create an `OpenAIModel` provider. This is something the Strands Agents team is already discussing, as we have other OpenAI-compatible providers that could all share the `format_request` and `format_chunk` logic.

With that said, I would suggest keeping this PR open for the time being as we further work out the details. Thank you for your contribution and patience.
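As a rough illustration of that idea, a shared base could own the request/chunk formatting while concrete providers only differ in how they construct their client. The sketch below is hypothetical; the class name and method bodies are assumptions, not the actual Strands SDK code:

```python
# Hypothetical sketch only: how OpenAI-compatible providers might share
# format_request/format_chunk logic. Names and bodies are assumptions,
# not the actual Strands SDK implementation.
from typing import Any


class OpenAICompatibleModel:
    """Base class for providers that speak the OpenAI chat completions API."""

    def __init__(self, client_args: dict[str, Any], model_id: str) -> None:
        self.client_args = client_args
        self.model_id = model_id

    def format_request(
        self,
        messages: list[dict[str, Any]],
        tools: list[dict[str, Any]] | None = None,
    ) -> dict[str, Any]:
        # One request body shared by vLLM, OpenAI, and other compatible servers.
        request: dict[str, Any] = {"model": self.model_id, "messages": messages, "stream": True}
        if tools:
            request["tools"] = tools
        return request

    def format_chunk(self, event: dict[str, Any]) -> dict[str, Any]:
        # Translate an OpenAI-style streaming chunk into a provider-agnostic event.
        raise NotImplementedError
```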
Got it @pgrayy, will wait for the updates. The one thing we will miss out on with this approach is support for the native vLLM APIs via the native vLLM endpoint, which is designed for direct use with the vLLM engine. Do you foresee just that aspect being a vLLM model provider by itself? I see this `VLLMModel` provider gradually being built out to cover all aspects of vLLM, including the OpenAI-compatible endpoints.
Here is the PR for the OpenAI model provider. This should work to connect to models served with vLLM.
Regarding your question, could you elaborate on what you mean by native vLLM endpoint? It should still be OpenAI compatible, correct? Based on what I am reading in the docs, you can query vLLM using the `openai` client, which is what this new Strands OpenAI model provider uses under the hood.
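For reference, a minimal sketch of querying a local vLLM server directly with the `openai` client; the base URL, key, and model name are placeholders, and by default vLLM accepts any non-empty key unless it was launched with `--api-key`:

```python
# Minimal sketch: query an OpenAI-compatible vLLM server with the openai client.
# URL, key, and model name are placeholders for a local deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
```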
Thank you @pgrayy. Regarding vLLM endpoints, there are two endpoints provided by vLLM.

- `openai.api_server` provides an OpenAI-compatible REST API layer, mimicking the behavior and request/response format of OpenAI's API (`/v1/chat/completions`, etc.). This layer supports both standard and streaming completion modes.
- `api_server`, on the other hand, is vLLM's native API server, offering endpoints like `/generate` or `/completion`, designed specifically for internal or custom integrations. While `api_server` is more flexible and may expose additional low-level features, `openai.api_server` ensures broader compatibility with the OpenAI ecosystem.

You can run either by launching the corresponding entrypoint module (`vllm.entrypoints.openai.api_server` or `vllm.entrypoints.api_server`) when starting the server.
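To make the difference concrete from a client's perspective, here is a hedged sketch; the endpoints and payload shapes are assumptions for a default local deployment, and the native `/generate` schema in particular has varied across vLLM versions:

```python
# Sketch only: the two vLLM HTTP interfaces from a client's point of view.
# Endpoints and payload shapes are assumptions for a default local deployment.
import requests

# OpenAI-compatible server (vllm.entrypoints.openai.api_server)
openai_style = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "messages": [{"role": "user", "content": "What is 2+2?"}],
    },
)

# Native server (vllm.entrypoints.api_server)
native_style = requests.post(
    "http://localhost:8000/generate",
    json={"prompt": "What is 2+2?", "max_tokens": 32},
)

print(openai_style.status_code, native_style.status_code)
```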
On another note, I tried to test the OpenAI model provider from the PR against my local vLLM instance. I couldn't get past the API key stage even after setting the environment variable: "openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: empty. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}".
As you suggested, shall we keep this vLLM Model Provider PR open until we get the OpenAI one working, or would it be better to merge it for now and deprecate it later if it's deemed redundant?
Could you share more details on your testing? What happens when you try the following:
```python
from strands.models.openai import OpenAIModel

openai_model = OpenAIModel(
    {"api_key": "<YOUR_API_KEY>", "base_url": "<YOUR_MODEL_ENDPOINT>"},
    model_id="<YOUR_MODEL_ID>",
)
```
`api_key` will need to be explicitly passed into the model provider unless you set the `OPENAI_API_KEY` environment variable.
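For illustration, assuming the `OpenAIModel` signature shown above and a local vLLM server that does not validate keys, either of the following should get past the client's key check (values are placeholders):

```python
# Sketch: two equivalent ways to satisfy the openai client's key requirement
# against a local vLLM server. Key, URL, and model name are placeholders.
import os

from strands.models.openai import OpenAIModel

# Option 1: pass the key explicitly to the provider.
openai_model = OpenAIModel(
    {"api_key": "EMPTY", "base_url": "http://localhost:8000/v1"},
    model_id="Qwen/Qwen2.5-7B-Instruct",
)

# Option 2: omit it and rely on the environment variable instead.
os.environ["OPENAI_API_KEY"] = "EMPTY"
openai_model = OpenAIModel(
    {"base_url": "http://localhost:8000/v1"},
    model_id="Qwen/Qwen2.5-7B-Instruct",
)
```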
No luck @pgrayy, please find below.

Call:

```python
OpenAIModel(
    api_key="abc123",
    base_url="http://localhost:8000",
    model_id="Qwen/Qwen3-4B",  # Qwen/Qwen3-8B
)
```

Error:

```
openai.AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: abc123. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
```
Also, I think the way tool_calls are emitted will differ between model providers. I had to convert the tool_call to a function call with a different format to make it work in the vLLM model provider implementation, as you can see in my code.
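For readers following along, the kind of conversion being described looks roughly like the following hypothetical sketch; the function and field names are illustrative, not the exact code in this PR's vllm.py:

```python
# Hypothetical sketch of normalizing a model-specific tool call into the
# OpenAI-style tool_calls format; field names are illustrative only.
import json
from typing import Any


def to_openai_tool_call(raw_call: dict[str, Any], call_id: str) -> dict[str, Any]:
    """Convert a parsed <tool_call> payload into an OpenAI-style function call."""
    return {
        "id": call_id,
        "type": "function",
        "function": {
            "name": raw_call["name"],
            # OpenAI expects arguments as a JSON-encoded string.
            "arguments": json.dumps(raw_call.get("arguments", {})),
        },
    }


print(to_openai_tool_call({"name": "calculator", "arguments": {"expression": "2+2"}}, "call_0"))
```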
You will need to pass in `api_key` to the `client_args`. Here are some steps I took for testing:
Set up a local vLLM server:

```bash
$ python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser hermes \
    --max-model-len 4096
```
I set `hermes` as a tool parser based on instructions here. This seems to be the mechanism to make tool calls OpenAI compatible, and so we shouldn't need any special handling in our code.
Next, I set up my agent script, passing in `api_key` to the `client_args`.
```python
from strands import Agent
from strands.models.openai import OpenAIModel
from strands_tools import calculator

model = OpenAIModel(
    # can also pass dict as first argument
    client_args={"api_key": "abc123", "base_url": "http://localhost:8000/v1"},
    # everything from this point is a kwarg though
    model_id="Qwen/Qwen2.5-7B",
)
agent = Agent(model=model, tools=[calculator])
agent("What is 2+2?")
```
Result:

```
$ python test_vllm.py

Tool #1: calculator
The result of 2 + 2 is 4.
```
For this to work, you'll also need to include changes in #97 (already merged into main).
As an alternative, you should also be able to get this working with the LiteLLMModel provider. The LiteLLM docs have a dedicated page on vllm (see here).
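For completeness, a hedged sketch of the LiteLLM route against a local vLLM server; the `hosted_vllm/` prefix and `api_base` argument follow LiteLLM's vLLM page, while the model name, port, and key are placeholders:

```python
# Sketch: calling a local vLLM server through LiteLLM's vLLM route.
# Model name, port, and key are placeholders for a default local deployment.
import litellm

response = litellm.completion(
    model="hosted_vllm/Qwen/Qwen2.5-7B-Instruct",
    api_base="http://localhost:8000/v1",
    api_key="EMPTY",
    messages=[{"role": "user", "content": "What is 2+2?"}],
)
print(response.choices[0].message.content)
```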
fixed bugs
Adjusted to the newly updated code
Fixed for new code
Fixed tool usage error fix and updated comments
Updated test cases
standardized assertion
fixed the assert to check the correct tool
@pgrayy thank you for the details. I tested the latest openai.py implementation with #97. It works fine for the hermes template, but it fails for the llama3_json template with Llama models (which is what I have implemented in this PR). hermes and llama3_json seem to be the most widely used templates, but there are a few others as well. If you still want to have only a shared OpenAI model provider implementation, I suggest integrating the llama3_json tool_call handling from this PR into it as well. However, if you would like to keep the vLLM model provider separate, its initial implementation could support the hermes and llama3_json templates and let future contributions add other templates and the native vLLM endpoint. We could then update this PR's code to use the OpenAI model provider for hermes on the OpenAI-compatible endpoint, and use this implementation for llama3_json. Hope this helps.
Can you please elaborate on what failures you encountered using Llama? The following example worked for me:

Local server:

```bash
$ python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.2-3B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --enable-auto-tool-choice \
    --tool-call-parser llama3_json \
    --max-model-len 4096
```

Script:

```python
from strands import Agent
from strands.models.openai import OpenAIModel
from strands_tools import calculator

model = OpenAIModel(
    client_args={"api_key": "abc123", "base_url": "http://localhost:8000/v1"},
    model_id="meta-llama/Llama-3.2-3B-Instruct",
)
agent = Agent(model=model, tools=[calculator])
agent("What is 2+2?")
```

Result:

The calculator function was called with the expression 2+2. The response from the calculator function was a JSON object containing a content array with a text property and a status property. The text property contained the result of the calculation, which is 4, and the status property was success.
I have since dismantled the environment. However, I saw that the tool call was not invoked when I tried to use the calculate_time tool for Melbourne, Australia. It simply printed the tool call as text inside a <tool_call></tool_call> tag. Even in your output I do not see the Tool #1 tag that seems to be printed as part of tool handling in the Strands SDK. HTH, try with, say, the calculate_time tool, which will clearly show whether the tool is being used or not.
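One way to make that check unambiguous is to register a small custom tool with an obvious side effect and confirm it actually runs. A sketch, assuming the `@tool` decorator exposed by the Strands SDK and a local Llama server as above; the `current_time` tool itself is made up for testing:

```python
# Sketch: a throwaway tool whose printed side effect makes it obvious whether
# the model truly invoked it or only echoed <tool_call> text. The tool and
# endpoint values are hypothetical.
from strands import Agent, tool
from strands.models.openai import OpenAIModel


@tool
def current_time(city: str) -> str:
    """Return a canned timestamp for the given city."""
    print(f"TOOL EXECUTED for {city}")  # visible proof of invocation
    return f"It is 10:00 AM in {city}."


model = OpenAIModel(
    client_args={"api_key": "EMPTY", "base_url": "http://localhost:8000/v1"},
    model_id="meta-llama/Llama-3.2-3B-Instruct",
)
agent = Agent(model=model, tools=[current_time])
agent("What time is it in Melbourne, Australia?")
```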
In short, this is a problem with the model, not the model provider or vLLM.
After review, we decided that we will continue to encourage customers to use either the OpenAI or LiteLLM model providers to interact with models served by vLLM. If you have any more concerns or questions, please don't hesitate to raise an issue or start a discussion. Thank you for this discussion. We really appreciate your contributions.
Description
Adds a vLLM model provider implementation as a new feature.
Related Issues
#43
Documentation PR
[Link to related associated PR in the agent-docs repo]
Type of Change
[Choose one of the above types of changes]
Testing
Built a sample client to test the implementation
hatch fmt --linter
hatch fmt --formatter
hatch test --all
Checklist
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.