feat: tool calling for llama models #155
Conversation
Pull Request Overview
This PR introduces Python code execution capabilities as a tool-call feature for LLaMA models, leveraging the e2b Code Interpreter for secure sandboxed execution. The implementation integrates seamlessly into the existing chat completion pipeline and provides comprehensive support for both structured and JSON-encoded tool calls.
- Adds secure Python code execution via e2b Code Interpreter sandbox
- Implements comprehensive tool workflow routing and execution pipeline
- Integrates tool calling into the main chat completion flow with proper token usage aggregation (a sketch of this flow follows below)
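A minimal sketch of the round trip described above, assuming an OpenAI-compatible async client; `run_tool_workflow` and `execute_python_code` are illustrative names, not the PR's actual API:

```python
async def run_tool_workflow(client, request_kwargs):
    # Initial completion: the model may or may not request a tool call.
    first = await client.chat.completions.create(**request_kwargs)
    message = first.choices[0].message
    if not message.tool_calls:
        return first

    # Echo the assistant's tool calls, run each tool, and append the outputs.
    messages = list(request_kwargs["messages"])
    messages.append({
        "role": "assistant",
        "tool_calls": [call.model_dump() for call in message.tool_calls],
    })
    for call in message.tool_calls:
        # Hypothetical sandbox helper; receives the JSON-encoded arguments.
        output = await execute_python_code(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": output})

    # Follow-up completion that can see the tool outputs.
    return await client.chat.completions.create(**{**request_kwargs, "messages": messages})
```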
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `pyproject.toml` | Adds e2b-code-interpreter dependency for sandboxed code execution |
| `code_execution.py` | Implements async-safe Python code execution in e2b sandbox |
| `tool_router.py` | Provides tool call routing, execution, and workflow management |
| `api_model.py` | Extends MessageAdapter with tool message creation methods |
| `__init__.py` | Exports new tool-related types for external use |
| `private.py` | Integrates tool workflow into chat completion pipeline |
| `test_tool_router.py` | Unit tests for tool routing and execution functionality |
| `test_code_execution.py` | E2E test validating SHA-256 computation via tool calling |
…nd enhance llama3.1 tool template (force-pushed from 49de627 to c56da7d)
```yaml
volumes:
  - hugging_face_models:/root/.cache/huggingface  # cache models
  - ./vllm_templates:/opt/vllm/templates:ro
```
Given that it is not easy to put files into nilCC, I believe we should embed this one in the vLLM image. This simplifies things overall and makes sure it is already there; in principle, it will not change in the future.
```yaml
volumes:
  - hugging_face_models:/root/.cache/huggingface  # cache models
  - ./vllm_templates:/opt/vllm/templates:ro
```
Same comment here: given that it is not easy to put files into nilCC, I believe we should embed this one in the vLLM image. This simplifies things overall and makes sure it is already there; in principle, it will not change in the future.
docker/vllm.Dockerfile (outdated)
```dockerfile
apt-get install build-essential -y && \
    pip install uv && \
    uv sync && \
    pip install --no-cache-dir --force-reinstall "vllm==0.10.1" && \
```
I know this comes from your error, but I don't think this should remain here.
vLLM is a large dependency, and we use the vLLM base image precisely to avoid having to install it every time. Also, this creates another version dependency outside the "title" one.
I would remove it and verify things still work. If they don't, we can research why the version is an issue, but reinstalling here should not be the solution for now.
```python
    )
    content: Optional[str] = adapter.extract_text()
except Exception:
    content = getattr(response_message, "content", None)  # type: ignore[assignment]
```
I don't particularly like this way of using getattr, because the data types are ignored; that is why you need # type: ignore at the end.
I know it is cumbersome, but static typing helps spot bugs ahead of time for the most part.
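A small sketch of the typed alternative being hinted at, assuming `response_message` may be an openai `ChatCompletionMessage`; the `isinstance` narrowing replaces `getattr` plus `# type: ignore`:

```python
from typing import Optional

from openai.types.chat import ChatCompletionMessage


def safe_content(response_message: object) -> Optional[str]:
    # Narrow the type explicitly so the checker can verify the access;
    # no getattr and no "type: ignore" needed.
    if isinstance(response_message, ChatCompletionMessage):
        return response_message.content
    return None
```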
```python
    return msgs


def extract_tool_calls_from_response_message(
```
I am just wondering whether the return type of vLLM allows for this, or whether it sometimes returns JSON as you do here.
If there is no JSON and the ChatCompletionMessage already includes strings, I believe part of what you're doing here could be simplified to:
- Take the original response message and dump the message model, which produces a Python dictionary
- Try to validate that dictionary into the destination type ChatCompletionMessageToolCall
Maybe not right for this use case, as I am not 100% sure what the schema for the outputs of Llama tool calls in vLLM is.
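A sketch of that dump-and-revalidate idea, assuming the openai Python types and Pydantic v2 (`TypeAdapter` also handles the case where `ChatCompletionMessageToolCall` is a union alias in newer openai versions):

```python
from pydantic import TypeAdapter
from openai.types.chat import ChatCompletionMessage, ChatCompletionMessageToolCall

_TOOL_CALL = TypeAdapter(ChatCompletionMessageToolCall)


def tool_calls_from_message(message: ChatCompletionMessage) -> list:
    # Dump the message to a plain dict, then re-validate each tool call entry
    # into the destination type.
    dumped = message.model_dump()
    return [_TOOL_CALL.validate_python(tc) for tc in (dumped.get("tool_calls") or [])]
```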
In the best case, when the model detects a tool_call, vLLM will directly receive the tool populated as JSON in the content field (for Llama-8b this is the case, but it may be model-dependent). In most cases the vLLM integrated parser will handle the parsing correctly and return a tool like this:
```python
ChatCompletion(id='chatcmpl-82aa4a5848a94ae3b422ec58bf7d6578', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-7d487ca08dcc4806b2ac3817d40b5815', function=Function(arguments='{"code": "import hashlib...print(hash)"}', name='execute_python'), type='function')], reasoning_content=None)...), prompt_logprobs=None, kv_transfer_params=None)
```
But in some cases, depending on the model's ability, the vLLM parser may fail, and then the fallback function is there to parse the tool returned as JSON by the model directly. This is the case for the Llama-8b model, which will write Python code with semicolons; the vLLM parser, for some reason, splits the tool on semicolons, and the parsing then fails because the JSON is incorrectly formatted (here is the vLLM parser for Llama).
(Sorry for the long reply, and I hope I addressed your point in the review correctly.)
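An illustrative sketch of such a fallback, assuming the model emits the tool call as raw JSON in the content field; the key names ("name", "parameters"/"arguments") are assumptions about the Llama tool-call format, not taken from the PR:

```python
import json
from typing import Any, Optional


def parse_tool_call_from_content(content: Optional[str]) -> Optional[dict[str, Any]]:
    # Fallback path: the vLLM parser failed, so try to read the tool call
    # directly from the raw JSON the model wrote into the content field.
    if not content:
        return None
    try:
        payload = json.loads(content.strip())
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict) and "name" in payload:
        return {
            "name": payload["name"],
            "arguments": payload.get("parameters") or payload.get("arguments") or {},
        }
    return None
```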
```python
}

logger.info("[tools] performing follow-up completion with tool outputs")
logger.info(f"[tools] request_kwargs messages: {request_kwargs['messages']}")
```
We shouldn't log any messages. This can be used for testing, but shouldn't be logged.
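One privacy-preserving alternative, assuming the messages are plain dicts, is to log only the shape of the request:

```python
# Log request shape only; never the message contents themselves.
logger.info(
    "[tools] performing follow-up completion with %d messages (roles: %s)",
    len(request_kwargs["messages"]),
    [m.get("role") for m in request_kwargs["messages"]],
)
```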
```python
}
if req.tools:
    request_kwargs["tools"] = req.tools  # type: ignore
    request_kwargs["tool_choice"] = getattr(req, "tool_choice", "auto")
```
More getattr. This is not amazing.
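One way to avoid it, sketched here with a hypothetical `ChatRequest` standing in for the project's request model: declare `tool_choice` as a typed field with a default so plain attribute access works:

```python
from typing import Literal, Optional

from pydantic import BaseModel


class ChatRequest(BaseModel):
    # Declaring the field with a default removes the need for
    # getattr(req, "tool_choice", "auto") at the call site.
    tools: Optional[list[dict]] = None
    tool_choice: Literal["auto", "none", "required"] = "auto"
```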
```python
)
# Update token usage

if agg_prompt_tokens or agg_completion_tokens:
```
How can this fail and need to be inside a try block?
This is not right, if I am understanding it correctly: you're basically overwriting the usage that comes from the final response with the agg_<prompt|completion|total>_tokens values, which means the tokens used in your second call are not being counted at all.
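A sketch of the accumulation being asked for, assuming two ChatCompletion-like responses with a `usage` attribute (names are illustrative):

```python
def merged_usage(first, final) -> dict[str, int]:
    # Sum usage across the initial and follow-up completions so the second
    # call's tokens are counted instead of being overwritten.
    prompt = completion = 0
    for response in (first, final):
        if response.usage is not None:
            prompt += response.usage.prompt_tokens
            completion += response.usage.completion_tokens
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }
```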
nilai-models/pyproject.toml (outdated)
```toml
dependencies = [
    "httpx>=0.27.2",
    "nilai-common",
    "vllm==0.10.1",
```
Same issue as above: you're adding vllm to the dependencies when it is not needed. vLLM comes through the Docker image, and this library only adds a "probe" to connect the models to the frontend; that's why we haven't added vllm here before.
```python
# Ensure symbol is referenced to satisfy linters (and re-export via __init__)
_CHAT_TOOL_FUNCTION_EXPORT = ChatToolFunction
```
I would prefer importing the ChatToolFunction object directly in __init__.py rather than having this trick; it doesn't look good in the code.
Another option could be to import Function and define ChatToolFunction: TypeAlias = Function in this file. That looks better in every sense.
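A sketch of that second option; the exact import path for `Function` is an assumption and may vary with the openai version:

```python
from typing import TypeAlias

# Import path is an assumption; adjust to wherever Function lives in your
# openai/vLLM types.
from openai.types.chat.chat_completion_message_tool_call import Function

# Alias the upstream type instead of keeping a dummy re-export assignment
# just to satisfy the linter.
ChatToolFunction: TypeAlias = Function
```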
jcabrero left a comment
Great job. I think this PR looks extremely good, a huge improvement over the previous one.
Please check the comments and fix them where they make sense (especially the vllm imports, as they can significantly grow CI times and image size).
One last, more ideological comment:
- I agree with doing just a single tool call, but more and more it feels like we should allow the models to become partial agents. Maybe a single tool call is too little compared to what other models can do.
…n the vllm.Dockerfile (force-pushed from 9f49aaa to 9262e82)
This pull request introduces support for Python code execution as a tool-call, leveraging the e2b Code Interpreter for secure sandboxed execution. The changes include new handler modules for routing and executing code tool-calls, updates to core data models to support tool-call messages, and integration of the tool workflow into the main chat completion pipeline. Additionally, an end-to-end test validates the new code execution capability.
Tool-call and code execution support:
- Adds e2b-code-interpreter as a dependency to enable secure Python code execution in a sandbox.
- Adds the code_execution.py handler, providing async-safe execution of Python code in a sandbox and returning output for tool-calls.
- Adds tool_router.py to route, execute, and process tool-calls, extract tool-calls from model responses, and manage the tool workflow including follow-up completions with tool outputs.

Integration into chat workflow:
End-to-end testing:
For testing the endpoint:
```json
{
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. When a user asks a question that requires calculation, use the execute_python tool to find the answer. After the tool provides its result, you must use that result to formulate a clear, final answer to the user's original question. Do not include any code or JSON in your final response."
    },
    {
      "role": "user",
      "content": "Run a python code for computing the squared sum from zero to ten."
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "execute_python",
        "description": "Executes a snippet of Python code in a secure sandbox and returns the standard output.",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {
              "type": "string",
              "description": "The Python code to be executed."
            }
          },
          "required": ["code"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
```
The larger the model, the better its performance.
A well-crafted system prompt is critical, especially when it clearly specifies the available tools and how they should be invoked. The more precise the prompt, the more effectively the model can select the right tool and return accurate results.
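A hedged usage sketch for sending the request above through an OpenAI-compatible client; the base URL and API key are placeholders, not values from the PR:

```python
from openai import OpenAI

# Placeholder endpoint and key: point these at the deployed OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="<your-key>")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. When a user asks a question that "
                "requires calculation, use the execute_python tool to find the answer."
            ),
        },
        {
            "role": "user",
            "content": "Run a python code for computing the squared sum from zero to ten.",
        },
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "execute_python",
                "description": "Executes a snippet of Python code in a secure sandbox and returns the standard output.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "The Python code to be executed."}
                    },
                    "required": ["code"],
                },
            },
        }
    ],
    tool_choice="auto",
)
print(response.choices[0].message.content)
```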