
Conversation

@blefo (Member) commented Sep 25, 2025

This pull request introduces support for Python code execution as a tool-call, leveraging the e2b Code Interpreter for secure sandboxed execution. The changes include new handler modules for routing and executing code tool-calls, updates to core data models to support tool-call messages, and integration of the tool workflow into the main chat completion pipeline. Additionally, an end-to-end test validates the new code execution capability.

Tool-call and code execution support:

  • Added e2b-code-interpreter as a dependency to enable secure Python code execution in a sandbox.
  • Implemented code_execution.py handler, providing async-safe execution of Python code in a sandbox and returning output for tool-calls.
  • Created tool_router.py to route, execute, and process tool-calls, extract tool-calls from model responses, and manage the tool workflow including follow-up completions with tool outputs.

Integration into chat workflow:

  • Integrated the tool workflow into the chat completion pipeline, ensuring tool-calls are executed and their outputs are included in the final model response; token usage is aggregated and updated accordingly.

End-to-end testing:

  • Added an E2E test to validate code execution via tool-call, verifying correct output for a Python SHA-256 computation using the new tool interface.

For testing the endpoint:

{
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. When a user asks a question that requires calculation, use the execute_python tool to find the answer. After the tool provides its result, you must use that result to formulate a clear, final answer to the user's original question. Do not include any code or JSON in your final response."
    },
    {
      "role": "user",
      "content": "Run a python code for computing the squared sum from zero to ten."
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "execute_python",
        "description": "Executes a snippet of Python code in a secure sandbox and returns the standard output.",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {
              "type": "string",
              "description": "The Python code to be executed."
            }
          },
          "required": ["code"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
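
For reference, a minimal sketch of sending this body with httpx; the base URL, path, API key, and the local request.json filename are placeholder assumptions, not the actual deployment values:

# Sketch only: BASE_URL, the /chat/completions path, the API key, and
# request.json (the JSON body shown above, saved locally) are assumptions.
import json
import httpx

BASE_URL = "http://localhost:8000/v1"  # hypothetical endpoint
API_KEY = "sk-placeholder"

with open("request.json") as f:
    payload = json.load(f)

response = httpx.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=120.0,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])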

The larger the model, the better its tool-calling performance.

A well-crafted system prompt is critical, especially when it clearly specifies the available tools and how they should be invoked. The more precise the prompt, the more effectively the model can select the right tool and return accurate results.

@blefo blefo requested a review from Copilot September 25, 2025 15:51
Copilot AI (Contributor) left a comment
Pull Request Overview

This PR introduces Python code execution capabilities as a tool-call feature for LLaMA models, leveraging the e2b Code Interpreter for secure sandboxed execution. The implementation integrates seamlessly into the existing chat completion pipeline and provides comprehensive support for both structured and JSON-encoded tool calls.

  • Adds secure Python code execution via e2b Code Interpreter sandbox
  • Implements comprehensive tool workflow routing and execution pipeline
  • Integrates tool calling into the main chat completion flow with proper token usage aggregation

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.

Summary per file:

  • pyproject.toml: Adds the e2b-code-interpreter dependency for sandboxed code execution
  • code_execution.py: Implements async-safe Python code execution in the e2b sandbox
  • tool_router.py: Provides tool-call routing, execution, and workflow management
  • api_model.py: Extends MessageAdapter with tool message creation methods
  • __init__.py: Exports new tool-related types for external use
  • private.py: Integrates the tool workflow into the chat completion pipeline
  • test_tool_router.py: Unit tests for tool routing and execution functionality
  • test_code_execution.py: E2E test validating SHA-256 computation via tool calling


@blefo blefo force-pushed the feat-tool-calling-llama-focused branch from 49de627 to c56da7d Compare September 29, 2025 14:09
@blefo blefo requested a review from jcabrero October 1, 2025 10:23
volumes:
- hugging_face_models:/root/.cache/huggingface # cache models
- hugging_face_models:/root/.cache/huggingface
- ./vllm_templates:/opt/vllm/templates:ro
Member commented:

Given that it is not as easy to put files into nilCC, I believe we should embed this one in the vLLM image. This simplifies things overall and makes sure it is already there; in principle, it will not change in the future.

volumes:
- hugging_face_models:/root/.cache/huggingface # cache models
- hugging_face_models:/root/.cache/huggingface
- ./vllm_templates:/opt/vllm/templates:ro
Member commented:

Same comment here: given that it is not as easy to put files into nilCC, I believe we should embed this one in the vLLM image. This simplifies things overall and makes sure it is already there; in principle, it will not change in the future.

apt-get install build-essential -y && \
pip install uv && \
uv sync && \
pip install --no-cache-dir --force-reinstall "vllm==0.10.1" && \
Member commented:

I know this comes from your error, but I don't think this should remain here.

vLLM is a large dependency, and we use the vLLM base image in order to avoid having to install it every time. Also, this creates another version dependency outside the "title" one.

I would remove it and verify things still work. If they don't, we can research why the version is an issue, but reinstalling it here should not be the solution for now.

)
content: Optional[str] = adapter.extract_text()
except Exception:
content = getattr(response_message, "content", None) # type: ignore[assignment]
Member commented:

I don't particularly like this way of using getattr, as the data types are ignored. That is why you need to put # type: ignore at the end.

I know it is cumbersome, but it helps in spotting bugs ahead of time for the most part.
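
For illustration, a rough sketch of the kind of typed access being suggested; the helper name is made up and this is not the code in the PR:

# Illustrative sketch only: typed attribute access instead of
# getattr(...) plus "# type: ignore"; message_content is a made-up helper.
from typing import Optional

from openai.types.chat import ChatCompletionMessage


def message_content(message: ChatCompletionMessage) -> Optional[str]:
    # content is part of the declared model, so the type checker can verify
    # this access and no getattr or ignore comment is needed.
    return message.content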

return msgs


def extract_tool_calls_from_response_message(
Member commented:

I am just wondering whether the return type of vLLM allows for this, or whether it sometimes returns JSON as you handle here.

If there is no JSON and the ChatCompletionMessage already includes strings, I believe part of what you're doing here could be simplified to:

  1. Take the original response message and dump the message model, which produces a Python dictionary.
  2. Try to get this dictionary into the destination type ChatCompletionMessageToolCall.

Maybe this is not right for this use case, as I am not 100% sure what the schema of Llama tool-call outputs in vLLM is.
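
A minimal sketch of what that two-step suggestion could look like, assuming both the source and destination objects are pydantic models (as the openai client types are) and that ChatCompletionMessageToolCall is a concrete model in the installed version; the function name is illustrative:

# Sketch of the suggested dump-then-validate simplification; assumes
# pydantic models on both sides. convert_tool_calls is an illustrative name.
from openai.types.chat import ChatCompletionMessageToolCall


def convert_tool_calls(response_message) -> list[ChatCompletionMessageToolCall]:
    converted = []
    for raw_call in response_message.tool_calls or []:
        # 1. Dump the source message model into a plain Python dictionary.
        data = raw_call.model_dump()
        # 2. Validate that dictionary into the destination type.
        converted.append(ChatCompletionMessageToolCall.model_validate(data))
    return converted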

@blefo (Member Author) replied:

In the best case, when the model detects a tool_call, vLLM will directly receive the tool populated as JSON in the content field (this is the case for Llama-8b, but it may be model-dependent). In most cases, vLLM's integrated parser will handle the parsing correctly and return a tool like this:

ChatCompletion(id='chatcmpl-82aa4a5848a94ae3b422ec58bf7d6578', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-7d487ca08dcc4806b2ac3817d40b5815', function=Function(arguments='{"code": "import hashlib...print(hash)"}', name='execute_python'), type='function')], reasoning_content=None)...), prompt_logprobs=None, kv_transfer_params=None)

But in some cases, depending on the model's ability, the vLLM parser may fail, and the fallback function is there to parse the tool returned as JSON directly by the model. This is the case for the Llama-8b model, which writes Python code containing semicolons. The vLLM parser, for some reason, splits the tool on semicolons, and the parsing then fails because the JSON ends up incorrectly formatted (here is the vLLM parser for Llama).

(Sorry for the long reply; I hope this correctly addresses your point in the review.)
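
For clarity, a simplified sketch of what such a content-based fallback can look like; this illustrates the idea and is not the exact code in tool_router.py:

# Simplified illustration of the JSON fallback described above; not the
# exact implementation in tool_router.py.
import json
from typing import Optional


def tool_call_from_content(content: str) -> Optional[dict]:
    # Try to interpret the assistant content as a JSON-encoded tool call
    # when the vLLM parser did not populate tool_calls.
    try:
        data = json.loads(content)
    except json.JSONDecodeError:
        return None  # plain text, not a JSON tool call
    name = data.get("name")
    arguments = data.get("parameters") or data.get("arguments")
    if not name or arguments is None:
        return None
    return {"name": name, "arguments": json.dumps(arguments)}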

}

logger.info("[tools] performing follow-up completion with tool outputs")
logger.info(f"[tools] request_kwargs messages: {request_kwargs['messages']}")
Member commented:

We shouldn't log any messages. This can be used for testing, but shouldn't be logged.

}
if req.tools:
request_kwargs["tools"] = req.tools # type: ignore
request_kwargs["tool_choice"] = getattr(req, "tool_choice", "auto")
Member commented:

More getattr. This is not amazing.

)
# Update token usage

if agg_prompt_tokens or agg_completion_tokens:
Member commented:

How can this fail and need to be within a try block?

This is not right, if I am understanding it correctly. You're basically overwriting the usage that comes from the final response with the agg_<prompt|completion|total>_tokens, which means the tokens used in your second call are not being accounted for at all.
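
In other words, the follow-up call's usage should be added on top of the first call's rather than replacing the final response's usage. A sketch with illustrative names (not the variables in the PR):

# Sketch of the intended aggregation; names are illustrative, not the PR's.
from openai.types import CompletionUsage


def aggregate_usage(first: CompletionUsage, followup: CompletionUsage) -> CompletionUsage:
    # Sum both calls so the follow-up completion's tokens are counted too,
    # instead of overwriting one usage object with the other.
    prompt = first.prompt_tokens + followup.prompt_tokens
    completion = first.completion_tokens + followup.completion_tokens
    return CompletionUsage(
        prompt_tokens=prompt,
        completion_tokens=completion,
        total_tokens=prompt + completion,
    )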

dependencies = [
"httpx>=0.27.2",
"nilai-common",
"vllm==0.10.1",
Member commented:

Same issue as above. You're trying to add vllm to the dependencies when it is not needed. vLLM is added through the Docker image, and this library only adds a "probe" to connect the models to the frontend. That's why we didn't add vllm here before.



# Ensure symbol is referenced to satisfy linters (and re-export via __init__)
_CHAT_TOOL_FUNCTION_EXPORT = ChatToolFunction
Member commented:

I would prefer importing the ChatToolFunction object directly from __init__.py rather than having this trick. It doesn't look good in the code.

Another option could be to ... import Function and define ChatToolFunction: TypeAlias = Function in this file. This looks better in all senses; see the sketch below.
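
The TypeAlias option would look roughly like this; the import path for Function is an assumption based on the openai client types, not necessarily what this module actually uses:

# Sketch of the TypeAlias option; the import path for Function is an assumption.
from typing import TypeAlias

from openai.types.chat.chat_completion_message_tool_call import Function

# Re-exported alias replacing the unused-symbol trick above.
ChatToolFunction: TypeAlias = Function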

@jcabrero (Member) left a comment:


Great job. I think this PR looks extremely good; it is a huge improvement over the previous one.

Please check the comments and fix them if they make sense (especially the vllm imports, as they can significantly grow CI times and image size).

One last, more ideological comment:

  • I agree with doing just a single tool call, but more and more it feels like we should allow the models to become partial agents. Maybe a single tool call is too little compared to what other models can do.

@blefo blefo force-pushed the feat-tool-calling-llama-focused branch 2 times, most recently from 9f49aaa to 9262e82 Compare October 7, 2025 08:01
@blefo blefo merged commit df2a37b into main Oct 7, 2025
16 checks passed