feat: tool calling for llama models #155
Conversation
Pull Request Overview
This PR introduces Python code execution capabilities as a tool-call feature for LLaMA models, leveraging the e2b Code Interpreter for secure sandboxed execution. The implementation integrates seamlessly into the existing chat completion pipeline and provides comprehensive support for both structured and JSON-encoded tool calls.
- Adds secure Python code execution via e2b Code Interpreter sandbox
- Implements comprehensive tool workflow routing and execution pipeline
- Integrates tool calling into the main chat completion flow with proper token usage aggregation (a sketch of this flow follows below)
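A minimal sketch of the round trip described above, assuming an OpenAI-compatible async client; `run_tool_workflow` and `execute_python_code` are illustrative names, not the PR's actual API:

```python
async def run_tool_workflow(client, request_kwargs):
    # Initial completion: the model may or may not request a tool call.
    first = await client.chat.completions.create(**request_kwargs)
    message = first.choices[0].message
    if not message.tool_calls:
        return first

    # Echo the assistant's tool calls, run each tool, and append the outputs.
    messages = list(request_kwargs["messages"])
    messages.append({
        "role": "assistant",
        "tool_calls": [call.model_dump() for call in message.tool_calls],
    })
    for call in message.tool_calls:
        # Hypothetical sandbox helper; receives the JSON-encoded arguments.
        output = await execute_python_code(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": output})

    # Follow-up completion that can see the tool outputs.
    return await client.chat.completions.create(**{**request_kwargs, "messages": messages})
```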
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `pyproject.toml` | Adds e2b-code-interpreter dependency for sandboxed code execution |
| `code_execution.py` | Implements async-safe Python code execution in e2b sandbox |
| `tool_router.py` | Provides tool call routing, execution, and workflow management |
| `api_model.py` | Extends MessageAdapter with tool message creation methods |
| `__init__.py` | Exports new tool-related types for external use |
| `private.py` | Integrates tool workflow into chat completion pipeline |
| `test_tool_router.py` | Unit tests for tool routing and execution functionality |
| `test_code_execution.py` | E2E test validating SHA-256 computation via tool calling |
…nd enhance llama3.1 tool template (force-pushed from 49de627 to c56da7d)
```yaml
volumes:
  - hugging_face_models:/root/.cache/huggingface  # cache models
  - ./vllm_templates:/opt/vllm/templates:ro
```
Given that it is not easy to put files into nilCC, I believe we should embed this one in the vLLM image. This simplifies things overall and makes sure it is already there; in principle, it will not change in the future.
```yaml
volumes:
  - hugging_face_models:/root/.cache/huggingface  # cache models
  - ./vllm_templates:/opt/vllm/templates:ro
```
Same comment here: given that it is not easy to put files into nilCC, I believe we should embed this one in the vLLM image. This simplifies things overall and makes sure it is already there; in principle, it will not change in the future.
docker/vllm.Dockerfile (outdated)
```dockerfile
apt-get install build-essential -y && \
    pip install uv && \
    uv sync && \
    pip install --no-cache-dir --force-reinstall "vllm==0.10.1" && \
```
I know this comes from your error, but I don't think this should remain here.
vLLM is a large dependency, and we use the vLLM base image precisely to avoid having to install it every time. Also, this creates another version dependency outside the "title" one.
I would remove it and verify things still work. If they don't, we can research why the version is an issue, but reinstalling here should not be the solution for now.
```python
    )
    content: Optional[str] = adapter.extract_text()
except Exception:
    content = getattr(response_message, "content", None)  # type: ignore[assignment]
```
I don't particularly like this way of using getattr, because the data types are ignored; that is why you need # type: ignore at the end.
I know it is cumbersome, but static typing helps spot bugs ahead of time for the most part.
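A small sketch of the typed alternative being hinted at, assuming `response_message` may be an openai `ChatCompletionMessage`; the `isinstance` narrowing replaces `getattr` plus `# type: ignore`:

```python
from typing import Optional

from openai.types.chat import ChatCompletionMessage


def safe_content(response_message: object) -> Optional[str]:
    # Narrow the type explicitly so the checker can verify the access;
    # no getattr and no "type: ignore" needed.
    if isinstance(response_message, ChatCompletionMessage):
        return response_message.content
    return None
```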
```python
    return msgs


def extract_tool_calls_from_response_message(
```
I am just wondering whether the return type of vLLM allows for this, or whether it sometimes returns JSON as you do here.
If there is no JSON and the ChatCompletionMessage already includes strings, I believe part of what you're doing here could be simplified to:
- Take the original response message and dump the message model, which produces a Python dictionary
- Try to validate that dictionary into the destination type ChatCompletionMessageToolCall
Maybe not right for this use case, as I am not 100% sure what the schema for the outputs of Llama tool calls in vLLM is.
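A sketch of that dump-and-revalidate idea, assuming the openai Python types and Pydantic v2 (`TypeAdapter` also handles the case where `ChatCompletionMessageToolCall` is a union alias in newer openai versions):

```python
from pydantic import TypeAdapter
from openai.types.chat import ChatCompletionMessage, ChatCompletionMessageToolCall

_TOOL_CALL = TypeAdapter(ChatCompletionMessageToolCall)


def tool_calls_from_message(message: ChatCompletionMessage) -> list:
    # Dump the message to a plain dict, then re-validate each tool call entry
    # into the destination type.
    dumped = message.model_dump()
    return [_TOOL_CALL.validate_python(tc) for tc in (dumped.get("tool_calls") or [])]
```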
In the best case, when the model detects a tool_call, vLLM will directly receive the tool populated as JSON in the content field (for Llama-8b this is the case, but it may be model-dependent). In most cases the vLLM integrated parser will handle the parsing correctly and return a tool like this:
```python
ChatCompletion(id='chatcmpl-82aa4a5848a94ae3b422ec58bf7d6578', choices=[Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, refusal=None, role='assistant', annotations=None, audio=None, function_call=None, tool_calls=[ChatCompletionMessageFunctionToolCall(id='chatcmpl-tool-7d487ca08dcc4806b2ac3817d40b5815', function=Function(arguments='{"code": "import hashlib...print(hash)"}', name='execute_python'), type='function')], reasoning_content=None)...), prompt_logprobs=None, kv_transfer_params=None)
```
But in some cases, depending on the model's ability, the vLLM parser may fail, and then the fallback function is there to parse the tool returned as JSON by the model directly. This is the case for the Llama-8b model, which will write Python code with semicolons; the vLLM parser, for some reason, splits the tool on semicolons, and the parsing then fails because the JSON is incorrectly formatted (here is the vLLM parser for Llama).
(Sorry for the long reply, and I hope I addressed your point in the review correctly.)
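An illustrative sketch of such a fallback, assuming the model emits the tool call as raw JSON in the content field; the key names ("name", "parameters"/"arguments") are assumptions about the Llama tool-call format, not taken from the PR:

```python
import json
from typing import Any, Optional


def parse_tool_call_from_content(content: Optional[str]) -> Optional[dict[str, Any]]:
    # Fallback path: the vLLM parser failed, so try to read the tool call
    # directly from the raw JSON the model wrote into the content field.
    if not content:
        return None
    try:
        payload = json.loads(content.strip())
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict) and "name" in payload:
        return {
            "name": payload["name"],
            "arguments": payload.get("parameters") or payload.get("arguments") or {},
        }
    return None
```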
```python
}

logger.info("[tools] performing follow-up completion with tool outputs")
logger.info(f"[tools] request_kwargs messages: {request_kwargs['messages']}")
```
We shouldn't log any messages. This can be used for testing, but shouldn't be logged.
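One privacy-preserving alternative, assuming the messages are plain dicts, is to log only the shape of the request:

```python
# Log request shape only; never the message contents themselves.
logger.info(
    "[tools] performing follow-up completion with %d messages (roles: %s)",
    len(request_kwargs["messages"]),
    [m.get("role") for m in request_kwargs["messages"]],
)
```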
```python
}
if req.tools:
    request_kwargs["tools"] = req.tools  # type: ignore
    request_kwargs["tool_choice"] = getattr(req, "tool_choice", "auto")
```
More getattr. This is not amazing.
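One way to avoid it, sketched here with a hypothetical `ChatRequest` standing in for the project's request model: declare `tool_choice` as a typed field with a default so plain attribute access works:

```python
from typing import Literal, Optional

from pydantic import BaseModel


class ChatRequest(BaseModel):
    # Declaring the field with a default removes the need for
    # getattr(req, "tool_choice", "auto") at the call site.
    tools: Optional[list[dict]] = None
    tool_choice: Literal["auto", "none", "required"] = "auto"
```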
```python
)
# Update token usage

if agg_prompt_tokens or agg_completion_tokens:
```
How can this fail and need to be inside a try block?
This is not right, if I am understanding it correctly: you're basically overwriting the usage that comes from the final response with the agg_<prompt|completion|total>_tokens values, which means the tokens used in your second call are not being counted at all.
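A sketch of the accumulation being asked for, assuming two ChatCompletion-like responses with a `usage` attribute (names are illustrative):

```python
def merged_usage(first, final) -> dict[str, int]:
    # Sum usage across the initial and follow-up completions so the second
    # call's tokens are counted instead of being overwritten.
    prompt = completion = 0
    for response in (first, final):
        if response.usage is not None:
            prompt += response.usage.prompt_tokens
            completion += response.usage.completion_tokens
    return {
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }
```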
nilai-models/pyproject.toml (outdated)
```toml
dependencies = [
    "httpx>=0.27.2",
    "nilai-common",
    "vllm==0.10.1",
```
Same issue as above: you're adding vllm to the dependencies when it is not needed. vLLM comes through the Docker image, and this library only adds a "probe" to connect the models to the frontend; that's why we haven't added vllm here before.
```python
# Ensure symbol is referenced to satisfy linters (and re-export via __init__)
_CHAT_TOOL_FUNCTION_EXPORT = ChatToolFunction
```
I would prefer importing the ChatToolFunction object directly in __init__.py rather than having this trick; it doesn't look good in the code.
Another option could be to import Function and define ChatToolFunction: TypeAlias = Function in this file. That looks better in every sense.
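A sketch of that second option; the exact import path for `Function` is an assumption and may vary with the openai version:

```python
from typing import TypeAlias

# Import path is an assumption; adjust to wherever Function lives in your
# openai/vLLM types.
from openai.types.chat.chat_completion_message_tool_call import Function

# Alias the upstream type instead of keeping a dummy re-export assignment
# just to satisfy the linter.
ChatToolFunction: TypeAlias = Function
```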
jcabrero left a comment
Great job. I think this PR looks extremely good, a huge improvement over the previous one.
Please check the comments and fix them where they make sense (especially the vllm imports, as they can significantly grow CI times and image size).
One last, more ideological comment:
- I agree with doing just a single tool call, but more and more it feels like we should allow the models to become partial agents. Maybe a single tool call is too little compared to what other models can do.
…n the vllm.Dockerfile (force-pushed from 9f49aaa to 9262e82)
This pull request introduces support for Python code execution as a tool-call, leveraging the e2b Code Interpreter for secure sandboxed execution. The changes include new handler modules for routing and executing code tool-calls, updates to core data models to support tool-call messages, and integration of the tool workflow into the main chat completion pipeline. Additionally, an end-to-end test validates the new code execution capability.
Tool-call and code execution support:
- Adds e2b-code-interpreter as a dependency to enable secure Python code execution in a sandbox.
- Adds the code_execution.py handler, providing async-safe execution of Python code in a sandbox and returning output for tool-calls.
- Adds tool_router.py to route, execute, and process tool-calls, extract tool-calls from model responses, and manage the tool workflow including follow-up completions with tool outputs.

Integration into chat workflow:
End-to-end testing:
For testing the endpoint:
```json
{
  "model": "meta-llama/Llama-3.2-1B-Instruct",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant. When a user asks a question that requires calculation, use the execute_python tool to find the answer. After the tool provides its result, you must use that result to formulate a clear, final answer to the user's original question. Do not include any code or JSON in your final response."
    },
    {
      "role": "user",
      "content": "Run a python code for computing the squared sum from zero to ten."
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "execute_python",
        "description": "Executes a snippet of Python code in a secure sandbox and returns the standard output.",
        "parameters": {
          "type": "object",
          "properties": {
            "code": {
              "type": "string",
              "description": "The Python code to be executed."
            }
          },
          "required": ["code"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}
```
The larger the model, the better its performance.
A well-crafted system prompt is critical, especially when it clearly specifies the available tools and how they should be invoked. The more precise the prompt, the more effectively the model can select the right tool and return accurate results.
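A hedged usage sketch for sending the request above through an OpenAI-compatible client; the base URL and API key are placeholders, not values from the PR:

```python
from openai import OpenAI

# Placeholder endpoint and key: point these at the deployed OpenAI-compatible API.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="<your-key>")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[
        {
            "role": "system",
            "content": (
                "You are a helpful assistant. When a user asks a question that "
                "requires calculation, use the execute_python tool to find the answer."
            ),
        },
        {
            "role": "user",
            "content": "Run a python code for computing the squared sum from zero to ten.",
        },
    ],
    tools=[
        {
            "type": "function",
            "function": {
                "name": "execute_python",
                "description": "Executes a snippet of Python code in a secure sandbox and returns the standard output.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "code": {"type": "string", "description": "The Python code to be executed."}
                    },
                    "required": ["code"],
                },
            },
        }
    ],
    tool_choice="auto",
)
print(response.choices[0].message.content)
```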