[WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
Conversation
Thanks for the effort to bring this nice feature 🥇 . Please mind to push commits on your fork first, as it triggers a lot of CI runs on the main repo.
@phymbert sorry for the CI noise again today, wanted to get the PR in good working order.
Please forgive my comment — firstly because I am sure you do your best, and besides, I personally pushed 60+ commits in the last 4 days ;) and now the CI is canceling concurrent jobs. Good luck!
Force-pushed from 2ba7150 to 7346208.
This is a bit off-topic, but I noticed in your example call that you're using greedy sampling. Is there a particular reason? If so, I think there is an opportunity for a speedup when using grammars together with greedy sampling, but I wasn't sure how frequently greedy sampling is used, so I haven't chased it down yet.
Good day @ochafik, According to the official OpenAI API, they produce a string too — look here, for instance:

Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_ujD1NwPxzeOSCbgw2NOabOin', function=Function(arguments='{\n "location": "Glasgow, Scotland",\n "format": "celsius",\n "num_days": 5\n}', name='get_n_day_weather_forecast'), type='function')]), internal_metrics=[{'cached_prompt_tokens': 128, 'total_accepted_tokens': 0, 'total_batched_tokens': 273, 'total_predicted_tokens': 0, 'total_rejected_tokens': 0, 'total_tokens_in_completion': 274, 'cached_embeddings_bytes': 0, 'cached_embeddings_n': 0, 'uncached_embeddings_bytes': 0, 'uncached_embeddings_n': 0, 'fetched_embeddings_bytes': 0, 'fetched_embeddings_n': 0, 'n_evictions': 0, 'sampling_steps': 40, 'sampling_steps_with_predictions': 0, 'batcher_ttft': 0.035738229751586914, 'batcher_initial_queue_time': 0.0007979869842529297}])

Might be something to do with extra security / agent isolation.
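To make the stringification concrete: in the OpenAI Python client, `function.arguments` arrives as a JSON-encoded string that the caller is expected to decode itself. A minimal sketch using the values from the `Choice` above:

```python
import json

# `arguments` is a JSON-encoded *string*, not an object; the client decodes it:
raw = '{\n "location": "Glasgow, Scotland",\n "format": "celsius",\n "num_days": 5\n}'
args = json.loads(raw)
assert args["num_days"] == 5
```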
Found `expects_stringified_function_arguments` ...
@skoulik thanks so much for testing this out, and for reporting this stringification issue! Preparing a fix. Also, I've been toying w/ moving some or all of that logic to C++, stay tuned :-D
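A sketch of the kind of normalization such a fix involves (a hypothetical helper, not the PR's actual code):

```python
import json
from typing import Any

def normalize_arguments(arguments: Any) -> str:
    """Always return tool-call `arguments` as a JSON-encoded string, whether
    the chat template produced an object or an already-stringified payload.
    (Hypothetical helper, not the PR's actual code.)"""
    if isinstance(arguments, str):
        json.loads(arguments)  # validate that it round-trips as JSON
        return arguments
    return json.dumps(arguments)
```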
I mean, this is important to me too haha, but yeah I've been very side-tracked by real-life stuff / the Spring here 😎
I've done a quick test and can confirm that it works with the LlamaIndex example now. LangChain will most likely work too.
My five cents: I noticed that you've been experimenting with prompts throughout your commit history. I reckon it might be beneficial to have them customizable, too (in addition to the hard-coded templates). People are going to use the server with a plethora of different models, some of which might prefer other flavours of prompts to work better.
Superseded by #9639
Still very rough, but sharing a draft to get early feedback on the general direction.
This is an experiment in adding grammar-constrained tool support to llama.cpp, with a simple example of running agentic code on top, and support for sandboxing unsafe tools (e.g. Python interpreter).
Instead of bloating `server.cpp` any further, this slaps a Python layer in front of it to handle tool calling (partly because it's hard to do things well w/o proper jinja2 support - templates handle tool calling peculiarly at best, and partly because this could be a way to simplify the C++ server and focus it on performance and security rather than dealing with schemas and chat templates; WDYT?). So this PR has a long way to go, but here's what can be done with it:
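For a feel of the direction, here's a hedged sketch of calling such a layer with the stock `openai` client — the base URL, port, and model name are placeholders, not this PR's actual defaults:

```python
from openai import OpenAI

# Assumes the Python layer serves an OpenAI-compatible API locally
# (port and model name are placeholders).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

response = client.chat.completions.create(
    model="mixtral-8x7b-instruct-v0.1",
    messages=[{"role": "user", "content": "What's 2535 squared?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "pow",  # hypothetical tool for illustration
            "description": "Raise a number to a power",
            "parameters": {
                "type": "object",
                "properties": {
                    "base": {"type": "number"},
                    "exp": {"type": "number"},
                },
                "required": ["base", "exp"],
            },
        },
    }],
)
print(response.choices[0].message.tool_calls)
```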
```
python -m examples.agent \
    --model mixtral-8x7b-instruct-v0.1.Q8_0.gguf \
    --tools examples/agent/tools/example_math_tools.py \
    --greedy \
    --goal "What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?"
```
```
python -m examples.agent \
    --tools examples/agent/tools/fake_weather_tools.py \
    --goal "What is the weather going to be like in San Francisco and Glasgow over the next 4 days." \
    --greedy
```
```
python -m examples.agent --std-tools --goal "Say something nice in 1 minute."
```
Add `--verbose` to see what's going on, and look at examples/agent/README & examples/openai/README for more details.

Tool sandboxing
Since tools can quickly become unsafe (don't want a rogue AI poking at your files), I've added a simple script to sandbox tools. It wraps a Python module as a REST server inside a Docker container exposing its port, and since it's using FastAPI it gives a neat OpenAPI schema that can be consumed by the agent code.
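Roughly the shape of the idea (a sketch, not the PR's actual wrapper script): wrap a tool module's functions in a FastAPI app; FastAPI then serves an OpenAPI schema at /openapi.json, which the agent code can use to discover the tools. Route and function names here are hypothetical:

```python
import subprocess

from fastapi import FastAPI

app = FastAPI()

@app.post("/execute_python")  # one POST route per tool function
def execute_python(source: str) -> str:
    """Hypothetical tool: run a Python snippet in a subprocess, return stdout."""
    out = subprocess.run(
        ["python", "-c", source],
        capture_output=True, text=True, timeout=30,
    )
    return out.stdout or out.stderr

# Inside the container, serve with e.g.:
#   uvicorn sandbox:app --host 0.0.0.0 --port 9999
```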
Run this in a separate terminal to get a sandboxed Python interpreter (`DATA_DIR` will contain any files created by Python programs). Then tell the agent to discover tools at the new endpoint:
```
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Whats cos(123) / 23 * 12.6 ?"
```
```
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Create a file with 100k spaces"
```
```
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Write and run a program with a syntax error, then fix it"
```
Everybody gets tool calling support!
Some models have been explicitly fine-tuned for tool usage (e.g. Functionary, with tentative support in #5695, or Hermes 2 Pro Mistral 7B, which has a nice repo about it).
Other models don't officially support tool calling, at least not in their open-source releases... (Mixtral 👀)
But since #5978, all can be coerced into sticking to a specific JSON schema.
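The mechanism, roughly (a hedged sketch, not this PR's code — the endpoint and fields are the stock llama.cpp server's, and the hand-written grammar stands in for what a JSON-schema-to-grammar conversion would emit):

```python
import requests

# Hand-written GBNF standing in for what a schema like {"answer": string}
# might compile to via the JSON-schema-to-grammar conversion.
grammar = r'''
root   ::= "{" space "\"answer\"" space ":" space string "}" space
string ::= "\"" [^"]* "\""
space  ::= " "?
'''

# Assumes a llama.cpp server on the default port; /completion accepts a
# `grammar` field that constrains sampling to the given grammar.
resp = requests.post("http://localhost:8080/completion", json={
    "prompt": "Reply in JSON. What is the capital of Scotland?",
    "grammar": grammar,
    "n_predict": 64,
})
print(resp.json()["content"])
```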
This example supports the following tool prompting strategies in examples/openai/prompting.py (see dizzying combos of outputs):
- `--style=thoughtful_steps`: the default unless a Functionary template is detected. Constrains the output to a JSON shape (advertised to the model as a JSON schema, expressed as a TypeScript signature) that fully constrains all of the function arguments.

  It seems quite important to give the model some space to think before it even decides whether it has the final output or needs extra steps (`thought` might work just as well, YMMV). Note that by default only 1 tool call is allowed, but for models that support parallel tool calling you can pass `--parallel-calls` (Functionary does this well, but Mixtral-instruct tends to hallucinate).
- `--style=functionary_v2`: besides using the proper template, this formats the signatures to TypeScript and deals with interesting edge cases (TODO: check whether this model has the only template that expects function calls' arguments to be a JSON string, as opposed to a JSON object).
- `--style=short` / `--style=long`: announces tools in a `<tool>...schemas...</tool>` system message, and uses less constrained output that allows mixing text with `<tool_call>{json}</tool_call>` inserts. Since there is no negative lookahead (nor reluctant repetition modifier), I found it hard to write a grammar that allows "any text not containing `<tool_call>`, then maybe a `<tool_call>`". I settled for something a bit brittle (`content := [^<] | "<" [^t<] | "<t" [^o<]`), suggestions welcome!
- `--style=mixtral`: OK, now it gets weird. Mixtral works well w/ `--style=thoughtful_steps` (I just had to collapse `system` and `tool` messages into `user` messages, as its chat template is very restrictive), but when prompted w/ `You have these tools <tools>{json schemas}</tools>` it spontaneously calls tools with the semi-standard syntax used by Hermes too... except with spurious underscore escapes 🤔. So in the `mixtral` style I just unescape underscores (see the sketch after this list) and we get a tool-calling Mixtral; the style is otherwise much like `long`/`short` and would also benefit from more grammar features.
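A sketch of that underscore unescaping (function name hypothetical, not the PR's actual code):

```python
import json

def unescape_underscores(raw: str) -> str:
    # Mixtral tends to emit "\_" inside tool-call JSON; strip the bogus escape
    # so the payload parses as valid JSON.
    return raw.replace("\\_", "_")

raw_call = r'{"name": "get\_n\_day\_weather\_forecast", "arguments": {"num\_days": 5}}'
print(json.loads(unescape_underscores(raw_call)))
# {'name': 'get_n_day_weather_forecast', 'arguments': {'num_days': 5}}
```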
and would also benefit from more grammar features)TODOs