[WIP] agent example (w/ sandboxable Tools!) & improved OAI compatibility layer (in Python) #6389
Conversation
Thanks for the effort to bring this nice feature 🥇 . Please mind to push commits on your fork first, as it triggers a lot of CI runs on the main repo.
@phymbert sorry for the CI noise again today, wanted to get the PR in good working order.
Please forgive my comment — firstly because I am sure you do your best, and besides, I personally pushed 60+ commits in the last 4 days ;) and now the CI is canceling concurrent jobs. Good luck!
Force-pushed from 2ba7150 to 7346208.
This is a bit off-topic, but I noticed in your example call that you're using greedy sampling. Is there a particular reason? If so, I think there is an opportunity for a speedup when using grammars together with greedy sampling, but I wasn't sure how frequently greedy sampling is used, so I haven't chased it down yet.
Good day @ochafik, According to the official OpenAI API, they produce a string too — look here, for instance:

Choice(finish_reason='tool_calls', index=0, logprobs=None, message=ChatCompletionMessage(content=None, role='assistant', function_call=None, tool_calls=[ChatCompletionMessageToolCall(id='call_ujD1NwPxzeOSCbgw2NOabOin', function=Function(arguments='{\n "location": "Glasgow, Scotland",\n "format": "celsius",\n "num_days": 5\n}', name='get_n_day_weather_forecast'), type='function')]), internal_metrics=[{'cached_prompt_tokens': 128, 'total_accepted_tokens': 0, 'total_batched_tokens': 273, 'total_predicted_tokens': 0, 'total_rejected_tokens': 0, 'total_tokens_in_completion': 274, 'cached_embeddings_bytes': 0, 'cached_embeddings_n': 0, 'uncached_embeddings_bytes': 0, 'uncached_embeddings_n': 0, 'fetched_embeddings_bytes': 0, 'fetched_embeddings_n': 0, 'n_evictions': 0, 'sampling_steps': 40, 'sampling_steps_with_predictions': 0, 'batcher_ttft': 0.035738229751586914, 'batcher_initial_queue_time': 0.0007979869842529297}])

Might be something to do with extra security / agent isolation.
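To make the stringification concrete: in the OpenAI Python client, `function.arguments` arrives as a JSON-encoded string that the caller is expected to decode itself. A minimal sketch using the values from the `Choice` above:

```python
import json

# `arguments` is a JSON-encoded *string*, not an object; the client decodes it:
raw = '{\n "location": "Glasgow, Scotland",\n "format": "celsius",\n "num_days": 5\n}'
args = json.loads(raw)
assert args["num_days"] == 5
```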
Found `expects_stringified_function_arguments` ...
@skoulik thanks so much for testing this out, and for reporting this stringification issue! Preparing a fix. Also, I've been toying w/ moving some or all of that logic to C++, stay tuned :-D
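A sketch of the kind of normalization such a fix involves (a hypothetical helper, not the PR's actual code):

```python
import json
from typing import Any

def normalize_arguments(arguments: Any) -> str:
    """Always return tool-call `arguments` as a JSON-encoded string, whether
    the chat template produced an object or an already-stringified payload.
    (Hypothetical helper, not the PR's actual code.)"""
    if isinstance(arguments, str):
        json.loads(arguments)  # validate that it round-trips as JSON
        return arguments
    return json.dumps(arguments)
```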
I mean, this is important to me too haha, but yeah I've been very side-tracked by real-life stuff / the Spring here 😎
I've done a quick test and can confirm that it works with the LlamaIndex example now. LangChain will most likely work too.
My five cents: I noticed that you've been experimenting with prompts throughout your commit history. I reckon it might be beneficial to have them customizable, too (in addition to the hard-coded templates). People are going to use the server with a plethora of different models, some of which might prefer other flavours of prompts to work better.
Superseded by #9639
Still very rough, but sharing a draft to get early feedback on the general direction.
This is an experiment in adding grammar-constrained tool support to llama.cpp, with a simple example of running agentic code on top, and support for sandboxing unsafe tools (e.g. Python interpreter).
Instead of bloating `server.cpp` any further, this slaps a Python layer in front of it to handle tool calling (partly because it's hard to do things well w/o proper jinja2 support - templates handle tool calling peculiarly at best, and partly because this could be a way to simplify the C++ server and focus it on performance and security rather than dealing with schemas and chat templates; WDYT?). So this PR has a long way to go, but here's what can be done with it:
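For a feel of the direction, here's a hedged sketch of calling such a layer with the stock `openai` client — the base URL, port, and model name are placeholders, not this PR's actual defaults:

```python
from openai import OpenAI

# Assumes the Python layer serves an OpenAI-compatible API locally
# (port and model name are placeholders).
client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-local")

response = client.chat.completions.create(
    model="mixtral-8x7b-instruct-v0.1",
    messages=[{"role": "user", "content": "What's 2535 squared?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "pow",  # hypothetical tool for illustration
            "description": "Raise a number to a power",
            "parameters": {
                "type": "object",
                "properties": {
                    "base": {"type": "number"},
                    "exp": {"type": "number"},
                },
                "required": ["base", "exp"],
            },
        },
    }],
)
print(response.choices[0].message.tool_calls)
```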
```
python -m examples.agent \
    --model mixtral-8x7b-instruct-v0.1.Q8_0.gguf \
    --tools examples/agent/tools/example_math_tools.py \
    --greedy \
    --goal "What is the sum of 2535 squared and 32222000403 then multiplied by one and a half. What's a third of the result?"
```
```
python -m examples.agent \
    --tools examples/agent/tools/fake_weather_tools.py \
    --goal "What is the weather going to be like in San Francisco and Glasgow over the next 4 days." \
    --greedy
```
```
python -m examples.agent --std-tools --goal "Say something nice in 1 minute."
```
Add `--verbose` to see what's going on, and look at examples/agent/README & examples/openai/README for more details.

Tool sandboxing
Since tools can quickly become unsafe (don't want a rogue AI poking at your files), I've added a simple script to sandbox tools. It wraps a Python module as a REST server inside a Docker container exposing its port, and since it's using FastAPI it gives a neat OpenAPI schema that can be consumed by the agent code.
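Roughly the shape of the idea (a sketch, not the PR's actual wrapper script): wrap a tool module's functions in a FastAPI app; FastAPI then serves an OpenAPI schema at /openapi.json, which the agent code can use to discover the tools. Route and function names here are hypothetical:

```python
import subprocess

from fastapi import FastAPI

app = FastAPI()

@app.post("/execute_python")  # one POST route per tool function
def execute_python(source: str) -> str:
    """Hypothetical tool: run a Python snippet in a subprocess, return stdout."""
    out = subprocess.run(
        ["python", "-c", source],
        capture_output=True, text=True, timeout=30,
    )
    return out.stdout or out.stderr

# Inside the container, serve with e.g.:
#   uvicorn sandbox:app --host 0.0.0.0 --port 9999
```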
Run this in a separate terminal to get a sandboxed Python interpreter (`DATA_DIR` will contain any files created by Python programs). Then tell the agent to discover tools at the new endpoint:
```
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Whats cos(123) / 23 * 12.6 ?"
```
```
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Create a file with 100k spaces"
```
```
python -m examples.agent \
    --tools http://localhost:9999 \
    --goal "Write and run a program with a syntax error, then fix it"
```
Everybody gets tool calling support!
Some models have been explicitly fine-tuned for tool usage (e.g. Functionary, with tentative support in #5695, or Hermes 2 Pro Mistral 7B, which has a nice repo about it).
Other models don't officially support tool calling, at least not in their open-source releases... (Mixtral 👀)
But since #5978, all can be coerced into sticking to a specific JSON schema.
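The mechanism, roughly (a hedged sketch, not this PR's code — the endpoint and fields are the stock llama.cpp server's, and the hand-written grammar stands in for what a JSON-schema-to-grammar conversion would emit):

```python
import requests

# Hand-written GBNF standing in for what a schema like {"answer": string}
# might compile to via the JSON-schema-to-grammar conversion.
grammar = r'''
root   ::= "{" space "\"answer\"" space ":" space string "}" space
string ::= "\"" [^"]* "\""
space  ::= " "?
'''

# Assumes a llama.cpp server on the default port; /completion accepts a
# `grammar` field that constrains sampling to the given grammar.
resp = requests.post("http://localhost:8080/completion", json={
    "prompt": "Reply in JSON. What is the capital of Scotland?",
    "grammar": grammar,
    "n_predict": 64,
})
print(resp.json()["content"])
```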
This example supports the following tool prompting strategies in examples/openai/prompting.py (see dizzying combos of outputs):
- `--style=thoughtful_steps`: the default unless a Functionary template is detected. Constrains the output to a JSON shape (advertised to the model as a JSON schema, expressed as a TypeScript signature) that fully constrains all of the function arguments.

  It seems quite important to give the model some space to think before it even decides whether it has the final output or needs extra steps (`thought` might work just as well, YMMV). Note that by default only 1 tool call is allowed, but for models that support parallel tool calling you can pass `--parallel-calls` (Functionary does this well, but Mixtral-instruct tends to hallucinate).
- `--style=functionary_v2`: besides using the proper template, this formats the signatures to TypeScript and deals with interesting edge cases (TODO: check whether this model has the only template that expects function calls' arguments to be a JSON string, as opposed to a JSON object).
- `--style=short` / `--style=long`: announces tools in a `<tool>...schemas...</tool>` system message, and uses less constrained output that allows mixing text with `<tool_call>{json}</tool_call>` inserts. Since there is no negative lookahead (nor reluctant repetition modifier), I found it hard to write a grammar that allows "any text not containing `<tool_call>`, then maybe a `<tool_call>`". I settled for something a bit brittle (`content := [^<] | "<" [^t<] | "<t" [^o<]`), suggestions welcome!
- `--style=mixtral`: OK, now it gets weird. Mixtral works well w/ `--style=thoughtful_steps` (I just had to collapse `system` and `tool` messages into `user` messages, as its chat template is very restrictive), but when prompted w/ `You have these tools <tools>{json schemas}</tools>` it spontaneously calls tools with the semi-standard syntax used by Hermes too... except with spurious underscore escapes 🤔. So in the `mixtral` style I just unescape underscores (see the sketch after this list) and we get a tool-calling Mixtral; the style is otherwise much like `long`/`short` and would also benefit from more grammar features.
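A sketch of that underscore unescaping (function name hypothetical, not the PR's actual code):

```python
import json

def unescape_underscores(raw: str) -> str:
    # Mixtral tends to emit "\_" inside tool-call JSON; strip the bogus escape
    # so the payload parses as valid JSON.
    return raw.replace("\\_", "_")

raw_call = r'{"name": "get\_n\_day\_weather\_forecast", "arguments": {"num\_days": 5}}'
print(json.loads(unescape_underscores(raw_call)))
# {'name': 'get_n_day_weather_forecast', 'arguments': {'num_days': 5}}
```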
and would also benefit from more grammar features)TODOs