|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "id": "6728b05f-e3bb-487a-8818-e0d5d18b5501", |
| 6 | + "metadata": {}, |
| 7 | + "source": [ |
| 8 | + "# Agent Tool Usage Tasks\n", |
| 9 | + "\n", |
| 10 | + "These tasks grade how effectively your agent uses tools to accomplish a goal.\n", |
| 11 | + "\n", |
| 12 | + "You can check an up-to-date list of tool usage tasks in the registry:" |
| 13 | + ] |
| 14 | + }, |
| 15 | + { |
| 16 | + "cell_type": "code", |
| 17 | + "execution_count": 3, |
| 18 | + "id": "a57e65d7-dbd6-4128-8260-f6ee3b43157c", |
| 19 | + "metadata": {}, |
| 20 | + "outputs": [ |
| 21 | + { |
| 22 | + "data": { |
| 23 | + "text/html": [ |
| 24 | + "<table>\n", |
| 25 | + "<thead>\n", |
| 26 | + "<tr><th>Name </th><th>Type </th><th>Dataset ID </th><th>Description </th></tr>\n", |
| 27 | + "</thead>\n", |
| 28 | + "<tbody>\n", |
| 29 | + "<tr><td>Tool Usage - Typewriter (1 tool) </td><td>ToolUsageTask</td><td><a href=\"https://smith.langchain.com/public/59577193-8938-4ccf-92a7-e8a96bcf4f86/d\" target=\"_blank\" rel=\"noopener\">59577193-8938-4ccf-92a7-e8a96bcf4f86</a></td><td>Environment with a single tool that accepts a single letter as input, and prints it on a piece of virtual paper.\n", |
| 30 | + "\n", |
| 31 | + "The objective of this task is to evaluate the ability of the model to use the provided tools to repeat a given input string.\n", |
| 32 | + "\n", |
| 33 | + "For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order.\n", |
| 34 | + "\n", |
| 35 | + "The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string. </td></tr>\n", |
| 36 | + "<tr><td>Tool Usage - Typewriter (26 tools)</td><td>ToolUsageTask</td><td><a href=\"https://smith.langchain.com/public/128af05e-aa00-4e3b-a958-d166dd450581/d\" target=\"_blank\" rel=\"noopener\">128af05e-aa00-4e3b-a958-d166dd450581</a></td><td>Environment with 26 tools each tool represents a letter of the alphabet.\n", |
| 37 | + "\n", |
| 38 | + "The objective of this task is to evaluate the model's ability to use tools\n", |
| 39 | + "for a simple repetition task.\n", |
| 40 | + "\n", |
| 41 | + "For example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order.\n", |
| 42 | + "\n", |
| 43 | + "The dataset includes examples of varying difficulty. The difficulty is measured by the length of the string.\n", |
| 44 | + "\n", |
| 45 | + "This is a variation of the typewriter task, where 26 parameterless tools are\n", |
| 46 | + "given instead of a single tool that takes a letter as an argument. </td></tr>\n", |
| 47 | + "<tr><td>Tool Usage - Relational Data </td><td>ToolUsageTask</td><td><a href=\"https://smith.langchain.com/public/1d89f4b3-5f73-48cf-a127-2fdeb22f6d84/d\" target=\"_blank\" rel=\"noopener\">1d89f4b3-5f73-48cf-a127-2fdeb22f6d84</a></td><td>Environment with fake data about users and their locations and favorite foods.\n", |
| 48 | + "\n", |
| 49 | + "The environment provides a set of tools that can be used to query the data.\n", |
| 50 | + "\n", |
| 51 | + "The objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data.\n", |
| 52 | + "\n", |
| 53 | + "The dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question.\n", |
| 54 | + "\n", |
| 55 | + "Each example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question.\n", |
| 56 | + "\n", |
| 57 | + "Success is measured by the ability to answer the question correctly, and efficiently. </td></tr>\n", |
| 58 | + "<tr><td>Multiverse Math </td><td>ToolUsageTask</td><td><a href=\"https://smith.langchain.com/public/594f9f60-30a0-49bf-b075-f44beabf546a/d\" target=\"_blank\" rel=\"noopener\">594f9f60-30a0-49bf-b075-f44beabf546a</a></td><td>An environment that contains a few basic math operations, but with altered results.\n", |
| 59 | + "\n", |
| 60 | + "For example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected.\n", |
| 61 | + "\n", |
| 62 | + "The objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math. </td></tr>\n", |
| 63 | + "</tbody>\n", |
| 64 | + "</table>" |
| 65 | + ], |
| 66 | + "text/plain": [ |
| 67 | + "Registry(tasks=[ToolUsageTask(name='Tool Usage - Typewriter (1 tool)', dataset_id='https://smith.langchain.com/public/59577193-8938-4ccf-92a7-e8a96bcf4f86/d', description=\"Environment with a single tool that accepts a single letter as input, and prints it on a piece of virtual paper.\\n\\nThe objective of this task is to evaluate the ability of the model to use the provided tools to repeat a given input string.\\n\\nFor example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order.\\n\\nThe dataset includes examples of varying difficulty. The difficulty is measured by the length of the string.\\n\", create_environment=<function get_environment at 0x132f9cea0>, instructions=\"Repeat the given string using the provided tools. Do not write anything else or provide any explanations. For example, if the string is 'abc', you must print the letters 'a', 'b', and 'c' one at a time and in that order. \"), ToolUsageTask(name='Tool Usage - Typewriter (26 tools)', dataset_id='https://smith.langchain.com/public/128af05e-aa00-4e3b-a958-d166dd450581/d', description=\"Environment with 26 tools each tool represents a letter of the alphabet.\\n\\nThe objective of this task is to evaluate the model's ability the use tools\\nfor a simple repetition task.\\n\\nFor example, if the string is 'abc', the tools 'a', 'b', and 'c' must be invoked in that order.\\n\\nThe dataset includes examples of varying difficulty. The difficulty is measured by the length of the string.\\n\\nThis is a variation of the typer writer task, where 26 parameterless tools are\\ngiven instead of a single tool that takes a letter as an argument.\\n\", create_environment=<function get_environment at 0x132f9d3a0>, instructions=\"Repeat the given string by using the provided tools. Do not write anything else or provide any explanations. For example, if the string is 'abc', you must invoke the tools 'a', 'b', and 'c' in that order. 
Please invoke the functions without any arguments.\"), ToolUsageTask(name='Tool Usage - Relational Data', dataset_id='https://smith.langchain.com/public/1d89f4b3-5f73-48cf-a127-2fdeb22f6d84/d', description='Environment with fake data about users and their locations and favorite foods.\\n\\nThe environment provides a set of tools that can be used to query the data.\\n\\nThe objective of this task is to evaluate the ability to use the provided tools to answer questions about relational data.\\n\\nThe dataset contains 21 examples of varying difficulty. The difficulty is measured by the number of tools that need to be used to answer the question.\\n\\nEach example is composed of a question, a reference answer, and information about the sequence in which tools should be used to answer the question.\\n\\nSuccess is measured by the ability to answer the question correctly, and efficiently.\\n', create_environment=<function get_environment at 0x132f9c9a0>, instructions=\"Please answer the user's question by using the tools provided. Do not guess the answer. Keep in mind that entities like users,foods and locations have both a name and an ID, which are not the same.\"), ToolUsageTask(name='Multiverse Math', dataset_id='https://smith.langchain.com/public/594f9f60-30a0-49bf-b075-f44beabf546a/d', description='An environment that contains a few basic math operations, but with altered results.\\n\\nFor example, multiplication of 5*3 will be re-interpreted as 5*3*1.1. The basic operations retain some basic properties, such as commutativity, associativity, and distributivity; however, the results are different than expected.\\n\\nThe objective of this task is to evaluate the ability to use the provided tools to solve simple math questions and ignore any innate knowledge about math.\\n', create_environment=<function get_environment at 0x132f9c2c0>, instructions='You are requested to solve math questions in an alternate mathematical universe. 
The operations have been altered to yield different results than expected. Do not guess the answer or rely on your innate knowledge of math. Use the provided tools to answer the question. While associativity and commutativity apply, distributivity does not. Answer the question using the fewest possible tools. Only include the numeric response without any clarifications.')])" |
| 68 | + ] |
| 69 | + }, |
| 70 | + "execution_count": 3, |
| 71 | + "metadata": {}, |
| 72 | + "output_type": "execute_result" |
| 73 | + } |
| 74 | + ], |
| 75 | + "source": [ |
| 76 | + "from langchain_benchmarks import registry\n", |
| 77 | + "\n", |
| 78 | + "registry.filter(Type=\"ToolUsageTask\")" |
| 79 | + ] |
| 80 | + }, |
| 81 | + { |
| 82 | + "cell_type": "markdown", |
| 83 | + "id": "9f54cdd3-67f6-43ba-a929-1a6ed1b01296", |
| 84 | + "metadata": {}, |
| 85 | + "source": [ |
| 86 | + "### Task resources\n", |
| 87 | + "\n", |
| 88 | + "In addition to a name, dataset_id, and description, each task can be paired with a shared agent factory from the `tool_usage` module to help you get started:" |
| 89 | + ] |
| 90 | + }, |
| 91 | + { |
| 92 | + "cell_type": "code", |
| 93 | + "execution_count": null, |
| 94 | + "id": "3363e86d-3c86-4297-81b6-f73899be48b0", |
| 95 | + "metadata": {}, |
| 96 | + "outputs": [], |
| 97 | + "source": [ |
| 98 | + "from langchain_benchmarks.tool_usage import agents\n", |
| 99 | + "task = registry[\"Tool Usage - Typewriter (1 tool)\"]  # example choice; any task listed above works\n", |
| 100 | + "agent_factory = agents.OpenAIAgentFactory(task, model=\"gpt-3.5-turbo-16k\")" |
| 101 | + ] |
| 102 | + }, |
| 103 | + { |
| 104 | + "cell_type": "markdown", |
| 105 | + "id": "994bb145-9b12-4a60-87da-003f44dd13e5", |
| 106 | + "metadata": {}, |
| 107 | + "source": [ |
| 108 | + "Each task also defines a `create_environment` method that returns a `ToolUsageEnvironment` object:\n", |
| 109 | + "\n", |
| 110 | + "```python\n", |
| 111 | + "class ToolUsageEnvironment:\n", |
| 112 | + " \"\"\"An instance of an environment for tool usage.\"\"\"\n", |
| 113 | + "\n", |
| 114 | + " tools: List[BaseTool]\n", |
| 115 | + " \"\"\"The tools that can be used in the environment.\"\"\"\n", |
| 116 | + "\n", |
| 117 | + " read_state: Optional[Callable[[], Any]] = None\n", |
| 118 | + " \"\"\"A function that returns the current state of the environment.\"\"\"\n", |
| 119 | + "```\n", |
| 120 | + "\n", |
| 121 | + "The environment defines the tools available for a given dataset and lets evaluators read the world state to grade the agent." |
| 122 | + ] |
| 123 | + }, |
| 124 | + { |
| 125 | + "cell_type": "markdown", |
| 126 | + "id": "3d5e48c4-d5d0-4d19-9bab-61c01a512f21", |
| 127 | + "metadata": {}, |
| 128 | + "source": [ |
| 129 | + "### Dataset schema\n", |
| 130 | + "\n", |
| 131 | + "Each task corresponds to a LangSmith dataset with the following schema:\n", |
| 132 | + "\n", |
| 133 | + "Inputs:\n", |
| 134 | + "- `question: str` - the user question\n", |
| 135 | + "\n", |
| 136 | + "Outputs:\n", |
| 137 | + "- `expected_steps: list` - the expected order of tools used\n", |
| 138 | + "- `reference: str` - the expected answer\n", |
| 139 | + "\n", |
| 140 | + "There may be additional output keys, such as:\n", |
| 141 | + "- `order_matters: bool` - whether the order of tool invocations matters\n", |
| 142 | + "- `state: any` - the final state the environment should be in after the agent handles a given data point" |
| 143 | + ] |
| 144 | + }, |
| 145 | + { |
| 146 | + "cell_type": "code", |
| 147 | + "execution_count": null, |
| 148 | + "id": "cd9dea3f-68b5-47f7-9a12-c7b5eafc4a37", |
| 149 | + "metadata": {}, |
| 150 | + "outputs": [], |
| 151 | + "source": [] |
| 152 | + } |
| 153 | + ], |
| 154 | + "metadata": { |
| 155 | + "kernelspec": { |
| 156 | + "display_name": "Python 3 (ipykernel)", |
| 157 | + "language": "python", |
| 158 | + "name": "python3" |
| 159 | + }, |
| 160 | + "language_info": { |
| 161 | + "codemirror_mode": { |
| 162 | + "name": "ipython", |
| 163 | + "version": 3 |
| 164 | + }, |
| 165 | + "file_extension": ".py", |
| 166 | + "mimetype": "text/x-python", |
| 167 | + "name": "python", |
| 168 | + "nbconvert_exporter": "python", |
| 169 | + "pygments_lexer": "ipython3", |
| 170 | + "version": "3.11.2" |
| 171 | + } |
| 172 | + }, |
| 173 | + "nbformat": 4, |
| 174 | + "nbformat_minor": 5 |
| 175 | +} |
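
The dataset schema and the `ToolUsageEnvironment` described in the notebook can be illustrated with a small standalone sketch. It does not use `langchain_benchmarks` itself: `SketchEnvironment`, `make_typewriter_environment`, `score_run`, and the example data point are hypothetical names and values made up for illustration, and plain callables stand in for `BaseTool` objects. Treating `order_matters` as `True` when the key is missing, and `state` as a plain string, are also assumptions rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class SketchEnvironment:
    """Simplified stand-in for ToolUsageEnvironment (callables, not BaseTool)."""

    tools: Dict[str, Callable[..., Any]]
    read_state: Optional[Callable[[], Any]] = None


def make_typewriter_environment() -> SketchEnvironment:
    """Hypothetical single-tool typewriter that prints letters on virtual paper."""
    paper: List[str] = []

    def type_letter(letter: str) -> str:
        paper.append(letter)
        return "OK"

    return SketchEnvironment(
        tools={"type_letter": type_letter},
        read_state=lambda: "".join(paper),
    )


def score_run(
    actual_steps: List[str], final_state: Any, outputs: Dict[str, Any]
) -> Dict[str, bool]:
    """Compare a run against the dataset outputs (illustrative scoring only)."""
    expected = outputs["expected_steps"]
    # Assumption: order matters unless the data point says otherwise.
    if outputs.get("order_matters", True):
        steps_match = actual_steps == expected
    else:
        steps_match = sorted(actual_steps) == sorted(expected)
    result = {"steps_match": steps_match}
    # `state` is an optional output key, so only check it when present.
    if "state" in outputs:
        result["state_match"] = final_state == outputs["state"]
    return result


# A made-up data point that follows the schema from the notebook.
example = {
    "inputs": {"question": "abc"},
    "outputs": {
        "expected_steps": ["type_letter", "type_letter", "type_letter"],
        "reference": "abc",
        "state": "abc",
        "order_matters": True,
    },
}

# Stand-in for an agent: call the tool once per letter and record each call.
env = make_typewriter_environment()
actual_steps: List[str] = []
for letter in example["inputs"]["question"]:
    env.tools["type_letter"](letter)
    actual_steps.append("type_letter")

scores = score_run(actual_steps, env.read_state(), example["outputs"])
print(scores)
```

Keeping the paper inside a closure means each call to `make_typewriter_environment` starts from a blank page, which is why an environment factory per data point is convenient: the evaluator can read the final state without earlier runs leaking into it.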