An MCP tool schema evaluation framework for testing LLM tool-calling capabilities.
Toolvaluator helps you evaluate how well language models use your FastMCP tools by measuring:
- Correctness: Does the model choose the right tool with the right arguments?
- Latency: How fast does the model make tool-calling decisions?
You can use this to:
- Compare different models to find which works best with your tools
- Optimize your tool schemas and descriptions for better model performance
- Establish quality benchmarks for your MCP tools
- Use simple synthetic tool definitions to see how good a model is at tool-calling
- Chain together multiple tool calls (real or mocked), evaluating along the way
I made this to fill a gap in my toolset: I kept building MCP servers but had no good way to act when models stumbled over my tool definitions. The tool name and organization, along with the signature and docstring, are crucial to making sure models figure out my tools on their own. In practice, I've seen wildly different behavior from model to model. The accuracy of the model's decisions matters most; latency is a secondary concern, but there are situations where I need to string together a bunch of tool calls and want to see how many I can fit into a unit of time.
In the basic test flow, the model never actually calls the tool; we only measure the model's decision. Toolvaluator reads the MCP tool definition from your server and presents it to the model, just like when your AI client registers your tools with the model.
In the more advanced chained test flow, you have the option to actually call your tools. This lets you evaluate subsequent tool calls, nudging the model toward them with each tool's result.
The primary purpose of this project is to test the quality of your tool names, descriptions, and arguments, since this is the data first registered with the model. The simple, non-chained test workflow covers this.
The chained test flow is most useful when you want to test the path a model follows through your toolset. You can mock a tool call or actually call the tool (and deal with its side effects). The purpose is not to test the tool itself, just to get the required information (the tool response) back into the model's context so it can influence subsequent tool call decisions.
# Install the package
uv pip install toolvaluator
# Or install from source with dev dependencies
git clone https://github.com/cwbooth5/toolvaluator.git
cd toolvaluator
uv pip install -e ".[dev]"
Or, for an editable install as a uv tool:
uv tool install -e .
You can also install the tool straight from the GitHub repo:
uv tool install --from https://github.com/cwbooth5/toolvaluator.git toolvaluator
Using pip instead:
pip install toolvaluator
# Or install from source with dev dependencies
git clone https://github.com/cwbooth5/toolvaluator.git
cd toolvaluator
pip install -e ".[dev]"First, create a FastMCP server with your tools (e.g., my_server.py):
from fastmcp import FastMCP
mcp = FastMCP("My Server")
@mcp.tool()
def search_docs(query: str) -> str:
"""Search through company documentation."""
return f"Found documents matching '{query}'"
if __name__ == "__main__":
mcp.run()# Evaluate with OpenAI GPT-4o-mini (default)
toolvaluator --server my_server
# Use a different model
toolvaluator --server my_server --model gpt-4o
# Use a local/OSS model
toolvaluator --server my_server --model llama-3.1 --base-url http://localhost:1234/v1 --api-key none
# Set a minimum score threshold (useful for CI/CD)
toolvaluator --server my_server --min-score 0.85
For testing your own MCP tools, use the kickstart tool to generate a standalone evaluation script:
# Generate a custom evaluation script for your MCP server
toolvaluator-init \
  --server my_server \
  --server-var mcp \
  --output eval_my_tools.py
# This auto-detects your tools and creates a template script
# Edit eval_my_tools.py to customize the evaluation examples
# Run your custom evaluation
python eval_my_tools.py --model gpt-4o-mini --verbose
# Use in CI/CD with quality gates
python eval_my_tools.py --model gpt-4o --min-score 0.85
What is a "dataset"? The dataset is a collection of evaluation examples - test cases that verify if the model can:
- Decide when to call a tool (`should_call`)
- Choose the right tool (`tool_name`)
- Extract correct arguments (`arguments`)
Example evaluation example:
dspy.Example(
    user_query="Find the company vacation policy",
    tool_name="search_docs",
    tool_description="Search through company documentation",
    tool_schema=schema,
    expected_should_call=True,
    expected_tool_name="search_docs",
    expected_arguments={"query": "vacation policy"},
).with_inputs("user_query", "tool_name", "tool_description", "tool_schema")
Each example is scored 0-1 across these three dimensions, and the overall score is the average.
You can also use toolvaluator programmatically in your own scripts:
from toolvaluator import (
    build_dataset,
    eval_model,
    get_tool_schemas_sync,
    extract_input_schema,
    extract_tool_description
)
import dspy
from my_server import mcp
# Fetch tool schemas
tool_schemas = get_tool_schemas_sync(mcp)
# Create custom evaluation examples
examples = []
if "my_tool" in tool_schemas:
tool_name = "my_tool"
tool_description = extract_tool_description(tool_schemas[tool_name])
tool_schema = extract_input_schema(tool_schemas[tool_name])
examples.append(
dspy.Example(
user_query="Use my tool to process ABC",
tool_name=tool_name,
tool_description=tool_description,
tool_schema=tool_schema,
expected_should_call=True,
expected_tool_name=tool_name,
expected_arguments={"input": "ABC"},
).with_inputs("user_query", "tool_name", "tool_description", "tool_schema")
)
# Run evaluation
result = eval_model(
model_name="gpt-4o-mini",
api_key="your-api-key",
base_url=None,
dataset=examples,
verbose=True,
)
print(f"Score: {result['score']:.3f}")For even cleaner code, use the ExampleBuilder helper class to reduce boilerplate:
from toolvaluator import ExampleBuilder, get_tool_schemas_sync, eval_model
from my_server import mcp
# Fetch tool schemas
tool_schemas = get_tool_schemas_sync(mcp)
# Create builder
builder = ExampleBuilder(tool_schemas)
# Add examples with minimal boilerplate
builder.add_positive(
    tool="search_docs",
    query="Find our vacation policy",
    arguments={"query": "vacation policy"}
)
builder.add_negative(
    tool="search_docs",
    query="What is 2+2?"  # Should answer directly
)
# Method chaining works too!
builder.add_positive(
    tool="get_weather",
    query="Weather in Tokyo?",
    arguments={"location": "Tokyo", "units": None}  # None = wildcard
).add_positive(
    tool="calculate",
    query="What is 5 * 10?",
    arguments={"operation": "multiply", "a": 5, "b": 10}
)
# Run evaluation
result = eval_model(
    model_name="gpt-4o-mini",
    api_key="your-api-key",
    dataset=builder.examples,
    verbose=True,
)
Key benefits:
- No manual schema extraction: Automatically extracts tool_name, tool_description, and tool_schema
- Cleaner syntax: Focus on query and expected arguments, not boilerplate
- Method chaining: Fluently build test suites
- Built-in validation: Raises errors if tool doesn't exist
- Convenience methods: `add_positive()` and `add_negative()` for common cases
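For instance, the built-in validation can be exercised like this (the exact exception type isn't documented here, so the sketch catches broadly; `no_such_tool` is a deliberately bogus name):

```python
# Illustrative only: a tool name that isn't in tool_schemas should be rejected
try:
    builder.add_positive(tool="no_such_tool", query="Do something", arguments={})
except Exception as err:  # exact exception type not specified above
    print(f"Rejected unknown tool: {err}")
```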
toolvaluator/
├── src/
│ └── toolvaluator/
│ ├── __init__.py # Package initialization with version
│ ├── cli.py # Command-line interface
│ ├── evaluator.py # Core evaluation logic
│ └── test_server.py # Example MCP server for testing
├── tests/
│ ├── __init__.py
│ ├── conftest.py # Pytest fixtures
│ ├── test_evaluator.py # Tests for evaluator module
│ └── test_server.py # Tests for test server
├── pyproject.toml # Project configuration
└── README.md
# Clone the repository
git clone https://github.com/cwbooth5/toolvaluator.git
cd toolvaluator
# Install with dev dependencies using uv
uv pip install -e ".[dev]"
# Or using pip
pip install -e ".[dev]"
# Run all tests
pytest
# Run with coverage
pytest --cov=toolvaluator --cov-report=html
# Run specific test file
pytest tests/test_evaluator.py
# Format code with black
black src tests
# Lint with ruff
ruff check src tests
# Fix auto-fixable issues
ruff check --fix src tests
# Build the package
uv build
# Or using hatch
hatch build
# This creates dist/toolvaluator-*.whl and dist/toolvaluator-*.tar.gz
The version is managed by Hatch and stored in src/toolvaluator/__init__.py.
To update the version:
# Edit src/toolvaluator/__init__.py
__version__ = "0.2.0"
Then rebuild:
uv build
The package includes a test MCP server with example tools:
# Run evaluation against the test server
toolvaluator --server toolvaluator.test_server
# Or use it programmatically
python -c "from toolvaluator.test_server import mcp; print(mcp)"The test server includes these tools:
search_docs(query)- Search documentationget_weather(location, units)- Get weather informationcalculate(operation, a, b)- Perform calculationssend_email(to, subject, body, cc)- Send emailscreate_task(title, description, priority, due_date)- Create tasks
- Schema Extraction: Fetches tool schemas from your FastMCP server using the MCP protocol
- Dataset Creation: Creates test examples with user queries and expected tool-calling behavior
- LLM Evaluation: Uses DSPy to prompt the model to decide:
- Should the tool be called?
- Which tool should be called?
- What arguments should be passed?
- Scoring: Compares the model's decisions against expected behavior across three dimensions:
- Should-call correctness (binary: did it correctly decide to use/not use the tool?)
- Tool name correctness (binary: did it pick the right tool?)
- Argument correctness (fraction: how many arguments matched?)
- Metrics: Reports overall correctness score (0-1) and latency statistics (mean, p50, p95, max)
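One plausible reading of how those dimension scores roll up into a single per-example number (a sketch, not the package's internal code):

```python
def example_score(should_call_ok: bool, tool_name_ok: bool, argument_score: float) -> float:
    """Average the three dimensions described above into one 0-1 score."""
    return (float(should_call_ok) + float(tool_name_ok) + argument_score) / 3.0

# Right decision, right tool, half the arguments matched -> ~0.83
print(example_score(True, True, 0.5))
```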
You can use None as a wildcard value in expected_arguments to indicate that a specific argument must exist but you don't care about its value. This is useful when:
- You want to verify the model extracts a parameter but the exact value varies
- Testing argument presence without caring about content
- The value is dynamic (timestamps, IDs, generated text, etc.)
Examples:
# Exact value checking (strict)
expected_arguments = {
    "location": "Tokyo",
    "units": "celsius"
}
# Model must return exactly: location="Tokyo", units="celsius"
# Wildcard value checking (flexible)
expected_arguments = {
    "location": None,  # Any location is acceptable
    "units": "celsius"  # Must be exactly "celsius"
}
# Model can return: location="Tokyo" ✓, location="Paris" ✓, etc.
# Mixed approach
expected_arguments = {
    "to": None,  # Any email address
    "subject": "Meeting",  # Must be exactly "Meeting"
    "body": None  # Any body text
}
Important distinctions:
- `expected_arguments=None` → Don't care about any arguments at all (score 1.0)
- `expected_arguments={}` → Expect NO arguments (score 0.0 if model provides any)
- `expected_arguments={"key": None}` → Expect "key" to exist with any value
The argument comparison uses sophisticated scoring that:
- Exact matches: Each correctly matched argument contributes `1/n` to the score (where n = number of expected arguments)
- Wildcards: Arguments with `None` values count as matched if the key exists
- Missing keys: Arguments that should exist but don't contribute `0/n` to the score
- Extra keys penalty: Unexpected arguments incur a `-0.5/n` penalty per extra key
- Final score: `max(0.0, base_score - extra_penalty)`
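A minimal sketch of these rules as a standalone function (illustrative; the library's actual implementation may differ in details):

```python
def score_arguments(expected, predicted):
    """Score predicted arguments against expected ones using the rules above."""
    if expected is None:   # don't care about arguments at all
        return 1.0
    if not expected:       # an empty dict means no arguments expected
        return 0.0 if predicted else 1.0
    n = len(expected)
    matched = sum(
        1 for key, value in expected.items()
        if key in predicted and (value is None or predicted[key] == value)
    )
    extra = len(set(predicted) - set(expected))
    return max(0.0, matched / n - 0.5 * extra / n)

# Matches the "extra arguments penalty" example below: 2/2 matched, 2 extras -> 0.5
print(score_arguments({"arg1": "val1", "arg2": 42},
                      {"arg1": "val1", "arg2": 42, "extra1": "foo", "extra2": "bar"}))
```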
Examples:
# Perfect match
expected = {"arg1": "val1", "arg2": 42}
predicted = {"arg1": "val1", "arg2": 42}
# Score: 1.0
# Partial match
expected = {"arg1": "val1", "arg2": 42}
predicted = {"arg1": "val1", "arg2": 99}
# Score: 0.5 (only arg1 matched)
# Extra arguments penalty
expected = {"arg1": "val1", "arg2": 42}
predicted = {"arg1": "val1", "arg2": 42, "extra1": "foo", "extra2": "bar"}
# Base: 2/2 = 1.0, Penalty: (2 * 0.5) / 2 = 0.5, Final: 0.5
# Empty dict means "no arguments expected"
expected = {}
predicted = {"arg1": "val1"}
# Score: 0.0 (model should not have provided arguments)
The evaluator provides the model with full tool context to prevent hallucination:
- Tool name: The exact name of the tool being evaluated
- Tool description: What the tool does
- Tool schema: Parameter definitions and types
This prevents the model from hallucinating tool names or misunderstanding tool purposes. Each evaluation asks: "Given this specific tool and this query, should the tool be called and with what arguments?"
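Conceptually, the decision the model is asked to make resembles the DSPy signature below (field names and types are illustrative, not necessarily the signature the package actually defines):

```python
import dspy

class ToolCallDecision(dspy.Signature):
    """Given one tool's definition and a user query, decide whether and how to call it."""

    user_query: str = dspy.InputField()
    tool_name: str = dspy.InputField()
    tool_description: str = dspy.InputField()
    tool_schema: str = dspy.InputField(desc="JSON schema of the tool's parameters")

    should_call: bool = dspy.OutputField()
    chosen_tool: str = dspy.OutputField(desc="Name of the tool to call, if any")
    arguments: dict = dspy.OutputField(desc="Arguments to pass to the tool")
```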
By default, evaluations do not use system prompts. However, you can optionally provide system prompts at two levels with a priority system:
Priority: Example/Step-level > Eval-level > None (default)
Regular Evaluation:
# Build dataset with optional per-example system prompts
builder = ExampleBuilder(tool_schemas)
# Example 1: No system prompt (will use eval-level if provided)
builder.add_positive(
    tool="calculate",
    query="What is 5 + 10?",
    arguments={"operation": "add", "a": 5, "b": 10}
)
# Example 2: Custom system prompt (overrides eval-level)
builder.add_positive(
    tool="search_docs",
    query="Find our vacation policy",
    arguments={"query": "vacation"},
    system_prompt="You are a helpful HR assistant."  # Example-level
)
dataset = builder.build()
# Evaluate with optional default system prompt
result = eval_model(
    model_name="gpt-4o-mini",
    api_key="your-key",
    base_url=None,
    dataset=dataset,
    system_prompt="You are a helpful assistant."  # Eval-level (optional)
)
Chained Evaluation:
# Build chained dataset with optional per-step system prompts
builder = ChainedExampleBuilder(tool_schemas, mcp)
builder.add_chain(
    mocks={}
).add_step(
    initial_query="Calculate 5 * 10",
    expected_tool="calculate",
    expected_arguments={"operation": "multiply", "a": 5, "b": 10},
    system_prompt="You are a math expert."  # Step-level (optional)
)
dataset = builder.build()
# Evaluate with optional default system prompt
result = eval_chained_model(
    model_name="gpt-4o-mini",
    api_key="your-key",
    dataset=dataset,
    system_prompt="You are a helpful assistant."  # Eval-level (optional)
)
How it works:
- If an example/step has its own `system_prompt`, that is used
- Otherwise, if the eval function has a `system_prompt`, that is used
- Otherwise, no system prompt is used (default behavior)
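In other words, the resolution amounts to something like this (sketch):

```python
def resolve_system_prompt(example_prompt, eval_prompt):
    """Example/step-level prompt wins, then the eval-level default, then no system prompt."""
    if example_prompt is not None:
        return example_prompt
    if eval_prompt is not None:
        return eval_prompt
    return None
```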
Use cases:
- Test how system prompts affect tool-calling behavior
- Compare model performance with different system prompts
- Provide role-specific context for specific examples
WARNING: This feature actually executes tools on your MCP server!
For testing multi-step workflows where one tool's output feeds into the next, use the same pattern as regular evaluation:
Step 1: Build Dataset
from toolvaluator import ChainedExampleBuilder, eval_chained_model, get_tool_schemas_sync
from my_server import mcp
tool_schemas = get_tool_schemas_sync(mcp)
# Build chained test dataset (same pattern as ExampleBuilder!)
builder = ChainedExampleBuilder(tool_schemas, mcp)
# Add a chain (sequence of tool calls)
builder.add_chain(
    mocks={}  # Optional: mock tools to avoid side effects
).add_step(
    initial_query="Calculate 15 * 23, then divide the result by 5",
    expected_tool="calculate",
    expected_arguments={"operation": "multiply", "a": 15, "b": 23}
).add_step(
    expected_tool="calculate",
    expected_arguments={"operation": "divide", "a": None, "b": 5}  # 'a' comes from step 1
)
# Build dataset
dataset = builder.build()
Step 2: Evaluate with Model
# Evaluate with model config (same pattern as eval_model!)
result = eval_chained_model(
    model_name="gpt-4o-mini",
    api_key="your-api-key",
    dataset=dataset,
    verbose=True
)
print(f"Overall score: {result['score']:.2f}")
print(f"Steps completed: {result['num_steps']}")
for chain in result['chain_results']:
    for step in chain['step_results']:
        print(f"  Step {step['step']}: {step['predicted_tool']} → {step['tool_result']}")
Test Same Dataset with Multiple Models:
# Test with GPT-4o-mini
result1 = eval_chained_model(
    model_name="gpt-4o-mini",
    api_key="openai-key",
    dataset=dataset
)
# Test with Claude (same dataset!)
result2 = eval_chained_model(
    model_name="claude-3-sonnet",
    api_key="anthropic-key",
    dataset=dataset
)
# Test with local model
result3 = eval_chained_model(
    model_name="local-model",
    api_key="lm-studio",
    base_url="http://localhost:1234/v1",
    dataset=dataset
)
print(f"GPT-4o-mini score: {result1['score']:.2f}")
print(f"Claude score: {result2['score']:.2f}")
print(f"Local model score: {result3['score']:.2f}")
Using Mocks to Avoid Side Effects:
Mocks go in the dataset (test configuration), not the eval function (model configuration):
# Add mocks when building the dataset
builder.add_chain(
    mocks={
        "calculate": lambda args: str(args["a"] + args["b"]),  # Callable mock
        "search_docs": "Here are the relevant documents...",  # Static mock
    }
).add_step(
    initial_query="Calculate 5 + 10",
    expected_tool="calculate",
    expected_arguments={"operation": "add", "a": 5, "b": 10}
).add_step(
    expected_tool="search_docs",
    expected_arguments={"query": None},
    mock_result="Custom result for this step"  # Per-step override
)
dataset = builder.build()
# Evaluate - no tools executed because of mocks!
result = eval_chained_model(
    model_name="gpt-4o-mini",
    api_key="your-key",
    dataset=dataset
)
Mock Priority:
- Per-step `mock_result` (highest priority)
- Global `mocks` dict
- Actual tool execution (lowest priority)
Mock Types:
- Static string: `mocks={"tool": "result"}` - Always returns "result"
- Callable: `mocks={"tool": lambda args: ...}` - Computes result from arguments
- Per-step override: `add_step(..., mock_result="...")` - Overrides global mock
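Combining the priority order and mock types, resolution for a single step behaves roughly like this (sketch only; `call_tool` stands in for real execution on your MCP server):

```python
def resolve_tool_result(step_mock, chain_mocks, tool_name, arguments, call_tool):
    """Return the result for one step: per-step mock, then global mock, then the real tool."""
    if step_mock is not None:          # per-step mock_result wins
        return step_mock
    mock = chain_mocks.get(tool_name)  # global mocks dict
    if callable(mock):                 # callable mock computes a result from the arguments
        return mock(arguments)
    if mock is not None:               # static string mock
        return mock
    return call_tool(tool_name, arguments)  # no mock: execute the real tool
```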
Multiple Chains:
You can add multiple chains to the same dataset:
builder = ChainedExampleBuilder(tool_schemas, mcp)
# Chain 1: Calculate workflow
builder.add_chain(
    mocks={"calculate": lambda args: str(args["a"] * args["b"])}
).add_step(
    initial_query="Calculate 5 * 10, then add 25",
    expected_tool="calculate",
    expected_arguments={"operation": "multiply", "a": 5, "b": 10}
).add_step(
    expected_tool="calculate",
    expected_arguments={"operation": "add", "a": None, "b": 25}
)
# Chain 2: Weather workflow
builder.add_chain(
    mocks={"get_weather": "72°F, sunny"}
).add_step(
    initial_query="Get weather for San Francisco",
    expected_tool="get_weather",
    expected_arguments={"location": "San Francisco", "units": None}
)
# Evaluate all chains with one call
dataset = builder.build()
result = eval_chained_model(
    model_name="gpt-4o-mini",
    api_key="your-key",
    dataset=dataset
)
print(f"Evaluated {result['num_chains']} chains")
print(f"Overall score: {result['score']:.2f}")
API Consistency:
The chained evaluation API follows the exact same pattern as regular evaluation:
| Step | Regular Evaluation | Chained Evaluation |
|---|---|---|
| 1. Build dataset | `ExampleBuilder(...)` | `ChainedExampleBuilder(...)` |
| 2. Add tests | `.add_positive(...)` | `.add_chain().add_step(...)` |
| 3. Create dataset | `.build()` | `.build()` |
| 4. Evaluate | `eval_model(model, key, dataset)` | `eval_chained_model(model, key, dataset)` |
Key Principle: Dataset = what to test, Eval function = which model to test with
How it works:
- Model receives initial query
- Model decides which tool to call (step 1)
- Tool is executed (or mocked if specified)
- Result is fed back to model as context
- Model decides next tool call (step 2)
- Process continues for all steps
- Each step is evaluated independently
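In pseudocode, a single chain runs roughly like this (a sketch of the flow above, not the package's actual implementation; the three callables are placeholders):

```python
def run_chain(initial_query, steps, ask_model, execute_or_mock, score_step):
    """Walk a chain: ask the model, run (or mock) each tool, feed results back, score every step."""
    context = [("user", initial_query)]
    scores = []
    for step in steps:
        decision = ask_model(context)                                # which tool, with which arguments?
        result = execute_or_mock(decision, step)                     # real call or mock, per the priority rules
        context.append((decision.tool, decision.arguments, result))  # result feeds the next decision
        scores.append(score_step(decision, step))                    # each step is scored independently
    return sum(scores) / len(scores)
```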
Side Effects & Warnings:
Tools are executed for real - This is not a simulation!
- Data modification: Tools may create, update, or delete data
- Network calls: APIs may be called, emails sent, etc.
- Resource consumption: Database queries, file operations, etc.
- Costs: API calls to external services may incur charges
- Idempotency: Running the chain multiple times may produce different results
Best practices:
- Use test/staging MCP servers, not production
- Implement mock/test versions of tools with side effects
- Use tools that are safe to execute multiple times
- Log all tool executions for audit trails
- Consider implementing dry-run mode in your tools
When to use:
- Testing multi-step agent workflows
- Validating tool orchestration logic
- Integration testing of tool chains
- Debugging complex tool interactions
When NOT to use:
- Production data or services
- Tools with irreversible side effects
- Financial transactions or critical operations
- Unless you fully understand the implications!
Run evaluations using the built-in test dataset:
toolvaluator [OPTIONS]
Options:
- `--model`: Model name (default: `gpt-4o-mini`)
- `--api-key`: API key (defaults to `OPENAI_API_KEY` env var)
- `--base-url`: Base URL for OpenAI-compatible endpoints (e.g., `http://localhost:1234/v1`)
- `--min-score`: Minimum acceptable score 0-1 (exits with code 1 if below threshold)
- `--server`: Python module containing your FastMCP server (default: `server`)
- `--server-var`: Name of the FastMCP instance variable (default: `mcp`)
  - Use this if your server uses a different variable name like `app` or `server`
- `--verbose`, `-v`: Show detailed debug information for each example
Examples:
# Basic usage
toolvaluator --server my_server
# Custom variable name
toolvaluator --server my_server --server-var app
# Local model with verbose output
toolvaluator --server my_server \
  --model devstral-small \
  --base-url http://localhost:1234/v1 \
  --api-key lm-studio \
  --verbose
Generate custom evaluation scripts:
toolvaluator-init [OPTIONS]
Options:
- `--server`: Python module containing your MCP server (required)
- `--server-var`: Name of the FastMCP instance variable (default: `mcp`)
- `--output`, `-o`: Output filename (default: `eval_tools.py`)
- `--tools`: Tool names to generate examples for (auto-detected if not specified)
- `--force`, `-f`: Overwrite output file if it exists
Examples:
# Auto-detect tools and generate script
toolvaluator-init --server my_server --output eval_my_tools.py
# Specify custom variable name
toolvaluator-init --server my_server --server-var app --output eval.py
# Specify specific tools
toolvaluator-init --server my_server --tools tool1 tool2 tool3
# Overwrite existing file
toolvaluator-init --server my_server --output eval.py --force
Environment variables:
- `OPENAI_API_KEY`: API key for OpenAI models
Test which model works best with your tools:
toolvaluator --server my_server --model gpt-4o
toolvaluator --server my_server --model gpt-4o-mini
toolvaluator --server my_server --model claude-3-5-sonnet
NOTE: If you're using a base_url to point to a custom hosted model, we assume an OpenAI provider or an OpenAI-style API is being served. This happens to be what LM Studio and Ollama provide.
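For example, Ollama's OpenAI-compatible endpoint can be targeted the same way (the model name and port below are typical defaults; adjust for your setup):

```bash
# Evaluate against a locally served model via an OpenAI-style endpoint
toolvaluator --server my_server \
  --model llama3.1 \
  --base-url http://localhost:11434/v1 \
  --api-key none
```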
Iterate on tool descriptions and schemas to improve model accuracy:
# Before: vague description
@mcp.tool()
def search(q: str) -> str:
    """Search stuff."""
    ...

# After: clear, specific description
@mcp.tool()
def search_docs(query: str) -> str:
    """
    Search through company documentation including policies,
    procedures, and internal wikis.

    Args:
        query: Search keywords or natural language question
    """
    ...
Use generated evaluation scripts in your CI/CD pipeline to ensure tool quality:
Step 1: Generate evaluation script (one-time)
toolvaluator-init --server my_server --output eval_tools.py
# Edit eval_tools.py to add your evaluation examples
# Commit eval_tools.py to your repository
Step 2: Add to your CI pipeline
GitHub Actions example (.github/workflows/test.yml):
name: Test MCP Tools
on: [push, pull_request]
jobs:
  test-tools:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: |
          pip install toolvaluator
          pip install -r requirements.txt  # Your project dependencies
      - name: Evaluate MCP Tools
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python eval_tools.py --model gpt-4o-mini --min-score 0.85 --verbose
Using the CLI directly:
# Fail the build if tool-calling accuracy drops below 85%
toolvaluator --server my_server --min-score 0.85
Benefits:
- Catch regressions in tool descriptions or schemas
- Ensure model compatibility before deployment
- Track quality metrics over time
- Prevent degradation of tool-calling accuracy
Track tool-calling latency across model versions or configurations:
# Compare latency between models
python eval_tools.py --model gpt-4o-mini --verbose > results_mini.txt
python eval_tools.py --model gpt-4o --verbose > results_4o.txt
# Analyze latency stats from output
grep "Latency stats" results_*.txtCreate specialized evaluation scripts for different scenarios:
# Generate evaluation for production tools
toolvaluator-init --server prod_server --output eval_prod.py
# Generate evaluation for experimental tools
toolvaluator-init --server experimental_server --output eval_experimental.py
# Run both in your test suite
python eval_prod.py --model gpt-4o --min-score 0.90 # High bar for prod
python eval_experimental.py --model gpt-4o --min-score 0.70  # Lower bar for experiments
Contributions are welcome! Please feel free to submit a Pull Request.
MIT License
Built with: FastMCP and DSPy