Agent Evals Workshop

Build and evaluate a multi-agent NBA analytics system with Braintrust.

Architecture

 User Question
       │
       ▼
┌─────────────┐
│  Supervisor │  (interprets question, formats response)
│    Agent    │
└──────┬──────┘
       │  ask_sql_agent
       ▼
┌─────────────┐
│  SQL Agent  │  (writes & executes SQL queries)
└──────┬──────┘
       │  run_sql_query / list_tables / describe_table
       ▼
┌─────────────┐
│   SQLite DB │  (synthetic NBA 2024-25 season data)
└─────────────┘

- Supervisor Agent: understands basketball analytics questions and delegates to the SQL agent
- SQL Agent: translates questions into SQL, executes queries, and returns results
- Braintrust AI Proxy: all LLM calls route through api.braintrust.dev/v1/proxy for automatic tracing
- Braintrust Eval: offline eval suite with custom scorers
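
The delegation step in the diagram can be pictured as an ordinary tool call. The following is a minimal sketch, not the workshop's actual code: the schema follows the OpenAI function-calling format and the tool name `ask_sql_agent` comes from the diagram, but the `question` parameter, the `fake_sql_agent` stub, and `dispatch_tool_call` are hypothetical names introduced for illustration.

```python
import json

# Tool schema in the OpenAI function-calling format. The tool name comes from
# the architecture diagram; the "question" parameter is an assumption.
ASK_SQL_AGENT_TOOL = {
    "type": "function",
    "function": {
        "name": "ask_sql_agent",
        "description": "Delegate a data question to the SQL agent.",
        "parameters": {
            "type": "object",
            "properties": {"question": {"type": "string"}},
            "required": ["question"],
        },
    },
}


def fake_sql_agent(question: str) -> str:
    """Hypothetical stand-in for the SQL agent (the real one calls an LLM and the DB)."""
    return f"(SQL agent result for: {question})"


# Map tool names to the Python callables that implement them.
TOOL_REGISTRY = {"ask_sql_agent": fake_sql_agent}


def dispatch_tool_call(name: str, arguments_json: str) -> str:
    """Handle one tool call: look up the tool by name, decode its JSON
    arguments, and return the result as a string for the model."""
    if name not in TOOL_REGISTRY:
        return f"Unknown tool: {name}"
    kwargs = json.loads(arguments_json)
    return TOOL_REGISTRY[name](**kwargs)


if __name__ == "__main__":
    args = json.dumps({"question": "Who leads the league in assists?"})
    print(dispatch_tool_call("ask_sql_agent", args))
```

In a real tool-calling loop, the supervisor's model would emit the tool name and JSON arguments, and the string returned here would be appended to the conversation as the tool result.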

Prerequisites

- Python 3.10+
- A Braintrust account and API key

Setup

Clone the repo:

git clone https://github.com/braintrustdata/agent-evals-workshop.git
cd agent-evals-workshop

Create a virtual environment and install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Set up your environment:
```
cp .env.example .env
```
- Edit .env and set BRAINTRUST_API_KEY to your API key
- Change BRAINTRUST_PROJECT to a unique name if several people share the same Braintrust account
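
Before running anything else, it can help to confirm that the .env file contains what the scripts expect. The sketch below is an assumption-laden illustration using only the standard library: `load_env_file` and `check_required` are hypothetical helpers, and the workshop scripts may load the file differently (for example with a dotenv package).

```python
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines from a .env file, skipping blanks and comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_required(values: dict[str, str], required=("BRAINTRUST_API_KEY",)) -> list[str]:
    """Return the names of required settings that are missing or empty."""
    return [name for name in required if not values.get(name)]


if __name__ == "__main__":
    missing = check_required(load_env_file())
    print("Missing settings:", missing or "none")
```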

Generate the synthetic database:
```
python setup_db.py
```

Running the agent

Ask any NBA analytics question:

python run_agent.py "Which player averages the most rebounds per game this season (minimum 10 games played)?"
python run_agent.py "Who scored the most points this season?"
python run_agent.py "Which team has the most wins this season?"
python run_agent.py "Which player averages the most assists per game?"

Traces appear automatically in Braintrust Logs.
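
As the Architecture section notes, LLM calls are routed through the Braintrust AI Proxy. A minimal sketch of how an OpenAI-compatible client could be pointed at that endpoint is shown below. `proxy_client_kwargs` is a hypothetical helper, and whether the workshop's base agent configures its client exactly this way is an assumption; the code only builds the settings and does not call any API.

```python
import os

# Proxy endpoint named in the Architecture section.
BRAINTRUST_PROXY_URL = "https://api.braintrust.dev/v1/proxy"


def proxy_client_kwargs(env=None) -> dict:
    """Build keyword arguments for an OpenAI-compatible client so that its
    requests go to the Braintrust proxy instead of the provider directly."""
    env = os.environ if env is None else env
    api_key = env.get("BRAINTRUST_API_KEY")
    if not api_key:
        raise RuntimeError("BRAINTRUST_API_KEY is not set; add it to .env")
    return {"base_url": BRAINTRUST_PROXY_URL, "api_key": api_key}


# Usage (requires the `openai` package, which this sketch does not import):
#   from openai import OpenAI
#   client = OpenAI(**proxy_client_kwargs())
```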

Alternatively, start an interactive chat with the agent:

python chat.py

Online scoring

Run this script once to upload an LLM-as-judge scorer and configure it to run on run_sql_query spans:

python setup_online_scorer.py

Run the agent, then inspect the scoring span in the Braintrust UI.

Offline eval

Upload the scorers and dataset to Braintrust (do this only once):

python setup_offline_eval.py

Run the full eval suite with custom scorers:

python eval/eval_sql_agent.py

This runs eval cases through the agent and scores each with:

- data_eval: checks whether the expected numeric and string values appear in the response
- sql_eval: LLM-as-judge scorer that rates how similar the generated SQL is to the reference SQL

Results appear in the Braintrust Experiments view.
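
To make the data_eval idea concrete, here is a rough sketch of a value-matching scorer. It is not the implementation in eval/scorers.py: the function name `data_eval_sketch`, the fractional score, and the 0.05 numeric tolerance are all assumptions chosen for illustration.

```python
import re


def data_eval_sketch(output: str, expected_values: list) -> float:
    """Return the fraction of expected values found in the agent's response.

    Strings match case-insensitively as substrings; numbers match any number
    in the text that is equal within a small tolerance.
    """
    if not expected_values:
        return 1.0
    text = output.lower()
    # Pull every number out of the response, ignoring thousands separators.
    numbers = [
        float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", output.replace(",", ""))
    ]
    hits = 0
    for value in expected_values:
        if isinstance(value, (int, float)):
            # Tolerate rounding differences such as 11.8 vs 11.83.
            if any(abs(n - value) < 0.05 for n in numbers):
                hits += 1
        elif str(value).lower() in text:
            hits += 1
    return hits / len(expected_values)


if __name__ == "__main__":
    response = "Jane Doe leads the league with 11.8 rebounds per game."
    print(data_eval_sketch(response, ["Jane Doe", 11.83]))
```

A scorer like this is cheap and deterministic, which makes it a useful complement to an LLM-as-judge scorer such as sql_eval.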

Explore further

- Create a new online scorer and configure it to run on a particular span or on the whole trace
- Set up remote eval so you can run evals from the UI: start with eval/eval_sql_agent_remote.py and follow the Braintrust remote eval instructions
- Change the SQL agent prompt (located in prompts/) or its tool calls, then rerun the offline eval to test the changes

Project structure

agent-evals-workshop/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
├── setup_db.py                  # Generate SQLite DB with synthetic NBA data
├── setup_offline_eval.py        # Upload scorers and dataset to Braintrust for offline eval
├── setup_online_scorer.py       # Upload LLM-as-judge scorer to Braintrust
├── run_agent.py                 # Invoke agent with a query
├── chat.py                      # Start an interactive chat with the agent
├── agents/
│   ├── base_agent.py            # Base agent: OpenAI tool-calling loop + tracing
│   ├── sql_agent.py             # SQL agent with DB tools
│   └── supervisor_agent.py      # Supervisor that delegates to SQL agent
├── tools/
│   └── sql_tools.py             # run_sql_query, list_tables, describe_table
├── eval/
│   ├── dataset.json             # 12 eval cases with ground truth
│   ├── scorers.py               # data_eval + sql_eval scorers
│   ├── eval_sql_agent.py        # Run offline eval
│   └── eval_sql_agent_remote.py # Run remote eval
├── data/
│   └── nba.db                   # Generated SQLite DB (gitignored)
└── prompts/
    ├── supervisor_prompt.py
    └── sql_prompt.py

Database schema

The database covers the start of the 2024-25 NBA season (Oct 22, 2024 – Jan 14, 2025). The data is synthetic: team names are real, but players and game results are made up.

| Table | Description |
| --- | --- |
| `teams` | All 30 NBA teams with conference, division, and arena |
| `players` | 450 players (15 per team) with position, college, draft info |
| `games` | 598 games with scores, attendance, and overtime info |
| `rosters` | Player-team assignments for the 2024-25 season |
| `player_game_stats` | Full box score per player per game |
| `team_game_stats` | Team-level aggregates per game (FG%, 3P%, FT%) |
| `seasons` | Season date ranges |
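
The SQL agent explores this schema through three tools: list_tables, describe_table, and run_sql_query. The sketch below shows one way such tools could be written with the standard sqlite3 module. It is not the code in tools/sql_tools.py; the signatures, the read-only check, and the row limit are assumptions, and the demo uses an in-memory table with hypothetical columns rather than data/nba.db.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def describe_table(conn: sqlite3.Connection, table: str) -> list[tuple[str, str]]:
    """Return (column_name, declared_type) pairs for one table."""
    if table not in list_tables(conn):
        raise ValueError(f"Unknown table: {table}")
    # PRAGMA cannot take bound parameters, so the name is validated above.
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return [(row[1], row[2]) for row in rows]


def run_sql_query(conn: sqlite3.Connection, sql: str, max_rows: int = 50) -> list[tuple]:
    """Execute a read-only query and return at most max_rows rows."""
    if not sql.lstrip().lower().startswith(("select", "with")):
        raise ValueError("Only SELECT queries are allowed")
    return conn.execute(sql).fetchmany(max_rows)


if __name__ == "__main__":
    # In-memory stand-in for data/nba.db with hypothetical columns.
    demo = sqlite3.connect(":memory:")
    demo.execute(
        "CREATE TABLE teams (team_id INTEGER PRIMARY KEY, name TEXT, conference TEXT)"
    )
    demo.execute("INSERT INTO teams VALUES (1, 'Boston Celtics', 'East')")
    print(list_tables(demo))
    print(describe_table(demo, "teams"))
    print(run_sql_query(demo, "SELECT name FROM teams"))
```

Capping the number of returned rows keeps tool results small enough to fit comfortably in the model's context.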

Sample queries

| Question | What it tests |
| --- | --- |
| Who scored the most points this season? | SUM aggregation, JOIN, ORDER BY |
| Which team has the most wins this season? | Conditional counting, JOIN |
| What is the average team score per game? | AVG aggregation |
| Which player averages the most assists per game? | AVG with HAVING for min games |
| How many games went to overtime? | Filtered COUNT |
| Which conference has more wins this season? | Multi-table JOIN, GROUP BY |
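
As an illustration of the "AVG with HAVING for min games" pattern, the sketch below runs an assists-per-game query against a tiny in-memory database. The column names (`player_id`, `name`, `game_id`, `assists`), the sample rows, and the `MIN_GAMES` threshold are hypothetical; the real schema in data/nba.db and the SQL the agent generates may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified columns; the real schema in data/nba.db may differ.
conn.executescript(
    """
    CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE player_game_stats (player_id INTEGER, game_id INTEGER, assists INTEGER);
    INSERT INTO players VALUES (1, 'Player A'), (2, 'Player B');
    INSERT INTO player_game_stats VALUES
        (1, 101, 8), (1, 102, 10), (1, 103, 9),
        (2, 101, 15);
    """
)

MIN_GAMES = 2  # illustrative minimum-games threshold

query = """
    SELECT p.name, AVG(s.assists) AS assists_per_game, COUNT(*) AS games_played
    FROM player_game_stats AS s
    JOIN players AS p ON p.player_id = s.player_id
    GROUP BY p.player_id, p.name
    HAVING COUNT(*) >= ?
    ORDER BY assists_per_game DESC
    LIMIT 1
"""
top_player = conn.execute(query, (MIN_GAMES,)).fetchone()
print(top_player)
```

Here Player B has the higher single-game figure but is filtered out by the HAVING clause for playing only one game, so the query returns Player A with an average of 9.0 assists over 3 games.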
