Build and evaluate a multi-agent NBA analytics system with Braintrust.
     User Question
           │
           ▼
    ┌─────────────┐
    │ Supervisor  │  (interprets question, formats response)
    │    Agent    │
    └──────┬──────┘
           │ ask_sql_agent
           ▼
    ┌─────────────┐
    │  SQL Agent  │  (writes & executes SQL queries)
    └──────┬──────┘
           │ run_sql_query / list_tables / describe_table
           ▼
    ┌─────────────┐
    │  SQLite DB  │  (synthetic NBA 2024-25 season data)
    └─────────────┘

- Supervisor Agent: understands basketball analytics questions and delegates to the SQL agent
- SQL Agent: translates questions into SQL, executes queries, and returns results
- Braintrust AI Proxy: all LLM calls route through `api.braintrust.dev/v1/proxy` for automatic tracing
- Braintrust Eval: offline eval suite with custom scorers

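The three tools the SQL agent calls (`run_sql_query`, `list_tables`, `describe_table`) can be pictured as thin wrappers around Python's built-in `sqlite3` module. The sketch below is an illustration only: the function signatures, the read-only guard, the `max_rows` cap, and the tiny in-memory `teams` table are assumptions, not the contents of `tools/sql_tools.py`.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all user tables in the database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def describe_table(conn, table):
    """Return (column_name, declared_type) pairs for one table."""
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    rows = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    return [(row[1], row[2]) for row in rows]


def run_sql_query(conn, sql, max_rows=50):
    """Run a read-only query and return rows as dicts (hypothetical guard and cap)."""
    if not sql.lstrip().lower().startswith(("select", "with")):
        raise ValueError("only read-only queries are allowed")
    cursor = conn.execute(sql)
    columns = [d[0] for d in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchmany(max_rows)]


# Tiny in-memory stand-in for data/nba.db; the column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE teams (team_id INTEGER PRIMARY KEY, name TEXT, conference TEXT)"
)
conn.execute("INSERT INTO teams VALUES (1, 'Boston Celtics', 'East')")

tables = list_tables(conn)
schema = describe_table(conn, "teams")
rows = run_sql_query(conn, "SELECT name FROM teams")
```

Returning rows as dictionaries keeps column names attached to values, which tends to make tool output easier for an LLM to read than bare tuples.
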
- Python 3.10+
- A Braintrust account and API key
1. Clone the repo:

       git clone https://github.com/your-org/agent-evals-workshop.git
       cd agent-evals-workshop

2. Create a virtual environment and install dependencies:

       python -m venv .venv
       source .venv/bin/activate
       pip install -r requirements.txt

3. Set up your environment:

       cp .env.example .env

   - Edit `.env` and add your `BRAINTRUST_API_KEY`
   - Edit `BRAINTRUST_PROJECT` if multiple people are using the same Braintrust account

4. Generate the synthetic database:

       python setup_db.py

Ask any NBA analytics question:

    python run_agent.py "Which player averages the most rebounds per game this season (minimum 10 games played)?"
    python run_agent.py "Who scored the most points this season?"
    python run_agent.py "Which team has the most wins this season?"
    python run_agent.py "Which player averages the most assists per game?"

Traces appear automatically in Braintrust Logs.

Alternatively, you can start a chat with the agent by running:

    python chat.py

Run this script once to upload an LLM-as-judge scorer and configure it to run on `run_sql_query` traces:

    python setup_online_scorer.py

Then run the agent and inspect the scoring span in the Braintrust UI.

Upload the scorers and dataset to Braintrust (only do this once):

    python setup_offline_eval.py

Run the full eval suite with custom scorers:

    python eval/eval_sql_agent.py

This runs the eval cases through the agent and scores each with:

- `data_eval`: checks whether the correct numeric and string values appear in the response
- `sql_eval`: LLM-as-judge check of how similar the generated SQL is to the reference SQL

Results appear in the Braintrust Experiments view.
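
To make the `data_eval` idea concrete, here is a minimal sketch of a scorer that checks whether expected strings and numbers show up in the agent's answer. Everything in it is assumed for illustration: the signature, the `tolerance` parameter, the fractional score, and the sample player name are hypothetical, and the scorer in `eval/scorers.py` may work differently.

```python
import re


def extract_numbers(text):
    """Pull numeric values out of free text, treating "1,234" as 1234."""
    cleaned = re.sub(r"(?<=\d),(?=\d{3})", "", text)
    return [float(m) for m in re.findall(r"-?\d+(?:\.\d+)?", cleaned)]


def data_eval(output, expected_strings, expected_numbers, tolerance=0.05):
    """Return the fraction of expected values found in the output (0.0 to 1.0)."""
    lowered = output.lower()
    found_numbers = extract_numbers(output)

    # String check: case-insensitive substring match.
    checks = [s.lower() in lowered for s in expected_strings]
    # Numeric check: any number in the output within the tolerance.
    checks += [
        any(abs(n - e) <= tolerance for n in found_numbers)
        for e in expected_numbers
    ]
    # Nothing to check counts as a pass in this sketch.
    return sum(checks) / len(checks) if checks else 1.0


# Hypothetical answer and ground truth (the player name is made up).
answer = "Jalen Cruz leads with 11.4 rebounds per game over 38 games."
full_score = data_eval(answer, ["Jalen Cruz"], [11.4, 38])
partial_score = data_eval("Jalen Cruz had 9.8", ["Jalen Cruz"], [11.4])
```

A fractional score, rather than a strict pass/fail, makes partially correct answers visible when comparing experiments.
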
- Make a new online scorer and configure it to run on a particular span or the whole trace
- Set up remote eval so you can run evals from the UI: start with `eval/eval_sql_agent_remote.py` and follow the instructions here
- Make changes to the SQL agent prompt (located in `prompts/`) or tool calls and run offline eval to test the changes

agent-evals-workshop/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
├── setup_db.py # Generate SQLite DB with synthetic NBA data
├── setup_offline_eval.py # Upload scorers and dataset to BT for offline eval
├── setup_online_scorer.py # Upload LLM-as-judge scorer to BT
├── run_agent.py # Invoke agent with a query
├── agents/
│ ├── base_agent.py # Base agent: OpenAI tool-calling loop + tracing
│ ├── sql_agent.py # SQL agent with DB tools
│ └── supervisor_agent.py # Supervisor that delegates to SQL agent
├── tools/
│ └── sql_tools.py # run_sql_query, list_tables, describe_table
├── eval/
│ ├── dataset.json # 12 eval cases with ground truth
│ ├── scorers.py # data_eval + sql_eval scorers
│ ├── eval_sql_agent.py # run offline eval
│ └── eval_sql_agent_remote.py # run remote eval
├── data/
│ └── nba.db # Generated SQLite DB (gitignored)
└── prompts/
  ├── supervisor_prompt.py
  └── sql_prompt.py

The database covers the 2024-25 NBA season (Oct 22, 2024 – Jan 14, 2025) with synthetic data (real team names, fake players and game results).

| Table | Description |
|---|---|
| `teams` | All 30 NBA teams with conference, division, and arena |
| `players` | 450 players (15 per team) with position, college, draft info |
| `games` | 598 games with scores, attendance, and overtime info |
| `rosters` | Player-team assignments for the 2024-25 season |
| `player_game_stats` | Full box score per player per game |
| `team_game_stats` | Team-level aggregates per game (FG%, 3P%, FT%) |
| `seasons` | Season date ranges |

| Question | What it tests |
|---|---|
| Who scored the most points this season? | SUM aggregation, JOIN, ORDER BY |
| Which team has the most wins this season? | Conditional counting, JOIN |
| What is the average team score per game? | AVG aggregation |
| Which player averages the most assists per game? | AVG with HAVING for min games |
| How many games went to overtime? | Filtered COUNT |
| Which conference has more wins this season? | Multi-table JOIN, GROUP BY |
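
As an illustration of the "AVG with HAVING for min games" pattern in the table above, the sketch below runs that kind of query against a tiny in-memory database. The table names `players` and `player_game_stats` come from the schema, but the column names (`player_id`, `name`, `game_id`, `assists`), the sample rows, and the minimum-games threshold are assumptions, so the query may need adjusting against the real `data/nba.db`.

```python
import sqlite3

# In-memory stand-in for the real database; columns and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE players (player_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE player_game_stats (player_id INTEGER, game_id INTEGER, assists INTEGER);
    INSERT INTO players VALUES (1, 'Player A'), (2, 'Player B');
    INSERT INTO player_game_stats VALUES
      (1, 1, 8), (1, 2, 10), (1, 3, 9),
      (2, 1, 15);
    """
)

# HAVING filters groups after aggregation, so players with too few games
# are dropped before the per-game averages are ranked.
QUERY = """
SELECT p.name, AVG(s.assists) AS assists_per_game, COUNT(*) AS games_played
FROM player_game_stats AS s
JOIN players AS p ON p.player_id = s.player_id
GROUP BY p.player_id, p.name
HAVING COUNT(*) >= ?
ORDER BY assists_per_game DESC
LIMIT 1
"""

top_with_minimum = conn.execute(QUERY, (3,)).fetchone()
top_without_minimum = conn.execute(QUERY, (1,)).fetchone()
```

With a three-game minimum the query returns Player A. With the minimum lowered to one game, Player B's single 15-assist game takes the top spot, which is the small-sample distortion the `HAVING` clause guards against.
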