Skip to content

Hands-on workshop: build and evaluate a multi-agent NBA analytics system with Braintrust

Notifications You must be signed in to change notification settings

braintrustdata/agent-evals-workshop

Repository files navigation

Agent Evals Workshop

Build and evaluate a multi-agent NBA analytics system with Braintrust.

Architecture

 User Question
       │
       ▼
┌─────────────┐
│  Supervisor │  (interprets question, formats response)
│    Agent    │
└──────┬──────┘
       │  ask_sql_agent
       ▼
┌─────────────┐
│  SQL Agent  │  (writes & executes SQL queries)
└──────┬──────┘
       │  run_sql_query / list_tables / describe_table
       ▼
┌─────────────┐
│   SQLite DB │  (synthetic NBA 2024-25 season data)
└─────────────┘
  • Supervisor Agent — understands basketball analytics questions and delegates to the SQL agent
  • SQL Agent — translates questions into SQL, executes queries, returns results
  • Braintrust AI Proxy — all LLM calls route through api.braintrust.dev/v1/proxy for automatic tracing
  • Braintrust Eval — offline eval suite with custom scorers

Prerequisites

Setup

  1. Clone the repo:

    git clone https://github.com/your-org/agent-evals-workshop.git
    cd agent-evals-workshop
  2. Create a virtual environment and install dependencies:

    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  3. Set up your environment:

    cp .env.example .env
    • Edit .env and add your BRAINTRUST_API_KEY
    • Edit BRAINTRUST_PROJECT if multiple people are using the same Braintrust account
  4. Generate the synthetic database:

    python setup_db.py

Running the agent

Ask any NBA analytics question:

python run_agent.py "Which player averages the most rebounds per game this season (minimum 10 games played)?"
python run_agent.py "Who scored the most points this season?"
python run_agent.py "Which team has the most wins this season?"
python run_agent.py "Which player averages the most assists per game?"

Traces appear automatically in Braintrust Logs.

Alternatively you can start a chat with the agent by running:

python chat.py

Online scoring

Run this script once to upload an LLM-as-judge scorer and configure it to run on run_sql_query traces.

python setup_online_scorer.py

Run the agent and inspect scoring span in the Braintrust UI.

Offline eval

Upload scorers and dataset to braintrust (only do this once)

python setup_offline_eval.py

Run the full eval suite with custom scorers:

python eval/eval_sql_agent.py

This runs eval cases through the agent and scores each with:

  • data_eval — checks if correct numeric and string values appear in the response
  • sql_eval — LLM-as-Judge to check similarity of the generated SQL vs. reference SQL

Results appear in the Braintrust Experiments view.

Explore further

  • Make a new online scorer and configure it to run on a particular span or the whole trace
  • Set up remote eval so you can run evals from the UI - start with eval/eval_sql_agent_remote.py and follow the instructions here
  • Make changes to the SQL agent prompt (located in prompts/) or tool calls and run offline eval to test the changes

Project structure

agent-evals-workshop/
├── README.md
├── requirements.txt
├── .env.example
├── .gitignore
├── setup_db.py                  # Generate SQLite DB with synthetic NBA data
├── setup_offline_eval.py        # Upload scorers and dataset to BT for offline eval
├── setup_online_scorer.py       # Upload LLM-as-judge scorer to BT
├── run_agent.py                 # Invoke agent with a query
├── agents/
│   ├── base_agent.py            # Base agent: OpenAI tool-calling loop + tracing
│   ├── sql_agent.py             # SQL agent with DB tools
│   └── supervisor_agent.py      # Supervisor that delegates to SQL agent
├── tools/
│   └── sql_tools.py             # run_sql_query, list_tables, describe_table
├── eval/
│   ├── dataset.json             # 12 eval cases with ground truth
│   ├── scorers.py               # data_eval + sql_eval scorers
│   ├── eval_sql_agent.py        # run offline eval
│   └── eval_sql_agent_remote.py # run remote eval
├── data/
│   └── nba.db                   # Generated SQLite DB (gitignored)
└── prompts/
    ├── supervisor_prompt.py
    └── sql_prompt.py

Database schema

The database covers the 2024-25 NBA season (Oct 22, 2024 – Jan 14, 2025) with synthetic data (real team names, fake players and game results).

Table Description
teams All 30 NBA teams with conference, division, and arena
players 450 players (15 per team) with position, college, draft info
games 598 games with scores, attendance, and overtime info
rosters Player-team assignments for the 2024-25 season
player_game_stats Full box score per player per game
team_game_stats Team-level aggregates per game (FG%, 3P%, FT%)
seasons Season date ranges

Sample queries

Question What it tests
Who scored the most points this season? SUM aggregation, JOIN, ORDER BY
Which team has the most wins this season? Conditional counting, JOIN
What is the average team score per game? AVG aggregation
Which player averages the most assists per game? AVG with HAVING for min games
How many games went to overtime? Filtered COUNT
Which conference has more wins this season? Multi-table JOIN, GROUP BY

About

Hands-on workshop: build and evaluate a multi-agent NBA analytics system with Braintrust

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •  

Languages