Eval Concepts

Introduction

This repository contains notebooks demonstrating evaluation concepts for LangGraph agents using LangSmith. The notebooks cover different types of evaluations:

Email Agent Evaluations (email_basic.ipynb, email_mcp.ipynb): Demonstrates evaluation concepts with an email assistant agent that can triage emails and respond appropriately
Multi-Agent Evaluations (multi_thread.ipynb): Demonstrates multi-turn evaluation concepts with a customer service multi-agent system

Pre-work

Create .env file

Create a .env file with the necessary environment variables (e.g., LANGCHAIN_API_KEY, OPENAI_API_KEY, etc.) to run the applications.

Install dependencies

Create a virtual environment

python3 -m venv .venv
source .venv/bin/activate

Install dependencies

pip install -r requirements.txt

Then you're ready to run the notebooks!

The Notebooks

Email Agent Evaluations

`email_basic.ipynb`

This notebook demonstrates three types of evaluations with a basic email assistant agent:

Final Response Evaluations: Evaluating the complete agent output against success criteria
Single Step Evaluations: Evaluating individual steps (e.g., triage classification)
Trajectory Evaluations: Evaluating the sequence of tool calls made by the agent

The email agent (agents/email_basic.py) consists of:

A triage step that classifies emails as "ignore", "respond", or "notify"
A response step that takes actions like checking calendar availability, scheduling meetings, and writing emails

`email_mcp.ipynb`

Similar to email_basic.ipynb, but uses the Model Context Protocol (MCP) version of the email agent (agents/email_mcp.py). This demonstrates how to evaluate agents that use MCP for tool integration.

Multi-Agent Evaluations

`multi_thread.ipynb`

This notebook demonstrates multi-turn evaluations using OpenEvals' simulation capabilities. The multi-agent system (agents/multi_basic.py) is a customer service assistant for a digital music store with:

A supervisor agent that routes queries to specialized sub-agents
Invoice sub-agent: Handles invoice-related queries
Music sub-agent: Handles music catalog queries

The notebook shows how to:

Create simulated user personas
Run multi-turn conversation simulations
Evaluate conversations across multiple turns using various metrics (resolution, satisfaction, professionalism, number of turns)

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
agents		agents
images		images
notebooks		notebooks
tools		tools
utils		utils
.env.example		.env.example
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
config.py		config.py
langgraph.json		langgraph.json
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Eval Concepts

Introduction

Pre-work

Create .env file

Install dependencies

The Notebooks

Email Agent Evaluations

`email_basic.ipynb`

`email_mcp.ipynb`

Multi-Agent Evaluations

`multi_thread.ipynb`

About

Uh oh!

Releases

Packages

Languages

License

langchain-samples/eval-concepts

Folders and files

Latest commit

History

Repository files navigation

Eval Concepts

Introduction

Pre-work

Create .env file

Install dependencies

The Notebooks

Email Agent Evaluations

email_basic.ipynb

email_mcp.ipynb

Multi-Agent Evaluations

multi_thread.ipynb

About

Resources

License

Contributing

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

`email_basic.ipynb`

`email_mcp.ipynb`

`multi_thread.ipynb`

Packages