DABench AgentBeats Implementation

A complete A2A (Agent-to-Agent) compatible implementation of the DABench benchmark, following the AgentBeats methodology with Green Agent (evaluator) and Purple Agent (test subject) architecture.

The Data Agent Benchmark (DABench) is designed to measure and push the state-of-the-art in Data Analysis tasks for AI agents.

Overview

This project implements the DABench benchmark as an A2A-compatible evaluation system where:

Green Agent (Evaluator): Manages DABench assessments and evaluates other agents
Purple Agent (Test Subject): The agent being evaluated, with embedded Jupyter MCP capabilities

Features

✅ A2A Protocol Compatible: Full compatibility with Agent-to-Agent standard using Pydantic FastA2A
✅ AgentBeats Architecture: Proper green/purple agent separation as per AgentBeats guidelines
✅ DABench Scoring: DABench benchmark dataset
✅ PydanticA AI Agent and Evaluation: Utilizes Pydantic AI for agent and evaluation
✅ LLM-as-Judge Evaluation: Green Agent uses GPT-4o (Azure OpenAI or OpenAI) as an LLM judge to evaluate Purple Agent responses
✅ Configurable Purple Agent: Purple Agent model is fully configurable via PURPLE_AGENT_MODEL environment variable (supports OpenAI, Azure, Anthropic, Bedrock, etc.)
✅ Embedded MCP Tools: Purple agent includes embedded jupyter-mcp-server for autonomous code execution

AgentBeats Usage

The Green and Purple agents have been deployed to the AgentBeats platform.

Green Agent (Evaluator) URL: https://agentbeats.dev/eleonorecharles/dabench-evaluator
Purple Agent (Test Subject) URL: https://agentbeats.dev/eleonorecharles/dabench-agent

To score other purple agents using this Green Agent evaluator, use this repository: https://github.com/datalayer-challenges/dabench-leaderboard, modify the scenario.toml and push to a new branch. This will trigger the evaluation workflow in Github Actions. Once complete, submit a PR and if approved, your agent will be added to the leaderboard.

More details on AgentBeats submission can be found in: https://docs.agentbeats.dev/tutorial.

Local Development Setup

Configuration

The system uses environment variables for configuration. Copy .env.template to .env and configure:

# Model Configuration
# Green Agent: Always uses GPT-4o (auto-detects Azure vs OpenAI based on available keys)
# Purple Agent: Configurable via PURPLE_AGENT_MODEL (supports multiple providers)

PURPLE_AGENT_MODEL=openai:gpt-4o            # Purple Agent model (configurable)

# Alternative Purple Agent provider examples:
# PURPLE_AGENT_MODEL=azure:gpt-4             # Azure OpenAI (requires endpoint below)
# PURPLE_AGENT_MODEL=anthropic:claude-3-sonnet # Anthropic Claude
# PURPLE_AGENT_MODEL=bedrock:anthropic.claude-sonnet-4-5-20250929-v1:0 # AWS Bedrock
# PURPLE_AGENT_MODEL=gemini:gemini-pro       # Google Gemini
# PURPLE_AGENT_MODEL=groq:llama3-70b         # Groq

# Azure OpenAI Configuration (when using azure: models)
AZURE_OPENAI_API_KEY=your_azure_openai_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
OPENAI_API_VERSION=2024-06-01

# OpenAI Configuration (when using openai: models)
OPENAI_API_KEY=your_openai_api_key

# AWS Bedrock Configuration (when using bedrock: models)
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_DEFAULT_REGION=us-east-1

Model Selection Logic

Green Agent (Evaluator):
- Always uses gpt-4o for consistent evaluation via LLM-as-judge
- Automatically selects azure:gpt-4o if AZURE_OPENAI_API_KEY is available
- Falls back to openai:gpt-4o if OPENAI_API_KEY is available
- This ensures consistent scoring across evaluations
Purple Agent (Test Subject):
- Fully configurable via PURPLE_AGENT_MODEL environment variable
- Supports OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, Google Gemini, Groq
- Allows testing different models while maintaining evaluation consistency

Option 1: All-in-One Launcher Script

# Install dependencies
pip install -r requirements.txt

# Full dataset evaluation
python launcher.py --evaluate --full

# Quick sample evaluation (3 tasks)
python launcher.py --evaluate --quick-sample 3

Option 2: Separate Services with Makefile

For better control and monitoring, use the 3-terminal workflow:

# Install dependencies
pip install -r requirements.txt

# Terminal 1: Start Purple Agent (Test Subject with embedded MCP)  
make start-purple

# Terminal 2: Start Green Agent (Evaluator)
make start-green

# Terminal 3: Run Evaluation
make run-eval-monitor           # Full evaluation with real-time monitoring
make run-eval-quick-monitor     # Quick 3-task evaluation with real-time monitoring

Option 3: Docker Development

For containerized development, use the Docker workflow:

# Build all Docker images (run this first or after code changes)
make docker-build-all

macOS/Windows Docker Workflow

# Start services in separate terminals:
# Terminal 1: Start Purple Agent with embedded MCP (Test Subject)
make docker-start-purple

# Terminal 2: Start Green Agent (Evaluator)
make docker-start-green

# Terminal 3: Run Evaluation (on Docker containers)
make docker-run-eval-quick-monitor     # Quick 3-task evaluation with monitoring
make docker-run-eval-monitor           # Full dataset evaluation with monitoring

Linux Docker Workflow

# Start services in separate terminals:
# Terminal 1: Start Purple Agent with embedded MCP (Test Subject)
make docker-start-purple-linux

# Terminal 2: Start Green Agent (Evaluator)
make docker-start-green-linux

# Terminal 4: Run Evaluation (on Docker containers)
make docker-run-eval-quick-monitor-linux     # Quick 3-task evaluation with monitoring
make docker-run-eval-monitor-linux           # Full dataset evaluation with monitoring

Evaluation Results

After evaluation, Pydantic AI generates a detailed report in the results/ directory, including:

Overall scores and metrics
Task-by-task performance breakdown
Green Agent reason for scores

Embedded Jupyter MCP Integration

The Purple Agent includes an embedded Jupyter MCP (Model Context Protocol) Server for enhanced data analysis and code execution capabilities. This embedded server provides the agent with powerful computational tools for autonomous problem-solving without requiring a separate container.

Automatic Startup

The Purple Agent automatically starts its own Jupyter MCP server:

# Purple Agent starts embedded Jupyter MCP server on initialization
await self._start_embedded_jupyter_mcp()

# Automatically generates tokens and manages server lifecycle
# No external MCP server setup required

Network Configuration

Agent Port: 9019 (Purple Agent A2A endpoint)
Embedded MCP Port: 8888 (JupyterLab standard)
MCP Endpoints: http://localhost:8888/mcp/*
Health Check: http://localhost:8888/mcp/healthz
Tools List: http://localhost:8888/mcp/tools/list
Tool Execution: http://localhost:8888/mcp/tools/call

Data Context Files & DABench Tasks

The evaluation process involves two data components:

Data Files: 68 diverse CSV datasets in src/purple/agent-workings/data/ available for the Purple Agent to analyze through the Jupyter MCP Server
Task Distribution: Green Agent sends tasks one-by-one from the 257 DABench evaluation tasks stored in data-dabench/

Name		Name	Last commit message	Last commit date
Latest commit History 59 Commits
.github/workflows		.github/workflows
data-dabench		data-dabench
img		img
results		results
src		src
.env.template		.env.template
.gitignore		.gitignore
Dockerfile.green		Dockerfile.green
Dockerfile.purple		Dockerfile.purple
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
launcher.py		launcher.py
requirements.txt		requirements.txt
send_evaluation.py		send_evaluation.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Repository files navigation

DABench AgentBeats Implementation

Overview

Features

AgentBeats Usage

Local Development Setup

Configuration

Model Selection Logic

Option 1: All-in-One Launcher Script

Option 2: Separate Services with Makefile

Option 3: Docker Development

macOS/Windows Docker Workflow

Linux Docker Workflow

Evaluation Results

Embedded Jupyter MCP Integration

Automatic Startup

Network Configuration

Data Context Files & DABench Tasks

About

Uh oh!

Releases

Sponsor this project

Uh oh!

Packages

Uh oh!

Contributors

Uh oh!

Languages

Uh oh!

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

DABench AgentBeats Implementation

Overview

Features

AgentBeats Usage

Local Development Setup

Configuration

Model Selection Logic

Option 1: All-in-One Launcher Script

Option 2: Separate Services with Makefile

Option 3: Docker Development

macOS/Windows Docker Workflow

Linux Docker Workflow

Evaluation Results

Embedded Jupyter MCP Integration

Automatic Startup

Network Configuration

Data Context Files & DABench Tasks

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Sponsor this project

Uh oh!

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages