A complete A2A (Agent-to-Agent) compatible implementation of the DABench benchmark, following the AgentBeats methodology with Green Agent (evaluator) and Purple Agent (test subject) architecture.
The Data Agent Benchmark (DABench) is designed to measure and advance the state of the art in data analysis tasks for AI agents.
This project implements the DABench benchmark as an A2A-compatible evaluation system where:
- Green Agent (Evaluator): Manages DABench assessments and evaluates other agents
- Purple Agent (Test Subject): The agent being evaluated, with embedded Jupyter MCP capabilities
- ✅ A2A Protocol Compatible: Full compatibility with the Agent-to-Agent standard using Pydantic FastA2A
- ✅ AgentBeats Architecture: Proper green/purple agent separation as per AgentBeats guidelines
- ✅ DABench Scoring: Scores agents on the DABench benchmark dataset
- ✅ Pydantic AI Agent and Evaluation: Uses Pydantic AI for both the agents and the evaluation
- ✅ LLM-as-Judge Evaluation: Green Agent uses GPT-4o (Azure OpenAI or OpenAI) as an LLM judge to evaluate Purple Agent responses
- ✅ Configurable Purple Agent: Purple Agent model is fully configurable via the PURPLE_AGENT_MODEL environment variable (supports OpenAI, Azure, Anthropic, Bedrock, etc.)
- ✅ Embedded MCP Tools: Purple Agent includes an embedded jupyter-mcp-server for autonomous code execution
The Green and Purple agents have been deployed to the AgentBeats platform.
- Green Agent (Evaluator) URL: https://agentbeats.dev/eleonorecharles/dabench-evaluator
- Purple Agent (Test Subject) URL: https://agentbeats.dev/eleonorecharles/dabench-agent
To score other purple agents with this Green Agent evaluator, use the https://github.com/datalayer-challenges/dabench-leaderboard repository: modify scenario.toml and push to a new branch. This triggers the evaluation workflow in GitHub Actions. Once it completes, submit a PR; if it is approved, your agent will be added to the leaderboard.
More details on AgentBeats submission are available at https://docs.agentbeats.dev/tutorial.
The system uses environment variables for configuration. Copy .env.template to .env and configure:
# Model Configuration
# Green Agent: Always uses GPT-4o (auto-detects Azure vs OpenAI based on available keys)
# Purple Agent: Configurable via PURPLE_AGENT_MODEL (supports multiple providers)
PURPLE_AGENT_MODEL=openai:gpt-4o # Purple Agent model (configurable)
# Alternative Purple Agent provider examples:
# PURPLE_AGENT_MODEL=azure:gpt-4 # Azure OpenAI (requires endpoint below)
# PURPLE_AGENT_MODEL=anthropic:claude-3-sonnet # Anthropic Claude
# PURPLE_AGENT_MODEL=bedrock:anthropic.claude-sonnet-4-5-20250929-v1:0 # AWS Bedrock
# PURPLE_AGENT_MODEL=gemini:gemini-pro # Google Gemini
# PURPLE_AGENT_MODEL=groq:llama3-70b # Groq
# Azure OpenAI Configuration (when using azure: models)
AZURE_OPENAI_API_KEY=your_azure_openai_api_key
AZURE_OPENAI_ENDPOINT=https://your-resource.openai.azure.com/
OPENAI_API_VERSION=2024-06-01
# OpenAI Configuration (when using openai: models)
OPENAI_API_KEY=your_openai_api_key
# AWS Bedrock Configuration (when using bedrock: models)
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_DEFAULT_REGION=us-east-1
- Green Agent (Evaluator):
  - Always uses gpt-4o for consistent evaluation via LLM-as-judge
  - Automatically selects azure:gpt-4o if AZURE_OPENAI_API_KEY is available
  - Falls back to openai:gpt-4o if OPENAI_API_KEY is available
  - This ensures consistent scoring across evaluations
- Purple Agent (Test Subject):
  - Fully configurable via the PURPLE_AGENT_MODEL environment variable
  - Supports OpenAI, Azure OpenAI, Anthropic, AWS Bedrock, Google Gemini, Groq
  - Allows testing different models while maintaining evaluation consistency
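The selection rules above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names `select_judge_model` and `split_model_spec` are hypothetical, and only the environment variable names and the `provider:model` format are taken from the configuration described here.

```python
import os
from typing import Mapping, Optional, Tuple


def select_judge_model(env: Optional[Mapping[str, str]] = None) -> str:
    """Pick the Green Agent judge model from the available API keys.

    Hypothetical helper: prefers Azure when AZURE_OPENAI_API_KEY is set,
    otherwise falls back to OpenAI when OPENAI_API_KEY is set.
    """
    env = os.environ if env is None else env
    if env.get("AZURE_OPENAI_API_KEY"):
        return "azure:gpt-4o"
    if env.get("OPENAI_API_KEY"):
        return "openai:gpt-4o"
    raise RuntimeError(
        "Set AZURE_OPENAI_API_KEY or OPENAI_API_KEY for the Green Agent judge"
    )


def split_model_spec(spec: str) -> Tuple[str, str]:
    """Split a PURPLE_AGENT_MODEL value such as 'openai:gpt-4o'.

    Only the first colon separates provider from model, because Bedrock
    model IDs contain colons of their own.
    """
    provider, sep, model = spec.partition(":")
    if not sep or not provider or not model:
        raise ValueError(f"Expected 'provider:model', got {spec!r}")
    return provider, model
```

Passing the environment as a mapping keeps the helper easy to test; calling `select_judge_model()` with no argument reads `os.environ`.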
# Install dependencies
pip install -r requirements.txt
# Full dataset evaluation
python launcher.py --evaluate --full
# Quick sample evaluation (3 tasks)
python launcher.py --evaluate --quick-sample 3
For better control and monitoring, use the 3-terminal workflow:
# Install dependencies
pip install -r requirements.txt
# Terminal 1: Start Purple Agent (Test Subject with embedded MCP)
make start-purple
# Terminal 2: Start Green Agent (Evaluator)
make start-green
# Terminal 3: Run Evaluation
make run-eval-monitor # Full evaluation with real-time monitoring
make run-eval-quick-monitor # Quick 3-task evaluation with real-time monitoring
For containerized development, use the Docker workflow:
# Build all Docker images (run this first or after code changes)
make docker-build-all
# Start services in separate terminals:
# Terminal 1: Start Purple Agent with embedded MCP (Test Subject)
make docker-start-purple
# Terminal 2: Start Green Agent (Evaluator)
make docker-start-green
# Terminal 3: Run Evaluation (on Docker containers)
make docker-run-eval-quick-monitor # Quick 3-task evaluation with monitoring
make docker-run-eval-monitor # Full dataset evaluation with monitoring
# Linux: start services in separate terminals:
# Terminal 1: Start Purple Agent with embedded MCP (Test Subject)
make docker-start-purple-linux
# Terminal 2: Start Green Agent (Evaluator)
make docker-start-green-linux
# Terminal 3: Run Evaluation (on Docker containers)
make docker-run-eval-quick-monitor-linux # Quick 3-task evaluation with monitoring
make docker-run-eval-monitor-linux # Full dataset evaluation with monitoring
After evaluation, Pydantic AI generates a detailed report in the results/ directory, including:
- Overall scores and metrics
- Task-by-task performance breakdown
- The Green Agent's reasoning for each score
The Purple Agent includes an embedded Jupyter MCP (Model Context Protocol) Server for enhanced data analysis and code execution capabilities. This embedded server provides the agent with powerful computational tools for autonomous problem-solving without requiring a separate container.
The Purple Agent automatically starts its own Jupyter MCP server:
# Purple Agent starts embedded Jupyter MCP server on initialization
await self._start_embedded_jupyter_mcp()
# Automatically generates tokens and manages server lifecycle
# No external MCP server setup required
- Agent Port: 9019 (Purple Agent A2A endpoint)
- Embedded MCP Port: 8888 (JupyterLab standard)
- MCP Endpoints: http://localhost:8888/mcp/*
- Health Check: http://localhost:8888/mcp/healthz
- Tools List: http://localhost:8888/mcp/tools/list
- Tool Execution: http://localhost:8888/mcp/tools/call
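To check that the embedded server is reachable before starting an evaluation, a small probe of the health endpoint listed above may help. The sketch below uses only the Python standard library; the helper names `mcp_url` and `is_mcp_healthy` are hypothetical, and it assumes that a plain GET without a token is enough for the health check, which the server configuration may not allow.

```python
import urllib.error
import urllib.request

MCP_BASE_URL = "http://localhost:8888/mcp"  # embedded MCP port listed above


def mcp_url(path: str, base_url: str = MCP_BASE_URL) -> str:
    """Join the MCP base URL and an endpoint path such as 'healthz'."""
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}"


def is_mcp_healthy(base_url: str = MCP_BASE_URL, timeout: float = 2.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200.

    Assumption: an unauthenticated GET is sufficient for healthz.
    Connection failures and HTTP errors are reported as "not healthy".
    """
    try:
        with urllib.request.urlopen(
            mcp_url("healthz", base_url), timeout=timeout
        ) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Polling `is_mcp_healthy()` in a short loop after launching the Purple Agent is one way to wait until the embedded server is ready.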
The evaluation process involves two data components:
- Data Files: 68 diverse CSV datasets in src/purple/agent-workings/data/, available for the Purple Agent to analyze through the Jupyter MCP Server
- Task Distribution: The Green Agent sends tasks one by one from the 257 DABench evaluation tasks stored in data-dabench/
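A quick way to confirm that the data files are in place is to list the CSV files in the data directory. The helper below is a hypothetical sketch, not part of the project; it assumes the datasets sit directly in the directory (no subfolders) and use a lowercase .csv extension.

```python
from pathlib import Path
from typing import List


def list_csv_datasets(data_dir: str) -> List[str]:
    """Return the sorted names of the CSV files directly inside data_dir."""
    return sorted(p.name for p in Path(data_dir).glob("*.csv") if p.is_file())
```

Calling `list_csv_datasets("src/purple/agent-workings/data")` and comparing the length of the result with the 68 datasets mentioned above serves as a simple sanity check before running an evaluation.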

