This project aims to implement an evaluation pipeline for assessing the effectiveness of open-source agentic AI systems using the PeerRead dataset. It focuses on use-case-agnostic metrics that measure core capabilities such as task decomposition, tool integration, adaptability, and overall performance.
(DRAFT) (WIP) ----> Not fully implemented yet
For version history have a look at the CHANGELOG.
make setup_prod
make setup_dev
or make setup_dev_ollama
make run_cli
or make run_cli ARGS="--help"
make run_gui
make test_all
.env.example contains example entries for the API keys and variables used; the sketch after the listing shows how they are read at runtime.
# inference EP
GEMINI_API_KEY="xyz"
# tools
TAVILY_API_KEY=""
# log/mon/trace
WANDB_API_KEY="xyz"
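These keys are read from the environment at runtime. A minimal sketch, assuming the `.env` file has already been loaded into the process environment (e.g., by the app's startup code or a tool such as python-dotenv):

```python
# Sketch: reading the keys defined in .env.example from the environment.
# Assumes the .env file has already been loaded into the process environment.
import os

gemini_api_key = os.getenv("GEMINI_API_KEY")  # inference endpoint
tavily_api_key = os.getenv("TAVILY_API_KEY")  # tools
wandb_api_key = os.getenv("WANDB_API_KEY")    # logging/monitoring/tracing
```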
- config_app.py contains configuration constants for the application.
- config_chat.json contains the inference provider configuration and prompts. The inference endpoints used should adhere to the OpenAI Model Spec (2024-05-08), which pydantic-ai's OpenAI-compatible models follow.
- config_eval.json contains evaluation metrics and their weights.
- data_models.py contains the pydantic data models for agent system configuration and results (see the illustrative loading sketch below).
- The included chat configuration uses free inference endpoints, which are subject to change by the providers. See lists such as free-llm-api-resources to find other providers.
- The included chat configuration uses models that are also subject to change by the providers and have to be updated from time to time.
- The LLM-as-a-judge also uses this chat configuration.
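As an illustration of how the JSON configuration files can be loaded into pydantic models, here is a minimal sketch; the `EvalConfig` class, field layout, and config path are hypothetical stand-ins, not the actual classes and paths in data_models.py.

```python
# Hypothetical sketch: loading config_eval.json into a pydantic model.
# The class name, field layout, and path are illustrative, not the project's API.
import json
from pathlib import Path

from pydantic import BaseModel


class EvalConfig(BaseModel):
    evaluators_and_weights: dict[str, str]


def load_eval_config(path: str = "src/app/config/config_eval.json") -> EvalConfig:
    return EvalConfig(**json.loads(Path(path).read_text()))
```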
# TODO
Have a look at the example user story.
- Description: The Manager Agent oversees research and analysis tasks, coordinating the efforts of the research, analysis, and synthesizer agents to provide comprehensive answers to user queries. It delegates tasks and ensures the accuracy of the information (a minimal sketch of this delegation pattern follows the agent descriptions).
- Responsibilities:
- Coordinates the research, analysis, and synthesis agents.
- Delegates research tasks to the Research Agent.
- Delegates analysis tasks to the Analysis Agent.
- Delegates synthesis tasks to the Synthesizer Agent.
- Ensures the accuracy of the information.
- Location: src/app/agents/agent_system.py
- Description: The Research Agent gathers and analyzes data relevant to a given topic, using search tools to collect data and verifying the accuracy of assumptions, facts, and conclusions.
- Responsibilities:
- Gathers and analyzes data relevant to the topic.
- Uses search tools to collect data.
- Checks the accuracy of assumptions, facts, and conclusions.
- Tools: pydantic-ai DuckDuckGo Search Tool
- Location: src/app/agents/agent_system.py
- Description: The Analysis Agent checks the accuracy of assumptions, facts, and conclusions in the provided data, provides relevant feedback, and ensures data integrity.
- Responsibilities:
- Checks the accuracy of assumptions, facts, and conclusions.
- Provides relevant feedback if the result is not approved.
- Ensures data integrity.
- Location: src/app/agents/agent_system.py
- Description: The Synthesizer Agent outputs a well-formatted scientific report using the data provided, maintaining the original facts, conclusions, and sources.
- Responsibilities:
- Outputs a well-formatted scientific report using the provided data.
- Maintains the original facts, conclusions, and sources.
- Location: src/app/agents/agent_system.py
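A minimal sketch of the manager-to-researcher delegation pattern with pydantic-ai follows; the model name, prompts, and the `delegate_research` tool are illustrative assumptions, while the actual wiring lives in src/app/agents/agent_system.py and is driven by config_chat.json.

```python
# Minimal sketch of the manager/researcher delegation pattern with pydantic-ai.
# Model names, prompts, and the delegation tool are illustrative, not the
# project's actual implementation.
from pydantic_ai import Agent, RunContext

research_agent = Agent(
    "openai:gpt-4o-mini",
    system_prompt="Gather and analyze data relevant to the given topic.",
)

manager_agent = Agent(
    "openai:gpt-4o-mini",
    system_prompt="Coordinate research, analysis, and synthesis to answer the query.",
)


@manager_agent.tool
async def delegate_research(ctx: RunContext[None], query: str) -> str:
    """Delegate a research task to the research agent and return its findings."""
    result = await research_agent.run(query, usage=ctx.usage)
    return result.output  # attribute name may differ between pydantic-ai versions
```

Keeping each role as its own agent lets the manager route sub-tasks and check results before synthesis.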
The system includes comprehensive integration with the PeerRead dataset for scientific paper review evaluation:
- Purpose: Generate and evaluate scientific paper reviews using the Multi-Agent System
- Architecture: Clean separation between review generation (MAS) and evaluation (external system)
- Workflow (see the sketch below):
  - MAS: PDF → Review Generation → Persistent Storage (src/app/data_utils/reviews/)
  - External Evaluation: Load Reviews → Similarity Analysis → Results
- Documentation: See PeerRead Agent Usage Guide
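A hypothetical sketch of the decoupled two-stage workflow; the function names and review format are illustrative placeholders, and only the storage location comes from the section above.

```python
# Hypothetical sketch of the two-stage PeerRead workflow; function names are
# illustrative placeholders, not the project's actual API.
import json
from pathlib import Path

REVIEWS_DIR = Path("src/app/data_utils/reviews")


def persist_review(paper_id: str, review: dict) -> Path:
    """Stage 1 (MAS): store a generated review so evaluation can run separately."""
    REVIEWS_DIR.mkdir(parents=True, exist_ok=True)
    out = REVIEWS_DIR / f"{paper_id}.json"
    out.write_text(json.dumps(review, indent=2))
    return out


def load_reviews() -> dict[str, dict]:
    """Stage 2 (external evaluation): load persisted reviews for similarity analysis."""
    return {p.stem: json.loads(p.read_text()) for p in REVIEWS_DIR.glob("*.json")}
```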
# TODO
As configured in config_eval.json. A sketch of combining these weights into a composite score follows the listing.
{
"evaluators_and_weights": {
"planning_rational": "1/6",
"task_success": "1/6",
"tool_efficiency": "1/6",
"coordination_quality": "1/6",
"time_taken": "1/6",
"text_similarity": "1/6"
}
}
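A minimal sketch of how these weights might be combined into a composite score, assuming each evaluator returns a score normalized to [0, 1]; the helper below is illustrative, not the project's actual scoring code.

```python
# Sketch: turning the configured fractional weights into a weighted composite score.
# Assumes each evaluator score is normalized to [0, 1].
from fractions import Fraction

weights = {
    "planning_rational": "1/6",
    "task_success": "1/6",
    "tool_efficiency": "1/6",
    "coordination_quality": "1/6",
    "time_taken": "1/6",
    "text_similarity": "1/6",
}


def composite_score(scores: dict[str, float]) -> float:
    """Weighted sum of evaluator scores using the fractions from config_eval.json."""
    return sum(float(Fraction(weights[name])) * score for name, score in scores.items())


print(composite_score({k: 0.8 for k in weights}))  # ≈ 0.8 with equal weights
```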
Other pydantic-ai agents and the pydantic-ai DuckDuckGo Search Tool (see the sketch below).
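A minimal sketch of attaching the pydantic-ai DuckDuckGo search tool to an agent, assuming the DuckDuckGo extra of pydantic-ai is installed; the model name and prompt are illustrative.

```python
# Sketch: registering the pydantic-ai DuckDuckGo search tool on an agent.
# Requires the duckduckgo extra, e.g. `pip install "pydantic-ai-slim[duckduckgo]"`.
from pydantic_ai import Agent
from pydantic_ai.common_tools.duckduckgo import duckduckgo_search_tool

search_agent = Agent(
    "openai:gpt-4o-mini",  # illustrative; the project configures models via config_chat.json
    tools=[duckduckgo_search_tool()],
    system_prompt="Use the search tool to collect data relevant to the topic.",
)
```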
|- .claude                 # AI agent framework and commands
|  \- commands
|     |- generate-frp.md   # FRP generation command
|     \- execute-frp.md    # FRP execution command
|- .devcontainer           # pre-configured dev env
|- .github                 # workflows
|- .streamlit              # config.toml
|- .vscode                 # extensions, settings
|- assets/images
|- context                 # AI agent context framework
|  |- config
|  |  \- paths.md          # path variables and definitions
|  |- templates
|  |  \- 2_frp_base.md     # FRP template with quality framework
|  |- features             # feature descriptions for FRP generation
|  |- FRPs                 # generated feature requirements prompts
|  |- examples             # code patterns and examples
|  \- logs                 # agent execution logs
|- docs
|- src                     # source code
|  |- app
|  |  |- agents
|  |  |- config
|  |  |- evals
|  |  |- utils
|  |  |- __init__.py
|  |  |- main.py
|  |  \- py.typed
|  |- examples
|  \- gui
|     \- run_gui.py
|- tests
|- .env.example            # example env vars
|- .gitignore
|- .gitmessage
|- AGENTS.md               # north star document for AI agents (agentsmd.com)
|- CHANGELOG.md            # short project history
|- CLAUDE.md               # points to AGENTS.md
|- Dockerfile              # create app image
|- LICENSE.md
|- Makefile                # helper scripts
|- mkdocs.yaml             # docu from docstrings
|- pyproject.toml          # project settings
|- README.md               # project description
\- uv.lock                 # resolved package versions
- Focusing on agentic systems
- AgentNeo
- AutoGenBench
- Langchain AgentEvals, trajectory or LLM-as-a-judge
- Mosaic AI Agent Evaluation
- RagaAI-Catalyst
- AgentBench
- RAG oriented
- LLM apps
- AgentOps - Agency
- Arize
- Langtrace
- LangSmith - Langchain
- Weave - Weights & Biases
- Pydantic Logfire
- Comet Opik
- Langfuse
- Helicone
- LangWatch
- SWIF2T, Automated Focused Feedback Generation for Scientific Writing Assistance, 2024, 300 peer reviews citing weaknesses in scientific papers, with human evaluation
- PeerRead, A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications, 2018, 14K paper drafts and the corresponding accept/reject decisions, over 10K textual peer reviews written by experts for a subset of the papers, structured JSONL, clear labels. See A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications.
- BigSurvey, Generating a Structured Summary of Numerous Academic Papers: Dataset and Method, 2022, 7K survey papers and 430K referenced papers abstracts
- SciXGen, A Scientific Paper Dataset for Context-Aware Text Generation, 2021, 205k papers
- scientific_papers, 2018, two sets of long and structured documents, obtained from ArXiv and PubMed OpenAccess, 300k+ papers, total disk 7GB
- LIAR, fake news detection, only 12.8k records, single label
- X-Fact, Benchmark Dataset for Multilingual Fact Checking, 31.1k records, large, multilingual
- MultiFC, A Real-World Multi-Domain Dataset for Evidence-Based Fact Checking of Claims, 34.9k records
- FEVER, Fact Extraction and VERification, 185.4k records
- TODO GSM8K, bAbI, CommonsenseQA, DROP, LogiQA, MNLI
- Plancraft, an evaluation dataset for planning with LLM agents, both a text-only and multi-modal interface
- IDAT, A Multi-Modal Dataset and Toolkit for Building and Evaluating Interactive Task-Solving Agents
- PDEBench, set of benchmarks for scientific machine learning
- MatSci-NLP, evaluating the performance of natural language processing (NLP) models on materials science text
- TODO BigBench Hard, FSM Game
- Trelis Function Calling
- KnowLM Tool
- StatLLM, statistical analysis tasks, LLM-generated SAS code, and human evaluation scores
- TODO ToolComp
- SciArena: A New Platform for Evaluating Foundation Models in Scientific Literature Tasks
- AgentEvals CORE-Bench Leaderboard
- Berkeley Function-Calling Leaderboard
- Chatbot Arena LLM Leaderboard
- GAIA Leaderboard
- GalileoAI Agent Leaderboard
- WebDev Arena Leaderboard
- MiniWoB++: a web interaction benchmark for reinforcement learning
- List of papers inspected: further_reading
- Visualization of Papers inspected
- Agents-eval Enhancement Recommendations based on the Papers
- Papers Meta Review
- Papers Comprehensive Analysis
This project includes a comprehensive context framework for AI coding agents designed for structured development and collaboration. It supports feature implementation using a top-down approach where feature descriptions are transformed into Feature Request Prompts (FRPs) and then into code implementation.
The framework uses a layered documentation approach:
CLAUDE.md (entry point)
↓
AGENTS.md (core agent instructions)
↓
├── CONTRIBUTE.md (development workflows & standards)
├── AGENT_REQUESTS.md (human escalation & collaboration)
└── AGENT_LEARNINGS.md (pattern discovery & knowledge sharing)
- AGENTS.md: Core agent instructions with project patterns, conventions, and decision framework
- CONTRIBUTE.md: Development workflows, coding standards, and collaboration guidelines
- AGENT_REQUESTS.md: Human escalation process and active collaboration requests
- AGENT_LEARNINGS.md: Accumulated patterns, solutions, and knowledge sharing
- FRP Workflow: Feature Requirements Prompt generation and execution system
- context/templates/1_feature_description.md: User provides a feature description, e.g., by using this template
- .claude/commands/generate-frp.md: Creates comprehensive implementation prompts from feature descriptions
- .claude/commands/execute-frp.md: Executes features using generated FRPs with structured validation
- Follow AGENTS.md - Read project conventions, patterns, and quality standards
- Generate FRP - Use the generate-frp.md command for comprehensive feature planning and research
- Execute Implementation - Use the execute-frp.md command for structured development with quality gates
- Built-in quality evaluation with minimum thresholds (Context: 8/10, Clarity: 7/10, Alignment: 8/10, Success: 7/10)
- BDD/TDD approach integration following project patterns
- Automatic validation using unified command reference with error recovery
- TodoWrite tool integration for progress tracking and transparency
- Read the North Star: Start with AGENTS.md for project patterns and conventions
- Generate FRP: Use the /generate-frp <feature-name> command in Claude Code
- Execute Implementation: Use the /execute-frp <feature-name> command with the generated FRP
- Follow Quality Gates: Ensure all AGENTS.md thresholds are met before proceeding