Paperclip Evals

Eval framework for testing Paperclip agent behaviors across models and prompt versions.

See the evals framework plan for full design rationale.

Quick Start

Prerequisites

pnpm add -g promptfoo

You need an API key for at least one provider. Set one of:

export OPENROUTER_API_KEY=sk-or-...    # OpenRouter (recommended - test multiple models)
export ANTHROPIC_API_KEY=sk-ant-...     # Anthropic direct
export OPENAI_API_KEY=sk-...            # OpenAI direct
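
Before running an eval, it can help to confirm that at least one of these variables is actually exported in the current shell. The following is a minimal sketch, assuming a POSIX shell. The function name `check_provider_keys` is hypothetical and is not part of this repo or of promptfoo.

```shell
# Hypothetical helper: report which provider keys are set, without printing
# their values. Returns 0 if at least one key is set, 1 otherwise.
check_provider_keys() {
  found=1
  for name in OPENROUTER_API_KEY ANTHROPIC_API_KEY OPENAI_API_KEY; do
    # Look up the variable whose name is stored in $name.
    # The names come from the fixed list above, so eval is safe here.
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      echo "found: $name"
      found=0
    fi
  done
  if [ "$found" -ne 0 ]; then
    echo "no provider API key set" >&2
  fi
  return "$found"
}
```

Calling `check_provider_keys && pnpm evals:smoke` would then skip the smoke test when no key is present.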

Run evals

# Smoke test (default models)
pnpm evals:smoke

# Or run promptfoo directly
cd evals/promptfoo
promptfoo eval

# View results in browser
promptfoo view

What's tested

Phase 0 covers narrow behavior evals for the Paperclip heartbeat skill:

Case	Category	What it checks
Assignment pickup	`core`	Agent picks up todo/in_progress tasks correctly
Progress update	`core`	Agent writes useful status comments
Blocked reporting	`core`	Agent recognizes and reports blocked state
Approval required	`governance`	Agent requests approval instead of acting
Company boundary	`governance`	Agent refuses cross-company actions
No work exit	`core`	Agent exits cleanly with no assignments
Checkout before work	`core`	Agent always checks out before modifying
409 conflict handling	`core`	Agent stops on 409, picks different task

Adding new cases

1. Add a YAML file to `evals/promptfoo/cases/`
2. Follow the existing case format (see `core-assignment-pickup.yaml` for reference)
3. Run `promptfoo eval` to test
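
The steps above can be sketched as a small shell helper that writes a skeleton case file. This is an illustration under assumptions, not the repo's actual case schema. The helper name `new_case_skeleton`, the file name `example-case`, the `scenario` variable, and the asserted string are all hypothetical. The `description`, `vars`, and `assert` keys and the `icontains` assertion type come from promptfoo's general test-case format, so compare the result against `core-assignment-pickup.yaml` before relying on it.

```shell
# Hypothetical helper: write a skeleton case file into a directory.
# Usage: new_case_skeleton <dir> <name>
new_case_skeleton() {
  dir=$1
  name=$2
  mkdir -p "$dir"
  # Quoted heredoc delimiter: the YAML below is written verbatim.
  cat > "$dir/$name.yaml" <<'EOF'
# Hypothetical skeleton. Field values are placeholders, not the repo's real cases.
- description: "Example: agent exits cleanly when nothing is assigned"
  vars:
    scenario: "The agent has no assigned tasks."
  assert:
    - type: icontains
      value: "no assignments"
EOF
  # Print the path of the file that was written.
  echo "$dir/$name.yaml"
}
```

For example, `new_case_skeleton evals/promptfoo/cases example-case` would create the file, after which `promptfoo eval` can be run from `evals/promptfoo`. Trying the helper in a scratch directory first avoids adding a placeholder case to the real suite.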

Phases

Phase 0 (current): Promptfoo bootstrap - narrow behavior evals with deterministic assertions
Phase 1: TypeScript eval harness with seeded scenarios and hard checks
Phase 2: Pairwise and rubric scoring layer
Phase 3: Efficiency metrics integration
Phase 4: Production-case ingestion
