
Petri + LLM Tribunal Integration

Multi-critic alignment auditing that combines Petri's auditing framework with LLM Tribunal's deliberation system.

Why This Integration?

Problem: Single-model judges have blind spots and inconsistent scoring, and can miss subtle alignment issues.

Solution: Use multiple LLM critics that deliberate and reach consensus, catching more issues through cross-validation.

Approach            Pros                         Cons
Single Judge        Fast, cheap                  Blind spots, inconsistent
Multi-Critic        More robust, catches more    Slower, more expensive
This Integration    Best of both                 Configurable tradeoff

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         Petri                                │
│  ┌─────────┐    ┌─────────┐    ┌─────────────────────────┐  │
│  │ Auditor │───▶│ Target  │───▶│      Scorer            │  │
│  └─────────┘    └─────────┘    │  ┌─────────────────┐   │  │
│                                │  │ tribunal_judge  │   │  │
│                                │  └────────┬────────┘   │  │
│                                └───────────┼────────────┘  │
└────────────────────────────────────────────┼───────────────┘
                                             │
                    ┌────────────────────────▼────────────────────────┐
                    │              LLM Tribunal                        │
                    │  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
                    │  │ Critic 1 │  │ Critic 2 │  │ Critic 3 │       │
                    │  │ (Claude) │  │  (GPT)   │  │ (Haiku)  │       │
                    │  └────┬─────┘  └────┬─────┘  └────┬─────┘       │
                    │       │             │             │              │
                    │       └─────────────┼─────────────┘              │
                    │                     ▼                            │
                    │            ┌─────────────────┐                   │
                    │            │  Deliberation   │                   │
                    │            │   (N rounds)    │                   │
                    │            └────────┬────────┘                   │
                    │                     ▼                            │
                    │            ┌─────────────────┐                   │
                    │            │   Synthesis     │                   │
                    │            │ (vote/average)  │                   │
                    │            └────────┬────────┘                   │
                    │                     ▼                            │
                    │            ┌─────────────────┐                   │
                    │            │   Validation    │                   │
                    │            │   (optional)    │                   │
                    │            └─────────────────┘                   │
                    └─────────────────────────────────────────────────┘

Installation

# Clone the repos
git clone https://github.com/safety-research/petri.git
git clone https://github.com/evalops/llm-tribunal.git

# Install dependencies
cd petri && pip install -e .
cd ../llm-tribunal && pip install -r requirements.txt

# Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

Usage

As a Petri Scorer

from tribunal_judge import tribunal_alignment_judge, TribunalJudgeConfig

# Quick evaluation (2 critics, 1 round, priority dimensions only)
from tribunal_judge import quick_tribunal_judge
scorer = quick_tribunal_judge()

# Thorough evaluation (3 critics, 3 rounds, all dimensions, validation)
from tribunal_judge import thorough_tribunal_judge
scorer = thorough_tribunal_judge()

# Custom configuration
config = TribunalJudgeConfig(
    critic_models=["claude-3-5-sonnet-20241022", "gpt-4o", "gemini-1.5-pro"],
    deliberation_rounds=2,
    aggregation_method="average",  # or "majority_vote"
    validation_enabled=True,
)
scorer = tribunal_alignment_judge(config=config)

In a Petri Eval

from inspect_ai import Task, eval
from petri.tasks.petri import petri_task
from tribunal_judge import tribunal_alignment_judge

task = petri_task(
    scorer=tribunal_alignment_judge(),
    # ... other petri config
)

results = eval(task, model="anthropic/claude-3-5-sonnet-20241022")

Configuration Options

Option                Default                Description
critic_models         Claude + GPT + Haiku   List of models for multi-critic evaluation
deliberation_rounds   2                      Number of deliberation rounds between critics
aggregation_method    "average"              How to combine scores: "average" or "majority_vote"
validation_enabled    True                   Whether to validate final scores
validator_model       Claude Sonnet          Model for the validation step
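
The one option not demonstrated in the Usage examples is validator_model. A hedged example of overriding it (the exact model-id string format is an assumption; use whatever identifiers your Tribunal install accepts):

from tribunal_judge import tribunal_alignment_judge, TribunalJudgeConfig

config = TribunalJudgeConfig(
    aggregation_method="majority_vote",            # vote across critics instead of averaging
    validation_enabled=True,
    validator_model="claude-3-5-sonnet-20241022",  # assumed to take a model id, like critic_models entries
)
scorer = tribunal_alignment_judge(config=config)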

Presets

quick_tribunal_judge()

  • 2 fast models (GPT-4o-mini, Haiku)
  • 1 deliberation round
  • 5 priority dimensions only
  • No validation
  • Use for: Rapid iteration, development, cost-sensitive runs

thorough_tribunal_judge()

  • 3 capable models (Sonnet, GPT-4o, Haiku)
  • 3 deliberation rounds
  • All 30+ dimensions
  • Validation enabled
  • Use for: Final evaluations, important audits, research

Output Format

The scorer returns a Score object compatible with Petri/Inspect:

Score(
    value={
        "concerning": 7,
        "disappointing": 4,
        "needs_attention": 8,
        # ... all dimensions
    },
    answer="Multi-critic evaluation complete. Notable concerns: ...",
    explanation="Detailed reasoning from critics...",
    metadata={
        "critic_models": ["claude-3-5-sonnet", "gpt-4o", "haiku"],
        "deliberation_rounds": 2,
        "dimension_details": {
            "concerning": {
                "confidence": 0.85,
                "individual_scores": [7, 8, 6],
                "vote_distribution": {7: 1, 8: 1, 6: 1},
                "validation_passed": True,
            },
            # ...
        },
    },
)
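
Because the per-dimension details are surfaced in Score.metadata, downstream tooling can filter on them. A minimal sketch (the helper name and threshold are illustrative, not part of the integration), assuming the metadata layout shown above:

# Illustrative helper: surface dimensions the tribunal was unsure about.
def flag_low_confidence(score, threshold=0.7):
    details_by_dim = (score.metadata or {}).get("dimension_details", {})
    flagged = {}
    for dim, details in details_by_dim.items():
        if details.get("confidence", 1.0) < threshold or not details.get("validation_passed", True):
            flagged[dim] = {
                "score": score.value.get(dim),
                "confidence": details.get("confidence"),
                "individual_scores": details.get("individual_scores"),
            }
    # dict: dimension name -> details for low-confidence or failed-validation dimensions
    return flagged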

How It Works

  1. Transcript Formatting: Petri's XML transcript is passed to Tribunal
  2. Per-Dimension Evaluation: Each alignment dimension is evaluated separately
  3. Multi-Critic Deliberation: Multiple LLMs assess and discuss the evidence
  4. Synthesis: Scores are aggregated via voting or averaging (see the sketch after this list)
  5. Validation: Optional verification that scores are well-supported
  6. Score Assembly: Results converted back to Petri's Score format
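
Steps 3-5 come down to combining per-critic scores for each dimension. A minimal sketch of the synthesis step, covering only the two aggregation modes named above (the real Tribunal code also handles deliberation transcripts and validation):

from collections import Counter
from statistics import mean

def synthesize(critic_scores, method="average"):
    """Combine one dimension's scores from several critics.

    critic_scores: one integer score per critic, e.g. [7, 8, 6].
    method: "average" (rounded mean) or "majority_vote" (most common score).
    """
    if method == "average":
        return round(mean(critic_scores))
    votes = Counter(critic_scores)
    return votes.most_common(1)[0][0]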

Extending

Adding Custom Dimensions

from tribunal_judge import tribunal_alignment_judge, DIMENSIONS  # DIMENSIONS: default dimension set (import path assumed)

custom_dimensions = {
    "my_dimension": "Description of what to look for...",
}

scorer = tribunal_alignment_judge(
    dimensions={**DIMENSIONS, **custom_dimensions}
)

Using Different Models

from tribunal_judge import TribunalJudgeConfig

config = TribunalJudgeConfig(
    critic_models=[
        "claude-3-opus-20240229",   # Most capable
        "gpt-4-turbo-preview",      # Strong alternative
        "gemini-1.5-pro",           # Different perspective
    ],
)

Performance Considerations

Configuration             API Calls per Transcript   Estimated Cost   Time
quick_tribunal_judge      ~20                        $0.10-0.50       30-60s
thorough_tribunal_judge   ~300+                      $5-15            5-15min
Single judge (baseline)   ~1                         $0.05-0.20       10-30s
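
A back-of-the-envelope sketch of where the call counts come from, assuming one critic call per dimension per round plus one validation call per dimension (deliberation and synthesis overhead is not counted here):

def estimate_calls(dimensions, critics, rounds, validation=False):
    # One critic call per dimension per round, plus one validation call per dimension.
    calls = dimensions * critics * rounds
    if validation:
        calls += dimensions
    return calls

estimate_calls(5, 2, 1)                    # quick preset -> 10 (the table's ~20 also counts calls this sketch ignores)
estimate_calls(30, 3, 3, validation=True)  # thorough preset -> 300, in line with the ~300+ above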

License

MIT - See individual projects for their licenses.
