
Petri + LLM Tribunal Integration

Multi-critic alignment auditing that combines Petri's auditing framework with LLM Tribunal's deliberation system.

Why This Integration?

Problem: Single-model judges have blind spots and inconsistent scoring, and can miss subtle alignment issues.

Solution: Use multiple LLM critics that deliberate and reach consensus, catching more issues through cross-validation.

Approach            Pros                         Cons
Single Judge        Fast, cheap                  Blind spots, inconsistent
Multi-Critic        More robust, catches more    Slower, more expensive
This Integration    Best of both                 Configurable tradeoff

Architecture

┌─────────────────────────────────────────────────────────────┐
│                         Petri                                │
│  ┌─────────┐    ┌─────────┐    ┌─────────────────────────┐  │
│  │ Auditor │───▶│ Target  │───▶│      Scorer            │  │
│  └─────────┘    └─────────┘    │  ┌─────────────────┐   │  │
│                                │  │ tribunal_judge  │   │  │
│                                │  └────────┬────────┘   │  │
│                                └───────────┼────────────┘  │
└────────────────────────────────────────────┼───────────────┘
                                             │
                    ┌────────────────────────▼────────────────────────┐
                    │              LLM Tribunal                        │
                    │  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
                    │  │ Critic 1 │  │ Critic 2 │  │ Critic 3 │       │
                    │  │ (Claude) │  │  (GPT)   │  │ (Haiku)  │       │
                    │  └────┬─────┘  └────┬─────┘  └────┬─────┘       │
                    │       │             │             │              │
                    │       └─────────────┼─────────────┘              │
                    │                     ▼                            │
                    │            ┌─────────────────┐                   │
                    │            │  Deliberation   │                   │
                    │            │   (N rounds)    │                   │
                    │            └────────┬────────┘                   │
                    │                     ▼                            │
                    │            ┌─────────────────┐                   │
                    │            │   Synthesis     │                   │
                    │            │ (vote/average)  │                   │
                    │            └────────┬────────┘                   │
                    │                     ▼                            │
                    │            ┌─────────────────┐                   │
                    │            │   Validation    │                   │
                    │            │   (optional)    │                   │
                    │            └─────────────────┘                   │
                    └─────────────────────────────────────────────────┘

Installation

# Clone the repos
git clone https://github.com/safety-research/petri.git
git clone https://github.com/evalops/llm-tribunal.git

# Install dependencies
cd petri && pip install -e .
cd ../llm-tribunal && pip install -r requirements.txt

# Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

Usage

As a Petri Scorer

from tribunal_judge import tribunal_alignment_judge, TribunalJudgeConfig

# Quick evaluation (2 critics, 1 round, priority dimensions only)
from tribunal_judge import quick_tribunal_judge
scorer = quick_tribunal_judge()

# Thorough evaluation (3 critics, 3 rounds, all dimensions, validation)
from tribunal_judge import thorough_tribunal_judge
scorer = thorough_tribunal_judge()

# Custom configuration
config = TribunalJudgeConfig(
    critic_models=["claude-3-5-sonnet-20241022", "gpt-4o", "gemini-1.5-pro"],
    deliberation_rounds=2,
    aggregation_method="average",  # or "majority_vote"
    validation_enabled=True,
)
scorer = tribunal_alignment_judge(config=config)

In a Petri Eval

from inspect_ai import Task, eval
from petri.tasks.petri import petri_task
from tribunal_judge import tribunal_alignment_judge

task = petri_task(
    scorer=tribunal_alignment_judge(),
    # ... other petri config
)

results = eval(task, model="anthropic/claude-3-5-sonnet-20241022")

Configuration Options

Option                Default                Description
critic_models         Claude + GPT + Haiku   List of models for multi-critic evaluation
deliberation_rounds   2                      Number of deliberation rounds between critics
aggregation_method    "average"              How to combine scores: "average" or "majority_vote"
validation_enabled    True                   Whether to validate final scores
validator_model       Claude Sonnet          Model for the validation step
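
The one option not demonstrated in the Usage examples is validator_model. A hedged example of overriding it (the exact model-id string format is an assumption; use whatever identifiers your Tribunal install accepts):

from tribunal_judge import tribunal_alignment_judge, TribunalJudgeConfig

config = TribunalJudgeConfig(
    aggregation_method="majority_vote",            # vote across critics instead of averaging
    validation_enabled=True,
    validator_model="claude-3-5-sonnet-20241022",  # assumed to take a model id, like critic_models entries
)
scorer = tribunal_alignment_judge(config=config)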

Presets

quick_tribunal_judge()

  • 2 fast models (GPT-4o-mini, Haiku)
  • 1 deliberation round
  • 5 priority dimensions only
  • No validation
  • Use for: Rapid iteration, development, cost-sensitive runs

thorough_tribunal_judge()

  • 3 capable models (Sonnet, GPT-4o, Haiku)
  • 3 deliberation rounds
  • All 30+ dimensions
  • Validation enabled
  • Use for: Final evaluations, important audits, research

Output Format

The scorer returns a Score object compatible with Petri/Inspect:

Score(
    value={
        "concerning": 7,
        "disappointing": 4,
        "needs_attention": 8,
        # ... all dimensions
    },
    answer="Multi-critic evaluation complete. Notable concerns: ...",
    explanation="Detailed reasoning from critics...",
    metadata={
        "critic_models": ["claude-3-5-sonnet", "gpt-4o", "haiku"],
        "deliberation_rounds": 2,
        "dimension_details": {
            "concerning": {
                "confidence": 0.85,
                "individual_scores": [7, 8, 6],
                "vote_distribution": {7: 1, 8: 1, 6: 1},
                "validation_passed": True,
            },
            # ...
        },
    },
)
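
Because the per-dimension details are surfaced in Score.metadata, downstream tooling can filter on them. A minimal sketch (the helper name and threshold are illustrative, not part of the integration), assuming the metadata layout shown above:

# Illustrative helper: surface dimensions the tribunal was unsure about.
def flag_low_confidence(score, threshold=0.7):
    details_by_dim = (score.metadata or {}).get("dimension_details", {})
    flagged = {}
    for dim, details in details_by_dim.items():
        if details.get("confidence", 1.0) < threshold or not details.get("validation_passed", True):
            flagged[dim] = {
                "score": score.value.get(dim),
                "confidence": details.get("confidence"),
                "individual_scores": details.get("individual_scores"),
            }
    # dict: dimension name -> details for low-confidence or failed-validation dimensions
    return flagged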

How It Works

  1. Transcript Formatting: Petri's XML transcript is passed to Tribunal
  2. Per-Dimension Evaluation: Each alignment dimension is evaluated separately
  3. Multi-Critic Deliberation: Multiple LLMs assess and discuss the evidence
  4. Synthesis: Scores are aggregated via voting or averaging (see the sketch after this list)
  5. Validation: Optional verification that scores are well-supported
  6. Score Assembly: Results converted back to Petri's Score format
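
Steps 3-5 come down to combining per-critic scores for each dimension. A minimal sketch of the synthesis step, covering only the two aggregation modes named above (the real Tribunal code also handles deliberation transcripts and validation):

from collections import Counter
from statistics import mean

def synthesize(critic_scores, method="average"):
    """Combine one dimension's scores from several critics.

    critic_scores: one integer score per critic, e.g. [7, 8, 6].
    method: "average" (rounded mean) or "majority_vote" (most common score).
    """
    if method == "average":
        return round(mean(critic_scores))
    votes = Counter(critic_scores)
    return votes.most_common(1)[0][0]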

Extending

Adding Custom Dimensions

from tribunal_judge import tribunal_alignment_judge, DIMENSIONS  # DIMENSIONS: default dimension set (import path assumed)

custom_dimensions = {
    "my_dimension": "Description of what to look for...",
}

scorer = tribunal_alignment_judge(
    dimensions={**DIMENSIONS, **custom_dimensions}
)

Using Different Models

from tribunal_judge import TribunalJudgeConfig

config = TribunalJudgeConfig(
    critic_models=[
        "claude-3-opus-20240229",   # Most capable
        "gpt-4-turbo-preview",      # Strong alternative
        "gemini-1.5-pro",           # Different perspective
    ],
)

Performance Considerations

Configuration             API Calls per Transcript   Estimated Cost   Time
quick_tribunal_judge      ~20                        $0.10-0.50       30-60s
thorough_tribunal_judge   ~300+                      $5-15            5-15min
Single judge (baseline)   ~1                         $0.05-0.20       10-30s
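
A back-of-the-envelope sketch of where the call counts come from, assuming one critic call per dimension per round plus one validation call per dimension (deliberation and synthesis overhead is not counted here):

def estimate_calls(dimensions, critics, rounds, validation=False):
    # One critic call per dimension per round, plus one validation call per dimension.
    calls = dimensions * critics * rounds
    if validation:
        calls += dimensions
    return calls

estimate_calls(5, 2, 1)                    # quick preset -> 10 (the table's ~20 also counts calls this sketch ignores)
estimate_calls(30, 3, 3, validation=True)  # thorough preset -> 300, in line with the ~300+ above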

License

MIT - See individual projects for their licenses.
