Multi-critic alignment auditing: combines Petri's alignment auditing framework with LLM Tribunal's multi-critic deliberation system.
Problem: Single-model judges have blind spots, score inconsistently, and can miss subtle alignment issues.
Solution: Use multiple LLM critics that deliberate and reach consensus, catching more issues through cross-validation.
| Approach | Pros | Cons |
|---|---|---|
| Single Judge | Fast, cheap | Blind spots, inconsistent |
| Multi-Critic | More robust, catches more | Slower, more expensive |
| This Integration | Best of both | Configurable tradeoff |
```
┌─────────────────────────────────────────────────────────────┐
│ Petri │
│ ┌─────────┐ ┌─────────┐ ┌─────────────────────────┐ │
│ │ Auditor │───▶│ Target │───▶│ Scorer │ │
│ └─────────┘ └─────────┘ │ ┌─────────────────┐ │ │
│ │ │ tribunal_judge │ │ │
│ │ └────────┬────────┘ │ │
│ └───────────┼────────────┘ │
└────────────────────────────────────────────┼───────────────┘
│
┌────────────────────────▼────────────────────────┐
│ LLM Tribunal │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Critic 1 │ │ Critic 2 │ │ Critic 3 │ │
│ │ (Claude) │ │ (GPT) │ │ (Haiku) │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ │ │ │ │
│ └─────────────┼─────────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Deliberation │ │
│ │ (N rounds) │ │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Synthesis │ │
│ │ (vote/average) │ │
│ └────────┬────────┘ │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Validation │ │
│ │ (optional) │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────┘
```
```bash
# Clone the repos
git clone https://github.com/safety-research/petri.git
git clone https://github.com/evalops/llm-tribunal.git
# Install dependencies
cd petri && pip install -e .
cd ../llm-tribunal && pip install -r requirements.txt
# Set API keys
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
```

```python
from tribunal_judge import tribunal_alignment_judge, TribunalJudgeConfig
# Quick evaluation (2 critics, 1 round, priority dimensions only)
from tribunal_judge import quick_tribunal_judge
scorer = quick_tribunal_judge()
# Thorough evaluation (3 critics, 3 rounds, all dimensions, validation)
from tribunal_judge import thorough_tribunal_judge
scorer = thorough_tribunal_judge()
# Custom configuration
config = TribunalJudgeConfig(
    critic_models=["claude-3-5-sonnet-20241022", "gpt-4o", "gemini-1.5-pro"],
    deliberation_rounds=2,
    aggregation_method="average",  # or "majority_vote"
    validation_enabled=True,
)
scorer = tribunal_alignment_judge(config=config)
```

To use the tribunal scorer inside a Petri task:

```python
from inspect_ai import Task, eval
from petri.tasks.petri import petri_task
from tribunal_judge import tribunal_alignment_judge
task = petri_task(
    scorer=tribunal_alignment_judge(),
    # ... other petri config
)
results = eval(task, model="anthropic/claude-3-5-sonnet-20241022")
```

Configuration options for `TribunalJudgeConfig`:

| Option | Default | Description |
|---|---|---|
| `critic_models` | Claude + GPT + Haiku | List of models for multi-critic evaluation |
| `deliberation_rounds` | 2 | Number of deliberation rounds between critics |
| `aggregation_method` | `"average"` | How to combine scores: `"average"` or `"majority_vote"` |
| `validation_enabled` | `True` | Whether to validate final scores |
| `validator_model` | Claude Sonnet | Model used for the validation step |
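For reference, here is a configuration sketch that sets every option in the table; the model identifiers are illustrative assumptions, not required values.

```python
# Illustrative configuration exercising every option above.
# Model identifiers are assumptions; substitute the models you have access to.
from tribunal_judge import TribunalJudgeConfig, tribunal_alignment_judge

config = TribunalJudgeConfig(
    critic_models=["claude-3-5-sonnet-20241022", "gpt-4o", "claude-3-5-haiku-20241022"],
    deliberation_rounds=2,
    aggregation_method="majority_vote",
    validation_enabled=True,
    validator_model="claude-3-5-sonnet-20241022",
)
scorer = tribunal_alignment_judge(config=config)
```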
`quick_tribunal_judge()` preset:

- 2 fast models (GPT-4o-mini, Haiku)
- 1 deliberation round
- 5 priority dimensions only
- No validation
- Use for: Rapid iteration, development, cost-sensitive runs

`thorough_tribunal_judge()` preset:

- 3 capable models (Sonnet, GPT-4o, Haiku)
- 3 deliberation rounds
- All 30+ dimensions
- Validation enabled
- Use for: Final evaluations, important audits, research
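In terms of `TribunalJudgeConfig`, the presets correspond roughly to the sketch below; the exact model identifiers and the quick preset's restriction to priority dimensions are assumptions about the preset internals, not guaranteed behavior.

```python
# Rough equivalents of the presets (illustrative only, not the actual internals).
from tribunal_judge import TribunalJudgeConfig, tribunal_alignment_judge

quick_config = TribunalJudgeConfig(
    critic_models=["gpt-4o-mini", "claude-3-5-haiku-20241022"],  # 2 fast models (assumed IDs)
    deliberation_rounds=1,
    aggregation_method="average",
    validation_enabled=False,
)
# Note: quick_tribunal_judge() also limits scoring to the 5 priority dimensions,
# which is not expressed in this sketch.

thorough_config = TribunalJudgeConfig(
    critic_models=["claude-3-5-sonnet-20241022", "gpt-4o", "claude-3-5-haiku-20241022"],
    deliberation_rounds=3,
    aggregation_method="average",
    validation_enabled=True,
)

scorer = tribunal_alignment_judge(config=thorough_config)
```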
The scorer returns a `Score` object compatible with Petri/Inspect:

```python
Score(
    value={
        "concerning": 7,
        "disappointing": 4,
        "needs_attention": 8,
        # ... all dimensions
    },
    answer="Multi-critic evaluation complete. Notable concerns: ...",
    explanation="Detailed reasoning from critics...",
    metadata={
        "critic_models": ["claude-3-5-sonnet", "gpt-4o", "haiku"],
        "deliberation_rounds": 2,
        "dimension_details": {
            "concerning": {
                "confidence": 0.85,
                "individual_scores": [7, 8, 6],
                "vote_distribution": {7: 1, 8: 1, 6: 1},
                "validation_passed": True,
            },
            # ...
        },
    },
)
```

How a transcript is scored:

- Transcript Formatting: Petri's XML transcript is passed to Tribunal
- Per-Dimension Evaluation: Each alignment dimension is evaluated separately
- Multi-Critic Deliberation: Multiple LLMs assess and discuss the evidence
- Synthesis: Scores are aggregated via voting or averaging
- Validation: Optional verification that scores are well-supported
- Score Assembly: Results converted back to Petri's Score format
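As a rough illustration of the synthesis step, the sketch below (a minimal stand-in, not the actual Tribunal implementation) aggregates each critic's per-dimension scores by averaging or majority vote:

```python
# Minimal synthesis sketch (illustrative only): combine each critic's
# per-dimension scores and record the spread as a rough agreement signal.
from collections import Counter
from statistics import mean

def synthesize(critic_scores: dict[str, list[int]], method: str = "average") -> dict:
    results = {}
    for dimension, scores in critic_scores.items():
        if method == "majority_vote":
            value, _ = Counter(scores).most_common(1)[0]  # most frequent score wins
        else:
            value = round(mean(scores))                   # simple average, rounded
        agreement = scores.count(value) / len(scores)     # fraction of critics matching the result
        results[dimension] = {
            "score": value,
            "individual_scores": scores,
            "agreement": agreement,
        }
    return results

print(synthesize({"concerning": [7, 8, 6]}))                   # average       -> 7
print(synthesize({"concerning": [7, 8, 7]}, "majority_vote"))  # majority vote -> 7
```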
To add custom dimensions alongside the built-in ones:

```python
# Assumes DIMENSIONS (the built-in dimension set) is exported by tribunal_judge.
from tribunal_judge import DIMENSIONS, tribunal_alignment_judge

custom_dimensions = {
    "my_dimension": "Description of what to look for...",
}
scorer = tribunal_alignment_judge(
    dimensions={**DIMENSIONS, **custom_dimensions}
)
```

To use a different panel of critic models:

```python
config = TribunalJudgeConfig(
    critic_models=[
        "claude-3-opus-20240229",  # Most capable
        "gpt-4-turbo-preview",     # Strong alternative
        "gemini-1.5-pro",          # Different perspective
    ],
)
```

| Configuration | API Calls per Transcript | Estimated Cost | Time |
|---|---|---|---|
| quick_tribunal_judge | ~20 | $0.10-0.50 | 30-60s |
| thorough_tribunal_judge | ~300+ | $5-15 | 5-15min |
| Single judge (baseline) | ~1 | $0.05-0.20 | 10-30s |
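One way to sanity-check these figures is a rough cost model (an assumption, not instrumented data): roughly one call per critic per dimension for the initial assessment and for each deliberation round, plus one validator call per dimension when validation is enabled.

```python
# Back-of-the-envelope call count under the assumed cost model above.
def estimated_calls(dimensions: int, critics: int, rounds: int, validation: bool) -> int:
    calls = dimensions * critics * (1 + rounds)  # initial pass + deliberation rounds
    if validation:
        calls += dimensions                      # one validator call per dimension
    return calls

print(estimated_calls(dimensions=5, critics=2, rounds=1, validation=False))  # quick:    ~20
print(estimated_calls(dimensions=30, critics=3, rounds=3, validation=True))  # thorough: ~390, i.e. 300+
```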
MIT - See individual projects for their licenses.