Hey LSC team 👋! I just want to say this framework has been really helpful and has made our evaluation work a lot easier. As I’ve been building on top of it, I found a component I’d love to explore further. I’m more than happy to take on the work for this feature if the team thinks it would be useful.
Description
Currently, the evaluation framework uses a single LLM judge (specified in config/system.yaml) for all LLM-based metric evaluations. This can introduce single-model bias, as different LLMs may have varying strengths, weaknesses, and evaluation tendencies.
Proposed Solution
Extend the framework to support an optional "panel of judges" feature that allows running the same evaluation with multiple LLM providers/models simultaneously and aggregating their scores. This would provide more robust, diverse evaluations while maintaining full backward compatibility with existing single-LLM configurations. A possible shape for the new section in config/system.yaml:
panel_of_judges:
  enabled: false  # Default: false (uses the single LLM configured above)
  # Which metric types should use panel evaluation
  apply_to:
    - geval   # Apply to all GEval metrics (if #97 is merged)
    - custom  # Apply to custom LLM metrics
  # How to combine scores from multiple judges | baseline options; more could be added in the future
  aggregation_method: "mean"  # Options: "mean", "majority_vote"
  # List of judge configurations
  judges:
    - provider: "openai"
      model: "gpt-4o-mini"
      temperature: 0.0
      max_tokens: 512
      timeout: 300
      num_retries: 3
      # Perhaps we can autogenerate cache paths via Pydantic if not specified (see the sketch below)
      cache_dir: ".caches/panel_cache/openai_gpt4o-mini"
      cache_enabled: true
    - provider: "anthropic"
      model: "claude-3-5-sonnet-20241022"
      temperature: 0.0
      max_tokens: 512
      timeout: 300
      num_retries: 3
      cache_dir: ".caches/panel_cache/anthropic_claude-3-5-sonnet"
      cache_enabled: true
    - provider: "vertex"
      model: "gemini-2.0-flash"
      temperature: 0.0
      max_tokens: 512
      timeout: 300
      num_retries: 3
      cache_dir: ".caches/panel_cache/vertex_gemini-2.0-flash"
      cache_enabled: true
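To make the section above easier to validate, the judge entries could be modelled with Pydantic, which would also let cache_dir default to a path derived from the provider and model when it is omitted. This is only a rough sketch of the idea; the class names, field defaults, and path scheme are hypothetical, not existing framework code.

from typing import List, Literal, Optional

from pydantic import BaseModel, model_validator


class PanelJudgeConfig(BaseModel):
    # One judge in the panel; fields mirror the proposed YAML above.
    provider: str
    model: str
    temperature: float = 0.0
    max_tokens: int = 512
    timeout: int = 300
    num_retries: int = 3
    cache_dir: Optional[str] = None
    cache_enabled: bool = True

    @model_validator(mode="after")
    def default_cache_dir(self) -> "PanelJudgeConfig":
        # Autogenerate a per-judge cache path when none is given,
        # e.g. ".caches/panel_cache/openai_gpt-4o-mini".
        if self.cache_dir is None:
            safe_model = self.model.replace("/", "_")
            self.cache_dir = f".caches/panel_cache/{self.provider}_{safe_model}"
        return self


class PanelOfJudgesConfig(BaseModel):
    # The whole panel_of_judges block; disabled by default for backward compatibility.
    enabled: bool = False
    apply_to: List[Literal["geval", "custom"]] = []
    aggregation_method: Literal["mean", "majority_vote"] = "mean"
    judges: List[PanelJudgeConfig] = []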
Open Questions
- Judge Failure: If one judge fails (API error, timeout), should we:
  - Use partial results from the remaining judges?
  - Fail the entire evaluation?
  - Make it configurable?
- Score Aggregation:
  - How should we combine the reasoning/explanations from multiple judges?
  - Should we expose individual judge scores and reasons, or only the aggregated result?
- Output Handling:
  - How do panel results appear in graphs?
  - Should we track and report when judges significantly disagree?
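To make the failure-handling and aggregation questions more concrete, here is one possible approach sketched as a small helper: aggregate whatever scores the surviving judges returned, support the two baseline methods, keep the per-judge scores for reporting, and flag large disagreement. The function name, thresholds, and return shape are illustrative assumptions, not framework code.

from statistics import mean
from typing import Dict, List, Optional


def aggregate_panel_scores(
    scores: List[Optional[float]],        # one entry per judge; None = that judge failed
    method: str = "mean",
    pass_threshold: float = 0.5,          # used by majority_vote (assumed cutoff)
    disagreement_threshold: float = 0.3,  # max spread before we flag disagreement
) -> Dict:
    # Use partial results: drop failed judges instead of failing the whole evaluation.
    valid = [s for s in scores if s is not None]
    if not valid:
        raise RuntimeError("All judges failed; no scores to aggregate.")

    if method == "mean":
        aggregated = mean(valid)
    elif method == "majority_vote":
        # Each judge "votes" pass/fail against the threshold; the panel score is
        # the fraction of passing votes.
        votes = [s >= pass_threshold for s in valid]
        aggregated = sum(votes) / len(votes)
    else:
        raise ValueError(f"Unknown aggregation method: {method}")

    return {
        "score": aggregated,
        "individual_scores": scores,           # keep per-judge detail for output/graphs
        "num_failed_judges": scores.count(None),
        "judges_disagree": (max(valid) - min(valid)) > disagreement_threshold,
    }

For example, aggregate_panel_scores([0.9, None, 0.4], method="mean") would return a score of 0.65, record one failed judge, and flag the 0.5 spread as disagreement. Whether partial results are acceptable, and what the thresholds should be, would of course be configurable and up for discussion.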
I’m more than happy to discuss this implementation in greater detail, walk through potential design options, or collaborate on an approach that fits well with the project’s direction. Thanks for reading! 😸