Hey LSC team 👋! I just want to say this framework has been really helpful and has made our evaluation work a lot easier. As I’ve been building on top of it, I found a component I’d love to explore further. I’m more than happy to take on the work for this feature if the team thinks it would be useful.
Description
Currently, the evaluation framework uses a single LLM judge (specified in config/system.yaml) for all LLM-based metric evaluations. This can introduce single-model bias, as different LLMs may have varying strengths, weaknesses, and evaluation tendencies.
Proposed Solution
Extend the framework to support an optional "panel of judges" feature that allows running the same evaluation with multiple LLM providers/models simultaneously and aggregating their scores. This would provide more robust, diverse evaluations while maintaining full backward compatibility with existing single-LLM configurations. A possible shape for the new section in config/system.yaml:
panel_of_judges:
  enabled: false  # Default: false (uses the single LLM configured above)
  # Which metric types should use panel evaluation
  apply_to:
    - geval   # Apply to all GEval metrics (if #97 is merged)
    - custom  # Apply to custom LLM metrics
  # How to combine scores from multiple judges | baseline options; more could be added in the future
  aggregation_method: "mean"  # Options: "mean", "majority_vote"
  # List of judge configurations
  judges:
    - provider: "openai"
      model: "gpt-4o-mini"
      temperature: 0.0
      max_tokens: 512
      timeout: 300
      num_retries: 3
      # Perhaps we can autogenerate cache paths via Pydantic if not specified (see the sketch below)
      cache_dir: ".caches/panel_cache/openai_gpt4o-mini"
      cache_enabled: true
    - provider: "anthropic"
      model: "claude-3-5-sonnet-20241022"
      temperature: 0.0
      max_tokens: 512
      timeout: 300
      num_retries: 3
      cache_dir: ".caches/panel_cache/anthropic_claude-3-5-sonnet"
      cache_enabled: true
    - provider: "vertex"
      model: "gemini-2.0-flash"
      temperature: 0.0
      max_tokens: 512
      timeout: 300
      num_retries: 3
      cache_dir: ".caches/panel_cache/vertex_gemini-2.0-flash"
      cache_enabled: true
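To make the section above easier to validate, the judge entries could be modelled with Pydantic, which would also let cache_dir default to a path derived from the provider and model when it is omitted. This is only a rough sketch of the idea; the class names, field defaults, and path scheme are hypothetical, not existing framework code.

from typing import List, Literal, Optional

from pydantic import BaseModel, model_validator


class PanelJudgeConfig(BaseModel):
    # One judge in the panel; fields mirror the proposed YAML above.
    provider: str
    model: str
    temperature: float = 0.0
    max_tokens: int = 512
    timeout: int = 300
    num_retries: int = 3
    cache_dir: Optional[str] = None
    cache_enabled: bool = True

    @model_validator(mode="after")
    def default_cache_dir(self) -> "PanelJudgeConfig":
        # Autogenerate a per-judge cache path when none is given,
        # e.g. ".caches/panel_cache/openai_gpt-4o-mini".
        if self.cache_dir is None:
            safe_model = self.model.replace("/", "_")
            self.cache_dir = f".caches/panel_cache/{self.provider}_{safe_model}"
        return self


class PanelOfJudgesConfig(BaseModel):
    # The whole panel_of_judges block; disabled by default for backward compatibility.
    enabled: bool = False
    apply_to: List[Literal["geval", "custom"]] = []
    aggregation_method: Literal["mean", "majority_vote"] = "mean"
    judges: List[PanelJudgeConfig] = []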
Open Questions
- Judge Failure: If one judge fails (API error, timeout), should we:
  - Use partial results from the remaining judges?
  - Fail the entire evaluation?
  - Make it configurable?
- Score Aggregation:
  - How should we combine the reasoning/explanations from multiple judges?
  - Should we expose individual judge scores and reasons, or only the aggregated result?
- Output Handling:
  - How do panel results appear in graphs?
  - Should we track and report when judges significantly disagree?
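To make the failure-handling and aggregation questions more concrete, here is one possible approach sketched as a small helper: aggregate whatever scores the surviving judges returned, support the two baseline methods, keep the per-judge scores for reporting, and flag large disagreement. The function name, thresholds, and return shape are illustrative assumptions, not framework code.

from statistics import mean
from typing import Dict, List, Optional


def aggregate_panel_scores(
    scores: List[Optional[float]],        # one entry per judge; None = that judge failed
    method: str = "mean",
    pass_threshold: float = 0.5,          # used by majority_vote (assumed cutoff)
    disagreement_threshold: float = 0.3,  # max spread before we flag disagreement
) -> Dict:
    # Use partial results: drop failed judges instead of failing the whole evaluation.
    valid = [s for s in scores if s is not None]
    if not valid:
        raise RuntimeError("All judges failed; no scores to aggregate.")

    if method == "mean":
        aggregated = mean(valid)
    elif method == "majority_vote":
        # Each judge "votes" pass/fail against the threshold; the panel score is
        # the fraction of passing votes.
        votes = [s >= pass_threshold for s in valid]
        aggregated = sum(votes) / len(votes)
    else:
        raise ValueError(f"Unknown aggregation method: {method}")

    return {
        "score": aggregated,
        "individual_scores": scores,           # keep per-judge detail for output/graphs
        "num_failed_judges": scores.count(None),
        "judges_disagree": (max(valid) - min(valid)) > disagreement_threshold,
    }

For example, aggregate_panel_scores([0.9, None, 0.4], method="mean") would return a score of 0.65, record one failed judge, and flag the 0.5 spread as disagreement. Whether partial results are acceptable, and what the thresholds should be, would of course be configurable and up for discussion.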
I’m more than happy to discuss this implementation in greater detail, walk through potential design options, or collaborate on an approach that fits well with the project’s direction. Thanks for reading! 😸