
Add Panel of Judges Support for LLM-based Metrics #104

@arin-deloatch


Hey LSC team 👋! I just want to say this framework has been really helpful and has made our evaluation work a lot easier. As I’ve been building on top of it, I found a component I’d love to explore further. I’m more than happy to take on the work for this feature if the team thinks it would be useful.

Description

Currently, the evaluation framework uses a single LLM judge (specified in config/system.yaml) for all LLM-based metric evaluations. This can introduce single-model bias, as different LLMs may have varying strengths, weaknesses, and evaluation tendencies.

Proposed Solution

Extend the framework to support an optional panel of judges feature that allows running the same evaluation with multiple LLM providers/models simultaneously and aggregating their scores. This would provide more robust, diverse evaluations while maintaining full backward compatibility with existing single-LLM configurations.

  panel_of_judges:
    enabled: false  # Default: false (uses the single LLM configured above)

    # Which metric types should use panel evaluation
    apply_to:
      - geval        # Apply to all GEval metrics (if #97 is merged)
      - custom       # Apply to custom LLM metrics

    # How to combine scores from multiple judges.
    # Baseline options; more could be added in the future.
    aggregation_method: "mean"  # Options: "mean", "majority_vote"

    # List of judge configurations
    judges:
      - provider: "openai"
        model: "gpt-4o-mini"
        temperature: 0.0
        max_tokens: 512
        timeout: 300
        num_retries: 3
        # Cache paths could perhaps be autogenerated via Pydantic if not specified
        cache_dir: ".caches/panel_cache/openai_gpt4o-mini"
        cache_enabled: true

      - provider: "anthropic"
        model: "claude-3-5-sonnet-20241022"
        temperature: 0.0
        max_tokens: 512
        timeout: 300
        num_retries: 3
        cache_dir: ".caches/panel_cache/anthropic_claude-3-5-sonnet"
        cache_enabled: true

      - provider: "vertex"
        model: "gemini-2.0-flash"
        temperature: 0.0
        max_tokens: 512
        timeout: 300
        num_retries: 3
        cache_dir: ".caches/panel_cache/vertex_gemini-2.0-flash"
        cache_enabled: true
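
To make the aggregation semantics concrete, here is a minimal sketch of how the aggregation_method options above could work. Everything here is hypothetical (the JudgeResult shape, the aggregate_scores helper, and the pass/fail threshold are my own illustration names, not existing framework APIs):

  from dataclasses import dataclass
  from statistics import mean
  from typing import List


  @dataclass
  class JudgeResult:
      """Score and reasoning returned by one judge (hypothetical shape)."""
      provider: str
      model: str
      score: float
      reason: str


  def aggregate_scores(results: List[JudgeResult], method: str = "mean",
                       threshold: float = 0.5) -> float:
      """Combine per-judge scores into a single panel score."""
      scores = [r.score for r in results]
      if method == "mean":
          return mean(scores)
      if method == "majority_vote":
          # Treat each score as pass/fail against the threshold and return
          # 1.0 if a strict majority of judges voted "pass", else 0.0.
          votes = [s >= threshold for s in scores]
          return 1.0 if sum(votes) > len(votes) / 2 else 0.0
      raise ValueError(f"Unknown aggregation_method: {method}")

The point of keeping aggregation behind a single function like this is that new methods (median, trimmed mean, weighted judges, etc.) could be added later without touching the per-judge execution path.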

Open Questions

  • Judge Failure: If one judge fails (API error, timeout), should we:
    • Use partial results from the remaining judges?
    • Fail the entire evaluation?
    • Make it configurable? (one possible configurable approach is sketched after this list)
  • Score Aggregation:
    • How should we combine multiple judge reasoning/explanations?
    • Should we expose individual judge scores and reasons, or only the aggregated result?
  • Output Handling:
    • How do panel results appear in graphs?
    • Should we track and report when judges significantly disagree?
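
For the judge-failure and disagreement questions, here is a rough sketch of one configurable approach, building on the aggregate_scores helper sketched above. The evaluate_with_panel name and the on_judge_error / disagreement_threshold knobs are hypothetical, purely to illustrate the design space:

  from statistics import pstdev


  def evaluate_with_panel(judges, metric_input, on_judge_error: str = "partial",
                          disagreement_threshold: float = 0.25) -> dict:
      """Run every judge, optionally tolerate failures, and flag disagreement.

      `judges` is a list of callables that each return a JudgeResult.
      on_judge_error="partial" keeps going with the remaining judges;
      "fail" aborts the whole evaluation on the first error.
      """
      results, errors = [], []
      for judge in judges:
          try:
              results.append(judge(metric_input))
          except Exception as exc:  # API error, timeout, etc.
              if on_judge_error == "fail":
                  raise
              errors.append(exc)

      scores = [r.score for r in results]
      disagreement = pstdev(scores) if len(scores) > 1 else 0.0
      return {
          "panel_score": aggregate_scores(results),
          "individual_results": results,   # expose per-judge scores and reasons
          "failed_judges": len(errors),
          "high_disagreement": disagreement > disagreement_threshold,
      }

Returning the individual results alongside the aggregate would let the output layer decide whether to show only the panel score or the full breakdown, and the high_disagreement flag gives graphs/reports something concrete to surface when judges diverge.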

I’m more than happy to discuss this implementation in greater detail, walk through potential design options, or collaborate on an approach that fits well with the project’s direction. Thanks for reading! 😸
