
LLM Evaluation Uncertainty

Conformal prediction for LLM-as-judge systems to quantify evaluation uncertainty.

What's in this repo

  • LLM-as-judge evaluation of code correctness using Sonnet 4
  • Conformal prediction with multiple strategies (standard, improved, aggressive); see the sketch after this list
  • Interactive visualization that shows LLM-as-judge decisions and uncertainty
  • Uncertainty quantification with prediction intervals
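
How the conformal step works, in miniature: hold out a calibration split, measure how far the judge's scores are from the reference answers there, and use a quantile of those errors to build an interval around each new judge score. The sketch below is an illustration only, using the textbook split-conformal recipe with absolute error as the nonconformity score; the names (judge_scores_cal, true_scores_cal, alpha) are assumptions, not this repo's implementation.

# Minimal split-conformal sketch (names and nonconformity choice are
# assumptions for illustration, not this repo's actual code).
import numpy as np

def conformal_quantile(judge_scores_cal, true_scores_cal, alpha=0.1):
    # Nonconformity on the calibration split: absolute judging error.
    residuals = np.abs(np.asarray(true_scores_cal) - np.asarray(judge_scores_cal))
    n = len(residuals)
    # Finite-sample-corrected quantile level, capped at 1.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(residuals, level, method="higher")

def prediction_interval(judge_score, qhat):
    # Interval intended to cover the true score with probability >= 1 - alpha.
    return judge_score - qhat, judge_score + qhat

A wider interval flags an example where the judge's score is less trustworthy.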

Development

Setup

  1. Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Set up the project:
uv sync
  3. Set environment variables:
export ANTHROPIC_API_KEY="your-api-key"
export LANGSMITH_API_KEY="your-langsmith-key"
export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"

Usage

Running the Evaluation

# Run the complete evaluation pipeline
uv run python -m llm_eval_uncertainty.main

This will:

  1. Run LLM-as-judge evaluations on code correctness problems (a judge-call sketch follows this list)
  2. Apply conformal prediction to generate uncertainty intervals
  3. Save results to results.txt
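
The judge step amounts to sending each problem/solution pair to a Claude model and parsing its verdict. Below is a hedged sketch of such a call with the anthropic Python SDK; the prompt wording, the 0-100 score format, and the exact model identifier are assumptions for illustration, not this repo's code.

# Illustrative judge call (prompt, score format, and model id are assumptions).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_code(problem: str, solution: str) -> float:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed Sonnet 4 identifier
        max_tokens=16,
        messages=[{
            "role": "user",
            "content": (
                "Grade the following solution for correctness.\n\n"
                f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
                "Reply with only a number from 0 (wrong) to 100 (fully correct)."
            ),
        }],
    )
    # The verdict is the text of the first content block in the response.
    return float(response.content[0].text.strip())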

Interactive Visualization

# Launch the Streamlit dashboard
uv run streamlit run llm_eval_uncertainty/viz.py

The dashboard shows:

  • Prediction intervals for each example
  • Coverage statistics (a sketch of the computation follows this list)
  • Decision breakdowns
  • Interactive exploration of results
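
Coverage here is the fraction of examples whose true score actually lands inside its prediction interval; with split conformal at miscoverage level alpha it should come out at or above 1 - alpha on average. A minimal way to compute it, assuming intervals in the (low, high) tuple form used in the sketch above:

# Empirical coverage over held-out examples (intervals as (low, high) tuples).
def empirical_coverage(intervals, true_scores):
    hits = sum(1 for (low, high), y in zip(intervals, true_scores) if low <= y <= high)
    return hits / len(true_scores)

# e.g. empirical_coverage([(60, 90), (10, 40)], [75, 55]) -> 0.5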

Development commands

# Format and lint
uv run ruff format
uv run ruff check

# Add dependencies
uv add package-name
