Conformal prediction for LLM-as-judge systems to quantify evaluation uncertainty.
- LLM-as-judge evaluation using Sonnet 4 for code correctness
- Conformal prediction with multiple strategies (standard, improved, and aggressive)
- Interactive visualization of LLM-as-judge decisions and their uncertainty
- Uncertainty quantification with prediction intervals (see the sketch below)
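To make the idea concrete, here is a minimal split-conformal sketch, assuming binary correct/incorrect verdicts and a single judge confidence score per example. The synthetic data, nonconformity definition, and `prediction_set` helper are illustrative placeholders, not this repository's implementation.

```python
import numpy as np

# Minimal split-conformal sketch (illustrative only, not this repo's code).
# Assume each calibration example has a judge confidence in [0, 1] for the
# "correct" label and a ground-truth 0/1 label. The data here is synthetic.
rng = np.random.default_rng(0)
cal_scores = rng.uniform(0.3, 1.0, size=200)                       # hypothetical judge confidences
cal_labels = (cal_scores + rng.normal(0, 0.2, 200) > 0.6).astype(int)

alpha = 0.1  # target miscoverage: aim for >= 90% coverage

# Nonconformity: one minus the score the judge assigned to the true label.
nonconformity = np.where(cal_labels == 1, 1 - cal_scores, cal_scores)

# Conformal quantile with the finite-sample correction.
n = len(nonconformity)
q_level = np.ceil((n + 1) * (1 - alpha)) / n
q_hat = np.quantile(nonconformity, min(q_level, 1.0), method="higher")

def prediction_set(score: float) -> set[int]:
    """Return every label whose nonconformity falls below the calibrated threshold."""
    labels = set()
    if 1 - score <= q_hat:   # "correct" (1) is plausible
        labels.add(1)
    if score <= q_hat:       # "incorrect" (0) is plausible
        labels.add(0)
    return labels

print(prediction_set(0.95), prediction_set(0.55))
```

The calibrated threshold `q_hat` makes the prediction set contain the true label for roughly a `1 - alpha` fraction of future examples, which is what the coverage statistics refer to.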
- Install uv:

  ```bash
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Set up the project:

  ```bash
  uv sync
  ```

- Set environment variables:

  ```bash
  export ANTHROPIC_API_KEY="your-api-key"
  export LANGSMITH_API_KEY="your-langsmith-key"
  export LANGSMITH_ENDPOINT="https://api.smith.langchain.com"
  ```

```bash
# Run the complete evaluation pipeline
uv run python -m llm_eval_uncertainty.main
```

This will:
- Run LLM-as-judge evaluations on code correctness problems
- Apply conformal prediction to generate uncertainty intervals
- Save results to `results.txt`
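Each evaluation is an LLM-as-judge call. As a rough sketch of what a single call might look like with the Anthropic Python SDK (the model ID, prompt, and return format here are assumptions, not necessarily what `llm_eval_uncertainty.main` does):

```python
import os
import anthropic

# Illustrative judge call (assumed prompt and model; not necessarily what main.py does).
client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

def judge_code(problem: str, solution: str) -> str:
    """Ask the model whether a solution is correct; returns the raw verdict text."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumed Sonnet 4 model ID
        max_tokens=256,
        messages=[{
            "role": "user",
            "content": (
                "You are grading code for correctness.\n"
                f"Problem:\n{problem}\n\nSolution:\n{solution}\n\n"
                "Answer CORRECT or INCORRECT, then give a confidence from 0 to 1."
            ),
        }],
    )
    return response.content[0].text
```

In practice the verdict and confidence would be parsed out of the returned text before being fed into the conformal calibration step.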
```bash
# Launch the Streamlit dashboard
uv run streamlit run llm_eval_uncertainty/viz.py
```

The dashboard shows:
- Prediction intervals for each example
- Coverage statistics
- Decision breakdowns
- Interactive exploration of results
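For orientation, a generic Streamlit sketch of how such a view can be put together; the column names and example data are hypothetical, and this is not the contents of `viz.py`:

```python
import pandas as pd
import streamlit as st

# Generic uncertainty-dashboard sketch (not the actual viz.py); data is hypothetical.
df = pd.DataFrame({
    "example": ["prob_1", "prob_2", "prob_3"],
    "judge_decision": ["correct", "incorrect", "correct"],
    "interval_low": [0.62, 0.10, 0.48],
    "interval_high": [0.93, 0.41, 0.88],
    "covered": [True, True, False],
})

st.title("LLM-as-judge uncertainty")
st.metric("Empirical coverage", f"{df['covered'].mean():.0%}")
st.bar_chart(df.set_index("example")[["interval_low", "interval_high"]])
st.dataframe(df)  # interactive exploration of per-example results
```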
```bash
# Format and lint
uv run ruff format
uv run ruff check

# Add dependencies
uv add package-name
```