A comprehensive project for learning and implementing OpenAI's evaluation framework. This project demonstrates how to create, run, and analyze evaluations for Large Language Models (LLMs) using the OpenAI Evals framework.
This project provides hands-on experience with:
- Creating custom evaluation datasets
- Building evaluation configurations
- Running both basic and model-graded evaluations
- Analyzing evaluation results
- Understanding different evaluation patterns
Project structure:

```text
eval_learning_project/
├── requirements.txt            # Python dependencies
├── .env.example                # Environment variables template
├── README.md                   # This file
├── setup.py                    # Package setup file
│
├── data/                       # Evaluation datasets
│   ├── basic_evals/            # Basic evaluation datasets
│   ├── model_graded/           # Model-graded evaluation datasets
│   └── custom/                 # Custom evaluation datasets
│
├── evals/                      # Evaluation configurations
│   ├── registry/               # Evaluation registry files
│   │   ├── evals/              # YAML eval configurations
│   │   └── modelgraded/        # Model-graded specifications
│   └── custom/                 # Custom evaluation classes
│
├── scripts/                    # Utility scripts
│   ├── run_eval.py             # Script to run evaluations
│   ├── generate_data.py        # Generate synthetic eval data
│   └── analyze_results.py      # Analyze evaluation results
│
├── notebooks/                  # Jupyter notebooks
│   ├── 01_basic_evals.ipynb
│   ├── 02_model_graded_evals.ipynb
│   └── 03_custom_evals.ipynb
│
└── results/                    # Evaluation results and logs
    ├── logs/                   # Evaluation logs
    └── reports/                # Analysis reports
```
Set up this project:

```bash
# Clone or navigate to the project directory
cd eval_learning_project

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your OpenAI API key
```

Install the OpenAI Evals framework:

```bash
# Clone the OpenAI evals repository
git clone https://github.com/openai/evals.git
cd evals

# Install in development mode
pip install -e .

# Fetch evaluation data (requires Git LFS)
git lfs fetch --all
git lfs pull
```

Run and analyze evaluations:

```bash
# Run a basic evaluation
python scripts/run_eval.py --eval basic_math --model gpt-3.5-turbo

# Run a model-graded evaluation
python scripts/run_eval.py --eval creative_writing --model gpt-3.5-turbo

# Analyze results
python scripts/analyze_results.py --log-file results/logs/latest.jsonl
```
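The contents of scripts/analyze_results.py are not shown here; as a rough illustration, the minimal sketch below reads an evaluation log and reports accuracy. It assumes the log is a JSONL file in which each line is a JSON object, a summary appears under a final_report key, and per-sample results are events of type "match" carrying a boolean data.correct; these field names are assumptions, so verify them against your actual logs.

```python
# Minimal log-analysis sketch (assumed JSONL layout; check against your logs).
import json
import sys
from pathlib import Path


def summarize(log_path: str) -> None:
    """Count match events and report accuracy from an eval log."""
    correct = 0
    total = 0
    final_report = None
    for line in Path(log_path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumed: the framework writes a summary line under "final_report".
        if "final_report" in record:
            final_report = record["final_report"]
        # Assumed: per-sample results appear as events of type "match".
        if record.get("type") == "match":
            total += 1
            if record.get("data", {}).get("correct"):
                correct += 1
    if final_report is not None:
        print("final report:", final_report)
    if total:
        print(f"match events: {correct}/{total} correct ({correct / total:.1%})")


if __name__ == "__main__":
    summarize(sys.argv[1] if len(sys.argv) > 1 else "results/logs/latest.jsonl")
```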
The project covers three evaluation patterns:

- Basic evals (deterministic grading): Exact string matching, regex patterns
  - Use cases: Math problems, factual questions, code syntax validation
  - Examples: Multiple choice questions, simple Q&A
- Model-graded evals (LLM-based grading): Uses another model to evaluate responses
  - Use cases: Creative writing, explanations, complex reasoning
  - Examples: Essay grading, code quality assessment
- Custom evals (custom logic): Python code for specialized evaluation needs (see the sketch after this list)
  - Use cases: Domain-specific requirements, complex scoring algorithms
  - Examples: Code execution, API integration tests
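To make the custom pattern concrete, here is a small, framework-independent sketch of custom grading logic. It is not the evals custom-eval API, just the idea of applying your own Python check to a model's completion; here, numeric answers are accepted within a tolerance.

```python
# Framework-independent sketch of custom grading logic; not the evals API.
import re


def grade_numeric_answer(completion: str, ideal: str, tolerance: float = 1e-6) -> bool:
    """Custom rule: accept any completion whose last number matches the ideal answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return False
    try:
        return abs(float(numbers[-1]) - float(ideal)) <= tolerance
    except ValueError:
        return False


# Example: a verbose model answer still passes the custom check.
assert grade_numeric_answer("Let's see: 2 + 2 equals 4.", "4")
assert not grade_numeric_answer("I think the answer is 5.", "4")
```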
To create a basic evaluation:

- Create a dataset in JSONL format (a small generation sketch follows these steps):

```json
{"input": [{"role": "user", "content": "What is 2+2?"}], "ideal": "4"}
```

- Create a YAML configuration in the eval registry:

```yaml
my_math_eval:
  id: my_math_eval.v1
  metrics: [accuracy]
  description: "Basic math evaluation"

my_math_eval.v1:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: data/basic_evals/math_problems.jsonl
```

- Run the evaluation:
```bash
oaieval gpt-3.5-turbo my_math_eval
```
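The contents of scripts/generate_data.py are not prescribed here; as a hedged sketch, it could emit arithmetic samples in the JSONL format shown above. The output path and the addition-problem template are illustrative choices.

```python
# Sketch of a synthetic-data generator for the JSONL sample format used above.
import json
import random
from pathlib import Path


def generate_math_samples(n: int, path: str = "data/basic_evals/math_problems.jsonl") -> None:
    """Write n simple addition problems as chat-formatted eval samples."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        for _ in range(n):
            a, b = random.randint(1, 99), random.randint(1, 99)
            sample = {
                "input": [{"role": "user", "content": f"What is {a}+{b}? Answer with just the number."}],
                "ideal": str(a + b),
            }
            f.write(json.dumps(sample) + "\n")


if __name__ == "__main__":
    generate_math_samples(50)
```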
To create a model-graded evaluation:

- Create a dataset with ideal responses (a grading sketch follows these steps):

```json
{
  "input": [{"role": "user", "content": "Write a short story about AI"}],
  "ideal": "A creative, coherent short story with clear narrative structure"
}
```

- Create a model-graded specification and the matching YAML config
- Run the evaluation with a model grader
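The model-graded specification itself lives in the registry files; to illustrate the underlying idea outside the framework, the sketch below calls the openai Python client directly as an LLM judge. The model name, the YES/NO protocol, and the prompt wording are illustrative assumptions, not part of this project, and OPENAI_API_KEY is assumed to be set.

```python
# Illustrative LLM-as-judge grading outside the evals framework.
# Assumes OPENAI_API_KEY is set; model and prompt wording are arbitrary choices.
from openai import OpenAI

client = OpenAI()


def judge(response: str, criteria: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a grader model whether a response meets the given criteria (YES/NO)."""
    grading_prompt = (
        "You are grading another model's answer.\n"
        f"Criteria: {criteria}\n"
        f"Answer to grade:\n{response}\n"
        "Reply with exactly YES or NO."
    )
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return result.choices[0].message.content.strip().upper().startswith("YES")


# Example usage:
# ok = judge(story_text, "A creative, coherent short story with clear narrative structure")
```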
Built-in metrics and match types include (a simplified sketch of the string-based ones follows this list):

- Accuracy: Percentage of correct responses
- Match: Exact string matching
- Includes: Substring matching
- Fuzzy Match: Approximate string matching
- Model-Graded: Custom scoring using LLM judges
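For intuition, here is a simplified sketch of how the string-based match types can be computed. It mirrors the idea rather than the framework's exact implementation; the 0.8 fuzzy threshold, whitespace stripping, and lower-casing are arbitrary choices.

```python
# Simplified string-metric sketch; thresholds and normalization are assumptions.
from difflib import SequenceMatcher


def exact_match(completion: str, ideal: str) -> bool:
    """Match: the completion equals the ideal answer (ignoring surrounding whitespace)."""
    return completion.strip() == ideal.strip()


def includes(completion: str, ideal: str) -> bool:
    """Includes: the ideal answer appears somewhere in the completion."""
    return ideal.strip() in completion


def fuzzy_match(completion: str, ideal: str, threshold: float = 0.8) -> bool:
    """Fuzzy match: similarity ratio above a threshold."""
    ratio = SequenceMatcher(None, completion.strip().lower(), ideal.strip().lower()).ratio()
    return ratio >= threshold


print(exact_match("4", "4"), includes("The answer is 4.", "4"), fuzzy_match("fourty-two", "forty-two"))
```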
The framework also provides:

- Custom completion logic for complex systems
- Support for multi-step workflows (see the sketch after this list)
- Integration with external tools and APIs
- Pre-built templates for common evaluation patterns
- Extensible framework for custom evaluation logic
- Support for various input/output formats
- Detailed logging and reporting
- Performance metrics and visualizations
- Failure analysis and debugging tools
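As a rough illustration of multi-step completion logic (not the framework's completion-function interface, which is documented in the evals repository), the sketch below chains a planning call and an answering call before returning a single completion string. The class name, prompts, and model are illustrative, and OPENAI_API_KEY is assumed to be set.

```python
# Illustrative multi-step completion wrapper; not the evals completion-fn interface.
from openai import OpenAI

client = OpenAI()


class TwoStepCompletion:
    """Callable that plans first, then answers, and returns one completion string."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model

    def _chat(self, content: str) -> str:
        result = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": content}],
        )
        return result.choices[0].message.content

    def __call__(self, prompt: str) -> str:
        plan = self._chat(f"Outline the steps needed to answer:\n{prompt}")
        return self._chat(f"Question:\n{prompt}\n\nPlan:\n{plan}\n\nGive the final answer.")


# Example usage:
# completion = TwoStepCompletion()("What is 17 * 24?")
```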
The included notebooks cover:

- 01_basic_evals.ipynb: Introduction to basic evaluations
- 02_model_graded_evals.ipynb: Working with model-graded evals
- 03_custom_evals.ipynb: Creating custom evaluation logic
Example evaluations included in this project:

- Math problem evaluation
- Creative writing assessment
- Code quality evaluation
- Question answering validation
To contribute:

- Fork the repository
- Create a feature branch
- Add your evaluation or improvement
- Test thoroughly
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- API Costs: Be aware of OpenAI API usage costs when running evaluations
- Rate Limits: Consider API rate limits for large evaluation sets
- Data Privacy: Ensure evaluation data complies with your organization's privacy policies
- Model Versions: Different model versions may produce different results
For questions and support:
- Open an issue in this repository
- Check the OpenAI Evals GitHub discussions
- Review the official documentation