A comprehensive project for learning and implementing OpenAI's evaluation framework. This project demonstrates how to create, run, and analyze evaluations for Large Language Models (LLMs) using the OpenAI Evals framework.
This project provides hands-on experience with:
- Creating custom evaluation datasets
- Building evaluation configurations
- Running both basic and model-graded evaluations
- Analyzing evaluation results
- Understanding different evaluation patterns
Project structure:

```text
eval_learning_project/
├── requirements.txt            # Python dependencies
├── .env.example                # Environment variables template
├── README.md                   # This file
├── setup.py                    # Package setup file
│
├── data/                       # Evaluation datasets
│   ├── basic_evals/            # Basic evaluation datasets
│   ├── model_graded/           # Model-graded evaluation datasets
│   └── custom/                 # Custom evaluation datasets
│
├── evals/                      # Evaluation configurations
│   ├── registry/               # Evaluation registry files
│   │   ├── evals/              # YAML eval configurations
│   │   └── modelgraded/        # Model-graded specifications
│   └── custom/                 # Custom evaluation classes
│
├── scripts/                    # Utility scripts
│   ├── run_eval.py             # Script to run evaluations
│   ├── generate_data.py        # Generate synthetic eval data
│   └── analyze_results.py      # Analyze evaluation results
│
├── notebooks/                  # Jupyter notebooks
│   ├── 01_basic_evals.ipynb
│   ├── 02_model_graded_evals.ipynb
│   └── 03_custom_evals.ipynb
│
└── results/                    # Evaluation results and logs
    ├── logs/                   # Evaluation logs
    └── reports/                # Analysis reports
```
Set up this project:

```bash
# Clone or navigate to the project directory
cd eval_learning_project

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your OpenAI API key
```

Install the OpenAI Evals framework:

```bash
# Clone the OpenAI evals repository
git clone https://github.com/openai/evals.git
cd evals

# Install in development mode
pip install -e .

# Fetch evaluation data (requires Git LFS)
git lfs fetch --all
git lfs pull
```

Run and analyze evaluations:

```bash
# Run a basic evaluation
python scripts/run_eval.py --eval basic_math --model gpt-3.5-turbo

# Run a model-graded evaluation
python scripts/run_eval.py --eval creative_writing --model gpt-3.5-turbo

# Analyze results
python scripts/analyze_results.py --log-file results/logs/latest.jsonl
```
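The contents of scripts/analyze_results.py are not shown here; as a rough illustration, the minimal sketch below reads an evaluation log and reports accuracy. It assumes the log is a JSONL file in which each line is a JSON object, a summary appears under a final_report key, and per-sample results are events of type "match" carrying a boolean data.correct; these field names are assumptions, so verify them against your actual logs.

```python
# Minimal log-analysis sketch (assumed JSONL layout; check against your logs).
import json
import sys
from pathlib import Path


def summarize(log_path: str) -> None:
    """Count match events and report accuracy from an eval log."""
    correct = 0
    total = 0
    final_report = None
    for line in Path(log_path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Assumed: the framework writes a summary line under "final_report".
        if "final_report" in record:
            final_report = record["final_report"]
        # Assumed: per-sample results appear as events of type "match".
        if record.get("type") == "match":
            total += 1
            if record.get("data", {}).get("correct"):
                correct += 1
    if final_report is not None:
        print("final report:", final_report)
    if total:
        print(f"match events: {correct}/{total} correct ({correct / total:.1%})")


if __name__ == "__main__":
    summarize(sys.argv[1] if len(sys.argv) > 1 else "results/logs/latest.jsonl")
```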
The project covers three evaluation patterns:

- Basic evals (deterministic grading): Exact string matching, regex patterns
  - Use cases: Math problems, factual questions, code syntax validation
  - Examples: Multiple choice questions, simple Q&A
- Model-graded evals (LLM-based grading): Uses another model to evaluate responses
  - Use cases: Creative writing, explanations, complex reasoning
  - Examples: Essay grading, code quality assessment
- Custom evals (custom logic): Python code for specialized evaluation needs (see the sketch after this list)
  - Use cases: Domain-specific requirements, complex scoring algorithms
  - Examples: Code execution, API integration tests
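To make the custom pattern concrete, here is a small, framework-independent sketch of custom grading logic. It is not the evals custom-eval API, just the idea of applying your own Python check to a model's completion; here, numeric answers are accepted within a tolerance.

```python
# Framework-independent sketch of custom grading logic; not the evals API.
import re


def grade_numeric_answer(completion: str, ideal: str, tolerance: float = 1e-6) -> bool:
    """Custom rule: accept any completion whose last number matches the ideal answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return False
    try:
        return abs(float(numbers[-1]) - float(ideal)) <= tolerance
    except ValueError:
        return False


# Example: a verbose model answer still passes the custom check.
assert grade_numeric_answer("Let's see: 2 + 2 equals 4.", "4")
assert not grade_numeric_answer("I think the answer is 5.", "4")
```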
To create a basic evaluation:

- Create a dataset in JSONL format (a small generation sketch follows these steps):

```json
{"input": [{"role": "user", "content": "What is 2+2?"}], "ideal": "4"}
```

- Create a YAML configuration in the eval registry:

```yaml
my_math_eval:
  id: my_math_eval.v1
  metrics: [accuracy]
  description: "Basic math evaluation"

my_math_eval.v1:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: data/basic_evals/math_problems.jsonl
```

- Run the evaluation:
```bash
oaieval gpt-3.5-turbo my_math_eval
```
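The contents of scripts/generate_data.py are not prescribed here; as a hedged sketch, it could emit arithmetic samples in the JSONL format shown above. The output path and the addition-problem template are illustrative choices.

```python
# Sketch of a synthetic-data generator for the JSONL sample format used above.
import json
import random
from pathlib import Path


def generate_math_samples(n: int, path: str = "data/basic_evals/math_problems.jsonl") -> None:
    """Write n simple addition problems as chat-formatted eval samples."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        for _ in range(n):
            a, b = random.randint(1, 99), random.randint(1, 99)
            sample = {
                "input": [{"role": "user", "content": f"What is {a}+{b}? Answer with just the number."}],
                "ideal": str(a + b),
            }
            f.write(json.dumps(sample) + "\n")


if __name__ == "__main__":
    generate_math_samples(50)
```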
To create a model-graded evaluation:

- Create a dataset with ideal responses (a grading sketch follows these steps):

```json
{
  "input": [{"role": "user", "content": "Write a short story about AI"}],
  "ideal": "A creative, coherent short story with clear narrative structure"
}
```

- Create a model-graded specification and the matching YAML config
- Run the evaluation with a model grader
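The model-graded specification itself lives in the registry files; to illustrate the underlying idea outside the framework, the sketch below calls the openai Python client directly as an LLM judge. The model name, the YES/NO protocol, and the prompt wording are illustrative assumptions, not part of this project, and OPENAI_API_KEY is assumed to be set.

```python
# Illustrative LLM-as-judge grading outside the evals framework.
# Assumes OPENAI_API_KEY is set; model and prompt wording are arbitrary choices.
from openai import OpenAI

client = OpenAI()


def judge(response: str, criteria: str, model: str = "gpt-4o-mini") -> bool:
    """Ask a grader model whether a response meets the given criteria (YES/NO)."""
    grading_prompt = (
        "You are grading another model's answer.\n"
        f"Criteria: {criteria}\n"
        f"Answer to grade:\n{response}\n"
        "Reply with exactly YES or NO."
    )
    result = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return result.choices[0].message.content.strip().upper().startswith("YES")


# Example usage:
# ok = judge(story_text, "A creative, coherent short story with clear narrative structure")
```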
Built-in metrics and match types include (a simplified sketch of the string-based ones follows this list):

- Accuracy: Percentage of correct responses
- Match: Exact string matching
- Includes: Substring matching
- Fuzzy Match: Approximate string matching
- Model-Graded: Custom scoring using LLM judges
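For intuition, here is a simplified sketch of how the string-based match types can be computed. It mirrors the idea rather than the framework's exact implementation; the 0.8 fuzzy threshold, whitespace stripping, and lower-casing are arbitrary choices.

```python
# Simplified string-metric sketch; thresholds and normalization are assumptions.
from difflib import SequenceMatcher


def exact_match(completion: str, ideal: str) -> bool:
    """Match: the completion equals the ideal answer (ignoring surrounding whitespace)."""
    return completion.strip() == ideal.strip()


def includes(completion: str, ideal: str) -> bool:
    """Includes: the ideal answer appears somewhere in the completion."""
    return ideal.strip() in completion


def fuzzy_match(completion: str, ideal: str, threshold: float = 0.8) -> bool:
    """Fuzzy match: similarity ratio above a threshold."""
    ratio = SequenceMatcher(None, completion.strip().lower(), ideal.strip().lower()).ratio()
    return ratio >= threshold


print(exact_match("4", "4"), includes("The answer is 4.", "4"), fuzzy_match("fourty-two", "forty-two"))
```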
The framework also provides:

- Custom completion logic for complex systems
- Support for multi-step workflows (see the sketch after this list)
- Integration with external tools and APIs
- Pre-built templates for common evaluation patterns
- Extensible framework for custom evaluation logic
- Support for various input/output formats
- Detailed logging and reporting
- Performance metrics and visualizations
- Failure analysis and debugging tools
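As a rough illustration of multi-step completion logic (not the framework's completion-function interface, which is documented in the evals repository), the sketch below chains a planning call and an answering call before returning a single completion string. The class name, prompts, and model are illustrative, and OPENAI_API_KEY is assumed to be set.

```python
# Illustrative multi-step completion wrapper; not the evals completion-fn interface.
from openai import OpenAI

client = OpenAI()


class TwoStepCompletion:
    """Callable that plans first, then answers, and returns one completion string."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.model = model

    def _chat(self, content: str) -> str:
        result = client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": content}],
        )
        return result.choices[0].message.content

    def __call__(self, prompt: str) -> str:
        plan = self._chat(f"Outline the steps needed to answer:\n{prompt}")
        return self._chat(f"Question:\n{prompt}\n\nPlan:\n{plan}\n\nGive the final answer.")


# Example usage:
# completion = TwoStepCompletion()("What is 17 * 24?")
```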
The included notebooks cover:

- 01_basic_evals.ipynb: Introduction to basic evaluations
- 02_model_graded_evals.ipynb: Working with model-graded evals
- 03_custom_evals.ipynb: Creating custom evaluation logic
Example evaluations included in this project:

- Math problem evaluation
- Creative writing assessment
- Code quality evaluation
- Question answering validation
To contribute:

- Fork the repository
- Create a feature branch
- Add your evaluation or improvement
- Test thoroughly
- Submit a pull request
This project is licensed under the MIT License - see the LICENSE file for details.
- API Costs: Be aware of OpenAI API usage costs when running evaluations
- Rate Limits: Consider API rate limits for large evaluation sets
- Data Privacy: Ensure evaluation data complies with your organization's privacy policies
- Model Versions: Different model versions may produce different results
For questions and support:
- Open an issue in this repository
- Check the OpenAI Evals GitHub discussions
- Review the official documentation