OpenAI Evaluations Learning Project

A learning project for exploring OpenAI's evaluation framework. It demonstrates how to create, run, and analyze evaluations for large language models (LLMs) using the OpenAI Evals framework.

🎯 Project Overview

This project provides hands-on experience with:

  • Creating custom evaluation datasets
  • Building evaluation configurations
  • Running both basic and model-graded evaluations
  • Analyzing evaluation results
  • Understanding different evaluation patterns

📁 Project Structure

eval_learning_project/
├── requirements.txt           # Python dependencies
├── .env.example              # Environment variables template
├── README.md                 # This file
├── setup.py                  # Package setup file
│
├── data/                     # Evaluation datasets
│   ├── basic_evals/         # Basic evaluation datasets
│   ├── model_graded/        # Model-graded evaluation datasets
│   └── custom/              # Custom evaluation datasets
│
├── evals/                    # Evaluation configurations
│   ├── registry/            # Evaluation registry files
│   │   ├── evals/          # YAML eval configurations
│   │   └── modelgraded/    # Model-graded specifications
│   └── custom/              # Custom evaluation classes
│
├── scripts/                  # Utility scripts
│   ├── run_eval.py          # Script to run evaluations
│   ├── generate_data.py     # Generate synthetic eval data
│   └── analyze_results.py   # Analyze evaluation results
│
├── notebooks/               # Jupyter notebooks
│   ├── 01_basic_evals.ipynb
│   ├── 02_model_graded_evals.ipynb
│   └── 03_custom_evals.ipynb
│
└── results/                 # Evaluation results and logs
    ├── logs/               # Evaluation logs
    └── reports/            # Analysis reports

🚀 Quick Start

1. Setup Environment

# Clone or navigate to project directory
cd eval_learning_project

# Install dependencies
pip install -r requirements.txt

# Set up environment variables
cp .env.example .env
# Edit .env and add your OpenAI API key
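
The utility scripts need the API key at runtime. A minimal sketch of loading it in Python, assuming python-dotenv is among the dependencies in requirements.txt and that .env.example defines OPENAI_API_KEY (both are assumptions, check those files):

import os

from dotenv import load_dotenv  # assumes python-dotenv is listed in requirements.txt

load_dotenv()  # read variables from the .env file created from .env.example
api_key = os.environ["OPENAI_API_KEY"]  # assumed variable name; check .env.example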

2. Install OpenAI Evals Framework

# Clone the OpenAI evals repository
git clone https://github.com/openai/evals.git
cd evals

# Install in development mode
pip install -e .

# Fetch evaluation data (requires Git LFS)
git lfs fetch --all
git lfs pull

3. Run Sample Evaluations

# Run a basic evaluation
python scripts/run_eval.py --eval basic_math --model gpt-3.5-turbo

# Run a model-graded evaluation
python scripts/run_eval.py --eval creative_writing --model gpt-3.5-turbo

# Analyze results
python scripts/analyze_results.py --log-file results/logs/latest.jsonl

📊 Evaluation Types

Basic Evaluations

  • Deterministic grading: Exact string matching, regex patterns
  • Use cases: Math problems, factual questions, code syntax validation
  • Examples: Multiple choice questions, simple Q&A
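
As a minimal illustration of deterministic grading (a sketch of the idea, not the framework's actual Match implementation):

import re

def exact_match(response: str, ideal: str) -> bool:
    # Deterministic check: the response must equal the ideal answer
    # after trimming surrounding whitespace.
    return response.strip() == ideal.strip()

def regex_match(response: str, pattern: str) -> bool:
    # Deterministic check: some part of the response must match the
    # given regular expression.
    return re.search(pattern, response) is not None

print(exact_match("4", "4"))                      # True
print(regex_match("The answer is 4.", r"\b4\b"))  # True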

Model-Graded Evaluations

  • LLM-based grading: Uses another model to evaluate responses
  • Use cases: Creative writing, explanations, complex reasoning
  • Examples: Essay grading, code quality assessment
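
A minimal sketch of the LLM-as-judge pattern behind these evals; the rubric wording and the gpt-4o-mini grader model are placeholders, not the prompts or defaults the framework ships with:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grade_response(question: str, response: str, criteria: str,
                   grader_model: str = "gpt-4o-mini") -> str:
    # Ask a second model to judge the response against the criteria
    # and return a single PASS/FAIL verdict.
    grading_prompt = (
        f"Question:\n{question}\n\n"
        f"Response to grade:\n{response}\n\n"
        f"Criteria:\n{criteria}\n\n"
        "Does the response satisfy the criteria? Answer PASS or FAIL."
    )
    result = client.chat.completions.create(
        model=grader_model,
        messages=[{"role": "user", "content": grading_prompt}],
    )
    return result.choices[0].message.content.strip()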

Custom Evaluations

  • Custom logic: Python code for specialized evaluation needs
  • Use cases: Domain-specific requirements, complex scoring algorithms
  • Examples: Code execution, API integration tests
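
A framework-agnostic sketch of the idea; a real custom eval plugs into the evals framework's base classes, but the core is just arbitrary Python deciding each sample's score:

from dataclasses import dataclass
from typing import Callable

@dataclass
class Sample:
    prompt: str
    ideal: str

def run_custom_eval(samples: list[Sample],
                    get_completion: Callable[[str], str],
                    score: Callable[[str, str], float]) -> float:
    # Apply arbitrary Python scoring logic to each completion and
    # report the mean score across the dataset.
    scores = [score(get_completion(s.prompt), s.ideal) for s in samples]
    return sum(scores) / len(scores) if scores else 0.0

def numeric_score(response: str, ideal: str) -> float:
    # Example scoring rule: full credit only if the response parses to
    # the same number as the ideal answer.
    try:
        return 1.0 if float(response.strip()) == float(ideal) else 0.0
    except ValueError:
        return 0.0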

🛠️ Creating Your Own Evaluations

1. Basic Evaluation

  1. Create a dataset in JSONL format:

{"input": [{"role": "user", "content": "What is 2+2?"}], "ideal": "4"}

  2. Create a YAML configuration:

my_math_eval:
  id: my_math_eval.v1
  metrics: [accuracy]
  description: "Basic math evaluation"

my_math_eval.v1:
  class: evals.elsuite.basic.match:Match
  args:
    samples_jsonl: data/basic_evals/math_problems.jsonl

  3. Run the evaluation:

oaieval gpt-3.5-turbo my_math_eval
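
The project's scripts/generate_data.py is meant for producing synthetic data like the sample above. As a rough sketch (the function name and sample count are illustrative, not the script's actual interface), generating Match-compatible samples can look like this:

import json
import random
from pathlib import Path

def generate_math_samples(n: int, path: str) -> None:
    # Write n simple addition problems in the JSONL shape expected by
    # the Match eval: a chat-style "input" plus an "ideal" answer.
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w") as f:
        for _ in range(n):
            a, b = random.randint(1, 50), random.randint(1, 50)
            sample = {
                "input": [{"role": "user", "content": f"What is {a}+{b}?"}],
                "ideal": str(a + b),
            }
            f.write(json.dumps(sample) + "\n")

generate_math_samples(20, "data/basic_evals/math_problems.jsonl")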

2. Model-Graded Evaluation

  1. Create a dataset with ideal responses:

{
  "input": [{"role": "user", "content": "Write a short story about AI"}],
  "ideal": "A creative, coherent short story with clear narrative structure"
}

  2. Create a model-graded specification and YAML config
  3. Run the evaluation with a model grader

📈 Evaluation Metrics

  • Accuracy: Percentage of correct responses
  • Match: Exact string matching
  • Includes: Substring matching
  • Fuzzy Match: Approximate string matching
  • Model-Graded: Custom scoring using LLM judges
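
To make the string-based metrics concrete, here is a rough sketch of how they can be computed; the lower-casing and the 0.8 fuzzy threshold are arbitrary choices, and the framework's own implementations may normalize differently:

from difflib import SequenceMatcher

def match(response: str, ideal: str) -> bool:
    # Match: exact string comparison after trimming whitespace.
    return response.strip() == ideal.strip()

def includes(response: str, ideal: str) -> bool:
    # Includes: the ideal answer appears anywhere in the response.
    return ideal.strip().lower() in response.lower()

def fuzzy_match(response: str, ideal: str, threshold: float = 0.8) -> bool:
    # Fuzzy Match: similarity ratio above a threshold.
    return SequenceMatcher(None, response.lower(), ideal.lower()).ratio() >= threshold

def accuracy(results: list[bool]) -> float:
    # Accuracy: fraction of samples graded correct.
    return sum(results) / len(results) if results else 0.0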

🔧 Advanced Features

Completion Functions

  • Custom completion logic for complex systems
  • Support for multi-step workflows
  • Integration with external tools and APIs
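
A simplified sketch of the idea (not the framework's exact completion-function interface): wrap a multi-step pipeline behind a single callable so an eval can score it like a plain model. The retrieval step and model name are placeholders:

from openai import OpenAI

client = OpenAI()

class RetrievalCompletionFn:
    # Wraps a multi-step pipeline behind one callable so an eval can
    # score it the same way it scores a plain model call.
    def __init__(self, model: str = "gpt-3.5-turbo"):
        self.model = model

    def lookup_context(self, question: str) -> str:
        # Placeholder retrieval step; a real pipeline might call a
        # search index or an external API here.
        return "No additional context found."

    def __call__(self, prompt: str) -> str:
        context = self.lookup_context(prompt)
        result = client.chat.completions.create(
            model=self.model,
            messages=[
                {"role": "system", "content": f"Use this context if helpful: {context}"},
                {"role": "user", "content": prompt},
            ],
        )
        return result.choices[0].message.content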

Evaluation Templates

  • Pre-built templates for common evaluation patterns
  • Extensible framework for custom evaluation logic
  • Support for various input/output formats

Analysis Tools

  • Detailed logging and reporting
  • Performance metrics and visualizations
  • Failure analysis and debugging tools
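
scripts/analyze_results.py handles this in the project. As a rough sketch, summarizing a run can be as simple as walking the JSONL log; the event fields used below ("type" and "correct") are assumptions about the log schema, so adjust them to whatever your eval actually records:

import json
from pathlib import Path

def summarize_log(log_path: str) -> None:
    # Tally pass/fail events from a JSONL evaluation log and print a
    # short report, including a handful of failing samples for debugging.
    correct, failures = 0, []
    for line in Path(log_path).read_text().splitlines():
        event = json.loads(line)
        if event.get("type") == "match":  # assumed event type
            if event["data"].get("correct"):  # assumed field name
                correct += 1
            else:
                failures.append(event["data"])
    total = correct + len(failures)
    if total:
        print(f"Accuracy: {correct / total:.2%} ({correct}/{total})")
    for f in failures[:5]:
        print("Failed sample:", f)

summarize_log("results/logs/latest.jsonl")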

📚 Learning Resources

Notebooks

  • 01_basic_evals.ipynb: Introduction to basic evaluations
  • 02_model_graded_evals.ipynb: Working with model-graded evals
  • 03_custom_evals.ipynb: Creating custom evaluation logic

Examples

  • Math problem evaluation
  • Creative writing assessment
  • Code quality evaluation
  • Question answering validation

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add your evaluation or improvement
  4. Test thoroughly
  5. Submit a pull request

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

⚠️ Important Notes

  • API Costs: Be aware of OpenAI API usage costs when running evaluations
  • Rate Limits: Consider API rate limits for large evaluation sets
  • Data Privacy: Ensure evaluation data complies with your organization's privacy policies
  • Model Versions: Different model versions may produce different results

🔗 Useful Links

  • OpenAI Evals repository: https://github.com/openai/evals

📞 Support

For questions and support:

  • Open an issue in this repository
  • Check the OpenAI Evals GitHub discussions
  • Review the official documentation
