LLM Engineering and Development Certification Program
This repository contains code and materials for Week 3, where we learn to fine-tune large language models and measure improvements over baseline performance.
Goal: Take a base model, fine-tune it using different approaches, and measure improvement.
- Lesson 1: Dataset Selection & Baseline Evaluation
- Lesson 2: Fine-Tuning Frontier LLMs (OpenAI)
- Lesson 3: End-to-End LoRA Fine-Tuning
- Lesson 4: Experiment Tracking & Reproducibility (W&B) (Grid search - in progress)
- Lessons 5-8: Advanced topics (coming soon)
.
βββ code/
β βββ config.yaml # Main configuration file
β βββ paths.py # Centralized path management
β β
β βββ evaluate_baseline.py # Lesson 1: Baseline evaluation
β βββ train_lora.py # Lesson 3: LoRA fine-tuning
β βββ evaluate_lora.py # Lesson 3: Evaluate fine-tuned model
β β
β βββ openai_workflow.py # Lesson 2: OpenAI workflow controller
β βββ openai_workflows/ # Lesson 2: OpenAI fine-tuning scripts
β β βββ prepare_openai_jsonl.py
β β βββ openai_finetune_runner.py
β β βββ evaluate_openai.py
β β
β βββ run_grid_search.py # Lesson 4: Grid search (WIP)
β β
β βββ utils/ # Shared utilities
β βββ config_utils.py # Config loading
β βββ data_utils.py # Dataset loading & preprocessing
β βββ model_utils.py # Model setup & management
β βββ inference_utils.py # Generation & evaluation
β
βββ data/
β βββ datasets/ # Cached HuggingFace datasets
β βββ outputs/ # All evaluation results
β β βββ baseline/ # Lesson 1 results
β β βββ lora_samsum/ # Lesson 3 results
β β βββ openai/ # Lesson 2 results
β βββ experiments/ # OpenAI fine-tuning artifacts
β
βββ requirements.txt # Python dependencies
βββ README.md # This file
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activatepip install -r requirements.txtCreate a .env file in the root directory:
# For OpenAI fine-tuning (Lesson 2)
OPENAI_API_KEY=your_openai_api_key_here
# For Weights & Biases tracking (Lesson 4)
WANDB_API_KEY=your_wandb_api_key_here
# Optional: For Hugging Face model uploads
HF_TOKEN=your_huggingface_token_hereEdit code/config.yaml to customize:
- Base model (default:
meta-llama/Llama-3.2-1B-Instruct) - Dataset (default:
knkarthick/samsum) - Training hyperparameters
- LoRA configuration
Evaluate the base model (no fine-tuning) to establish baseline performance.
cd code
python evaluate_baseline.pyOutput:
- Results saved to
data/outputs/baseline/eval_results.json - Predictions saved to
data/outputs/baseline/predictions.jsonl
Expected ROUGE-1: ~34% (on SAMSum dataset)
Complete workflow for fine-tuning OpenAI models like GPT-4o-mini.
cd code
python openai_workflow.pyThis launches an interactive menu:
- Prepare dataset for fine-tuning
- Run fine-tuning job
- Evaluate base or fine-tuned model
- Exit
Step 1: Prepare Data
python openai_workflows/prepare_openai_jsonl.pyStep 2: Create Fine-Tuning Job
python openai_workflows/openai_finetune_runner.pyThis will:
- Upload training/validation files
- Create fine-tuning job
- Monitor progress until completion
- Save fine-tuned model ID
Step 3: Evaluate Base Model
python openai_workflows/evaluate_openai.py --model gpt-4o-miniStep 4: Evaluate Fine-Tuned Model
python openai_workflows/evaluate_openai.py --model ft:gpt-4o-mini-2024-07-18:your-org:model-name:job-idOutput:
- Results saved to
data/outputs/openai/{model_name}/
Fine-tune Llama using QLoRA (4-bit quantization + LoRA adapters).
cd code
python train_lora.pyWhat happens:
- Loads base model with 4-bit quantization
- Applies LoRA adapters to attention layers
- Fine-tunes on SAMSum dataset
- Logs metrics to Weights & Biases
- Saves adapters to
data/outputs/lora_samsum/lora_adapters/
Training time: ~15-20 minutes on a single GPU (RTX 3090 / A100)
python evaluate_lora.pyOutput:
- Results saved to
data/outputs/lora_samsum/eval_results.json - Predictions saved to
data/outputs/lora_samsum/predictions.jsonl
Expected improvement: ROUGE-1 should increase by ~5-10% over baseline
All configuration is centralized in code/config.yaml:
base_model: meta-llama/Llama-3.2-3B-Instruct # or any HF modeldatasets:
- path: your-org/your-dataset
cache_dir: ../data/datasets
field_map:
input: dialogue # Your input field name
output: summary # Your output field name
type: completionnum_epochs: 3
learning_rate: 2e-4
batch_size: 4
gradient_accumulation_steps: 4lora_r: 8 # Rank (higher = more parameters)
lora_alpha: 16 # Scaling factor
lora_dropout: 0.1
target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]After completing lessons 1-3, compare results:
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L |
|---|---|---|---|
| Baseline (Lesson 1) | ~34% | ~12% | ~27% |
| OpenAI GPT-4o-mini (Lesson 2) | ~41% | ~16% | ~32% |
| Fine-tuned GPT-4o-mini (Lesson 2) | ~53% | ~28% | ~45% |
| Fine-tuned Llama LoRA (Lesson 3) | TBD | TBD | TBD |
Run each lesson to populate your own results!
# Note: This script is not yet verified
python run_grid_search.pyThis will:
- Systematically test different LoRA hyperparameters
- Log all experiments to Weights & Biases
- Save results for comparison
This is an educational repository. Feel free to:
- Open issues for bugs or questions
- Submit PRs for improvements
- Share your fine-tuning results!
This project is licensed under the CC BY-NC-SA 4.0 License - see the LICENSE file for details.
Ready Tensor, Inc.
- Email: contact at readytensor dot com
- Issues & Contributions: Open an issue or pull request on this repository
- Website: Ready Tensor