Skip to content

Week 3 of LLM Engineering Certification: Learn to fine-tune large language models using OpenAI API, QLoRA, and measure performance improvements with baseline evaluation.

License

Notifications You must be signed in to change notification settings

readytensor/rt-llm-eng-cert-week3

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Week 3: Fine-Tuning LLMs

LLM Engineering and Development Certification Program

This repository contains code and materials for Week 3, where we learn to fine-tune large language models and measure improvements over baseline performance.


πŸ“š Week 3 Overview

Goal: Take a base model, fine-tune it using different approaches, and measure improvement.

Lessons Covered

  • Lesson 1: Dataset Selection & Baseline Evaluation
  • Lesson 2: Fine-Tuning Frontier LLMs (OpenAI)
  • Lesson 3: End-to-End LoRA Fine-Tuning
  • Lesson 4: Experiment Tracking & Reproducibility (W&B) (Grid search - in progress)
  • Lessons 5-8: Advanced topics (coming soon)

πŸ—οΈ Repository Structure

.
β”œβ”€β”€ code/
β”‚   β”œβ”€β”€ config.yaml                    # Main configuration file
β”‚   β”œβ”€β”€ paths.py                       # Centralized path management
β”‚   β”‚
β”‚   β”œβ”€β”€ evaluate_baseline.py           # Lesson 1: Baseline evaluation
β”‚   β”œβ”€β”€ train_lora.py                  # Lesson 3: LoRA fine-tuning
β”‚   β”œβ”€β”€ evaluate_lora.py               # Lesson 3: Evaluate fine-tuned model
β”‚   β”‚
β”‚   β”œβ”€β”€ openai_workflow.py             # Lesson 2: OpenAI workflow controller
β”‚   β”œβ”€β”€ openai_workflows/              # Lesson 2: OpenAI fine-tuning scripts
β”‚   β”‚   β”œβ”€β”€ prepare_openai_jsonl.py
β”‚   β”‚   β”œβ”€β”€ openai_finetune_runner.py
β”‚   β”‚   └── evaluate_openai.py
β”‚   β”‚
β”‚   β”œβ”€β”€ run_grid_search.py             # Lesson 4: Grid search (WIP)
β”‚   β”‚
β”‚   └── utils/                         # Shared utilities
β”‚       β”œβ”€β”€ config_utils.py            # Config loading
β”‚       β”œβ”€β”€ data_utils.py              # Dataset loading & preprocessing
β”‚       β”œβ”€β”€ model_utils.py             # Model setup & management
β”‚       └── inference_utils.py         # Generation & evaluation
β”‚
β”œβ”€β”€ data/
β”‚   β”œβ”€β”€ datasets/                      # Cached HuggingFace datasets
β”‚   β”œβ”€β”€ outputs/                       # All evaluation results
β”‚   β”‚   β”œβ”€β”€ baseline/                  # Lesson 1 results
β”‚   β”‚   β”œβ”€β”€ lora_samsum/              # Lesson 3 results
β”‚   β”‚   └── openai/                   # Lesson 2 results
β”‚   └── experiments/                   # OpenAI fine-tuning artifacts
β”‚
β”œβ”€β”€ requirements.txt                   # Python dependencies
└── README.md                          # This file

βš™οΈ Setup

1. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

2. Install Dependencies

pip install -r requirements.txt

3. Configure Environment Variables

Create a .env file in the root directory:

# For OpenAI fine-tuning (Lesson 2)
OPENAI_API_KEY=your_openai_api_key_here

# For Weights & Biases tracking (Lesson 4)
WANDB_API_KEY=your_wandb_api_key_here

# Optional: For Hugging Face model uploads
HF_TOKEN=your_huggingface_token_here

4. Review Configuration

Edit code/config.yaml to customize:

  • Base model (default: meta-llama/Llama-3.2-1B-Instruct)
  • Dataset (default: knkarthick/samsum)
  • Training hyperparameters
  • LoRA configuration

πŸš€ Usage

Lesson 1: Baseline Evaluation

Evaluate the base model (no fine-tuning) to establish baseline performance.

cd code
python evaluate_baseline.py

Output:

  • Results saved to data/outputs/baseline/eval_results.json
  • Predictions saved to data/outputs/baseline/predictions.jsonl

Expected ROUGE-1: ~34% (on SAMSum dataset)


Lesson 2: Fine-Tuning Frontier LLMs (OpenAI)

Complete workflow for fine-tuning OpenAI models like GPT-4o-mini.

Interactive Workflow

cd code
python openai_workflow.py

This launches an interactive menu:

  1. Prepare dataset for fine-tuning
  2. Run fine-tuning job
  3. Evaluate base or fine-tuned model
  4. Exit

Or Run Individual Steps

Step 1: Prepare Data

python openai_workflows/prepare_openai_jsonl.py

Step 2: Create Fine-Tuning Job

python openai_workflows/openai_finetune_runner.py

This will:

  • Upload training/validation files
  • Create fine-tuning job
  • Monitor progress until completion
  • Save fine-tuned model ID

Step 3: Evaluate Base Model

python openai_workflows/evaluate_openai.py --model gpt-4o-mini

Step 4: Evaluate Fine-Tuned Model

python openai_workflows/evaluate_openai.py --model ft:gpt-4o-mini-2024-07-18:your-org:model-name:job-id

Output:

  • Results saved to data/outputs/openai/{model_name}/

Lesson 3: End-to-End LoRA Fine-Tuning

Fine-tune Llama using QLoRA (4-bit quantization + LoRA adapters).

Step 1: Train Model

cd code
python train_lora.py

What happens:

  • Loads base model with 4-bit quantization
  • Applies LoRA adapters to attention layers
  • Fine-tunes on SAMSum dataset
  • Logs metrics to Weights & Biases
  • Saves adapters to data/outputs/lora_samsum/lora_adapters/

Training time: ~15-20 minutes on a single GPU (RTX 3090 / A100)

Step 2: Evaluate Fine-Tuned Model

python evaluate_lora.py

Output:

  • Results saved to data/outputs/lora_samsum/eval_results.json
  • Predictions saved to data/outputs/lora_samsum/predictions.jsonl

Expected improvement: ROUGE-1 should increase by ~5-10% over baseline


πŸ”§ Configuration

All configuration is centralized in code/config.yaml:

Change the Base Model

base_model: meta-llama/Llama-3.2-3B-Instruct # or any HF model

Change the Dataset

datasets:
  - path: your-org/your-dataset
    cache_dir: ../data/datasets
    field_map:
      input: dialogue # Your input field name
      output: summary # Your output field name
    type: completion

Adjust Training Hyperparameters

num_epochs: 3
learning_rate: 2e-4
batch_size: 4
gradient_accumulation_steps: 4

Modify LoRA Configuration

lora_r: 8 # Rank (higher = more parameters)
lora_alpha: 16 # Scaling factor
lora_dropout: 0.1
target_modules: ["q_proj", "v_proj", "k_proj", "o_proj"]

πŸ“Š Results Comparison

After completing lessons 1-3, compare results:

Model ROUGE-1 ROUGE-2 ROUGE-L
Baseline (Lesson 1) ~34% ~12% ~27%
OpenAI GPT-4o-mini (Lesson 2) ~41% ~16% ~32%
Fine-tuned GPT-4o-mini (Lesson 2) ~53% ~28% ~45%
Fine-tuned Llama LoRA (Lesson 3) TBD TBD TBD

Run each lesson to populate your own results!


πŸ§ͺ Lesson 4: Grid Search (Work in Progress)

# Note: This script is not yet verified
python run_grid_search.py

This will:

  • Systematically test different LoRA hyperparameters
  • Log all experiments to Weights & Biases
  • Save results for comparison

🀝 Contributing

This is an educational repository. Feel free to:

  • Open issues for bugs or questions
  • Submit PRs for improvements
  • Share your fine-tuning results!

πŸ“„ License

This project is licensed under the CC BY-NC-SA 4.0 License - see the LICENSE file for details.

Contact

Ready Tensor, Inc.

  • Email: contact at readytensor dot com
  • Issues & Contributions: Open an issue or pull request on this repository
  • Website: Ready Tensor

About

Week 3 of LLM Engineering Certification: Learn to fine-tune large language models using OpenAI API, QLoRA, and measure performance improvements with baseline evaluation.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 3

  •  
  •  
  •