Project: Boosting Trees

Introduction

This project implements a Gradient Boosting Tree Classifier from scratch, focusing on binary classification problems. Gradient Boosting is a powerful ensemble learning technique that builds multiple weak learners (decision stumps in our implementation) sequentially, with each new model correcting the errors of the previous ones. Unlike pre‑existing libraries such as scikit‑learn, our implementation builds the model from first principles, offering a deeper understanding of Gradient Boosting mechanics.

The implementation uses decision stumps (single‑level decision trees) as base learners and optimizes them using gradient descent on the loss function. Our approach specifically addresses binary classification using logistic loss, making it suitable for problems where the goal is to separate data into two distinct classes.

Project Overview

The primary objective of this project is to develop and test a Gradient Boosting Tree Classifier that can handle non‑linear decision boundaries and complex classification problems. Gradient Boosting is particularly valuable when the relationship between features and target variables is non‑linear and complex, as it can capture intricate patterns through the sequential addition of weak learners.

The run_gradient_boosting.py script provides an end‑to‑end workflow that:

Loads a dataset (generated or user‑supplied CSV).
Shuffles and normalizes the data.
Splits the data into training and test sets.
Trains the Gradient Boosting model.
Evaluates performance on training and test sets.
Produces visualizations to assess model fit.

Typical use‑cases include spam detection, customer‑churn prediction, or binary medical diagnosis.

Model Description

The Gradient Boosting Tree Classifier is an ensemble of decision stumps. Each stump makes a simple decision based on a single feature and threshold; the ensemble combines these decisions to create a powerful classifier.

Loss and Gradient

Heads-up: If the equations below appear as plain text in VS Code, install a Markdown extension with MathJax support (e.g. Markdown Preview Enhanced or Markdown+Math).

For binary labels ( y \in {-1,1} ) we minimise the logistic loss

$$ \mathcal{L}(y, F(x)) ;=; \log\Bigl(1 + e^{-y,F(x)}\Bigr) $$

The corresponding negative gradient (our residuals) is

$$ \tilde r ;=; -\frac{\partial \mathcal{L}}{\partial F(x)} ;=; \frac{y}{1 + e^{y,F(x)}}. $$

Algorithm

Initialisation: ( F_0(x) = 0 ).
For ( m = 1, \dots, M ):
1. Compute residuals ( \tilde r^{(m)} ).
2. Fit a decision stump ( h_m(x) ) to the residuals.
3. Update the ensemble
  ( F_m(x) ;=; F_{m-1}(x) + \eta,h_m(x) )
  where ( \eta ) is the learning-rate.
4. If accuracy hasn’t improved for 20 rounds, trigger early stopping.

The DecisionStump class enumerates candidate splits for every feature-threshold pair, chooses the split that minimises squared error of residuals, and predicts a constant value on each side of the split.

Running the Code

Run everything from VS Code’s integrated terminal (Ctrl + `). The commands assume you’re at the project root.

1. Create & activate a virtual environment

python -m venv venv        # create
# Windows
venv\Scripts\activate
# macOS / Linux
source venv/bin/activate

2. Install dependencies

pip install -r requirements.txt

(The list is short: numpy, matplotlib, pytest.)

3. Generate a synthetic dataset (optional)

python generate_binary_data.py --samples 600 --noise 0.25 --visualize

This saves hard_test.csv under data/ and Figure_5.png under Output_Images/.

4. Train & evaluate the model

python run_gradient_boosting.py

You’ll see console metrics, and the five figures in Output_Images/ will refresh.

5. Run the automated test‑suite

pytest tests/test_GradientBoosting.py -v

All seven tests should pass in approximately 0.4 seconds.

Testing the Model

The automated tests in tests/test_GradientBoosting.py verify core functionality, edge‑cases, and baseline accuracy.

Detailed descriptions

test_training – End‑to‑end sanity check: .fit() runs, .predict() returns binary labels with correct shape.
test_accuracy_on_simple_data – Learns a trivial threshold; expects ≥ 80 % accuracy.
test_all_same_class – If all labels are 0, the model should predict only 0s.
test_binary_threshold_behavior – On a clear 0/1 split, accuracy must be ≥ 90 %.
test_random_data_stability – 100 random points with a synthetic rule; accuracy ≥ 85 %.
test_zero_features – Handles an all‑zero feature matrix; maintains output shape, still binary.
test_predict_before_fit – Calling .predict() before .fit() must raise an error.

Summary table

#	Test Name	Focus	Pass Criterion
1	`test_training`	Fit & predict	Correct shape; binary labels
2	`test_accuracy_on_simple_data`	Simple data accuracy	≥ 0.80 accuracy
3	`test_all_same_class`	Single‑class handling	All preds == 0
4	`test_binary_threshold_behavior`	Threshold learning	≥ 0.90 accuracy
5	`test_random_data_stability`	Random‑data robustness	≥ 0.85 accuracy
6	`test_zero_features`	Zero‑matrix input	Shape intact; binary preds
7	`test_predict_before_fit`	API misuse guard	Raises exception

All seven tests pass on Python 3.10 (≈ 0.4 s).

Proof‑of‑pass screenshot

VS Code terminal output after running pytest tests/test_GradientBoosting.py -v.

------|---------| | test_training | End‑to‑end fit + predict sanity check | | test_accuracy_on_simple_data | Learns a linearly‑separable dataset | | test_all_same_class | Handles degenerate single‑class data | | test_binary_threshold_behavior | Correct threshold learning | | test_random_data_stability | Stability on pseudo‑random patterns | | test_zero_features | Graceful behaviour with zero‑vectors | | test_predict_before_fit | Raises error when .predict() called first |

All seven tests currently pass (≈0.4 s on Python 3.10).

Tunable Parameters

Parameter	Default	Effect
`n_estimators`	100	Number of decision stumps. Larger = higher capacity, longer training
`learning_rate`	0.1	Contribution of each stump. Lower values often generalise better
`early_stopping_rounds`	hard‑coded 20	Stop if accuracy plateaus; prevents over‑fitting

Example:

clf = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05)
clf.fit(X_train, y_train)

Potential Issues

Imbalanced classes → consider class‑weights or resampling.
High‑dimensional data → feature‑subsampling or deeper trees recommended.
Noisy labels → risk of over‑fitting; tune learning_rate + early stopping.
Multi‑class → not supported yet; would need one‑vs‑rest or softmax loss.

Output and Visualizations

Typical Console Output

Class 0 count: 249
Class 1 count: 251
Early stopping at round 37 with best accuracy: 0.88

Training Accuracy: 0.88
Test Accuracy:    0.86
Class 0 Accuracy: 0.79
Class 1 Accuracy: 0.94

✅ Matched 86 out of 100 test samples
🔍 Prediction Error: 0.14

Visuals Produced

All figures generated by run_gradient_boosting.py are saved under Output_Images/. The previews below are embedded straight from that folder so you can view them in VS Code or any Markdown renderer as long as the images stay in the same relative location.

1. Decision Boundary — `Figure_1.png`

What it shows: A colour‑coded contour of the classifier’s predictions over the feature space with the training samples overlaid.
Why it matters: Lets you confirm at a glance that the learned boundary is non‑linear and that most points fall on the correct side.

2. Accuracy vs Iteration — `Figure_2.png`

What it shows: Training‑set accuracy after each boosting round.
Why it matters: Reveals convergence speed and highlights any over‑fitting if accuracy plateaus or dips.

3. Train‑set Predictions — `Figure_3.png`

What it shows: Predicted label versus actual label for every training sample.
Why it matters: Points hugging the red y = x line indicate the ensemble fits the training data well.

4. Test‑set Predictions — `Figure_4.png`

What it shows: Predicted vs actual labels on the held‑out test split.
Why it matters: A tight diagonal cluster demonstrates good generalisation to unseen data.

5. Generated Dataset — `Figure_5.png`

What it shows: The raw synthetic dataset coloured by ground‑truth class before any training occurs.
Why it matters: Provides context for class overlap and the intrinsic complexity of the problem.

Question and Answers

1. What does the model do, and when should you choose it?

The Gradient Boosting Tree Classifier incrementally combines a sequence of very simple models (decision stumps) into a single strong learner. Each stump focuses on the mistakes of the current ensemble, so the model gradually “pushes down” the residual error until it captures complex, highly non‑linear patterns.

When it shines

Scenario	Why Boosting Helps
Feature–target relationship is non‑linear	Logistic regression or linear SVM under‑fit; boosting learns curved boundaries without hand‑ crafted features.
Tabular data with modest size (10²‑10⁵ rows)	Decision‑tree ensembles handle numerical and categorical inputs with minimal preprocessing.
Need for interpretable feature importance	You can rank features by their cumulative split‑gain or permutation score.
Imbalanced classes	By adjusting the loss gradient or sampling, boosting keeps learning from minority‑class errors.
Modest training time budget	A few hundred stumps train in seconds‑to‑minutes on a laptop; no GPU required.

When to avoid

Massive (>10⁶ row) datasets where a single pass algorithm (SGD, linear models) is preferred.
Purely image/audio/text tasks where a convolutional or transformer model clearly dominates.

2. **How did we validate that the implementation is correct and useful?**

Seven unit tests (see tests/test_GradientBoosting.py) — cover input validation, edge‑cases, and numeric agreement with ground‑truth labels.
Hold‑out accuracy & confusion matrix — assure generalisation performance; monitored both overall and per‑class accuracy.
Learning‑curve diagnostics — plotted training accuracy each boosting round to verify monotonic improvement and detect over‑fitting early.
Decision‑boundary sanity check — visual alignment between contour regions and data clusters.
Random‑seed reproducibility — deterministic results given a seed, enabling fair A/B comparisons.
Stress tests — extremely small (n<10) and synthetic XOR datasets confirm the stump splitter and gradient update logic.

Together these layers of testing give high confidence that bugs in tree building, gradient calculations, or early‑stopping logic would surface quickly.

3. What hyper‑parameters can you tune, and how do they interact?

Parameter	Typical Range	Practical Effect
`n_estimators`	50 – 500	More stumps = higher capacity; too many → risk of over‑fitting unless you lower `learning_rate`.
`learning_rate`	0.01 – 0.3	Shrinks each stump’s contribution. Small values need more trees but almost always generalise better.
`early_stopping_rounds`	10 – 50	Stops training when the chosen metric hasn’t improved; a safety‑valve for large `n_estimators`.

Tuning recipe

Start with learning_rate = 0.1, n_estimators = 200.
Perform a small grid search on (learning_rate, n_estimators) keeping the product roughly constant (e.g. (0.05, 400), (0.2, 100)).
Enable early_stopping_rounds (≥20) to save compute while preventing over‑fit.
If classes are imbalanced, experiment with sample‑weighting or gradient scaling.

4. What input types still cause headaches, and what could be done?

Challenge	Why it’s hard	Potential Mitigation
Extreme class imbalance (>1:100)	Gradient updates focus on majority‑class error.	Use class weights, focal loss, or SMOTE resampling.
Many (>1 000) features	Stump splitter scans every feature ⇒ O(n_features × n_samples).	Feature subsampling per tree (like Random Forest) or pre‑filter low‑variance features.
Highly noisy labels	Boosting can over‑fit the noise because it keeps forcing down residuals.	Lower `learning_rate`, add stronger `L2` shrinkage, or switch to robust loss (Huber, quantile).
Multi‑class problems (k>2)	Current implementation is binary (logistic loss).	Extend to one‑vs‑rest or softmax loss with vector‑valued residuals.
Streaming / huge datasets	Algorithm is batch; memory grows with dataset.	Implement staged training with mini‑batch gradients or use online‑boosting variants.

While some of these fixes are incremental engineering tasks (feature subsampling, weighted gradients), others (true online boosting, multi‑class softmax) would require non‑trivial extensions to the current codebase.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
Output_Images		Output_Images
data		data
model		model
tests		tests
visualizations		visualizations
.gitignore		.gitignore
README.md		README.md
generate_binary_data.py		generate_binary_data.py
generate_classification_data.py		generate_classification_data.py
requirements.txt		requirements.txt
run_gradient_boosting.py		run_gradient_boosting.py

Comprehenderr/Ml_Project_Boosting_Trees

Folders and files

Latest commit

History

Repository files navigation

Project: Boosting Trees

Table of Contents

Introduction

Project Overview

Model Description

Loss and Gradient

Algorithm

Running the Code

1. Create & activate a virtual environment

2. Install dependencies

3. Generate a synthetic dataset (optional)

4. Train & evaluate the model

5. Run the automated test‑suite

Testing the Model

Detailed descriptions

Summary table

Proof‑of‑pass screenshot

Tunable Parameters

Potential Issues

Output and Visualizations

Typical Console Output

Visuals Produced

1. Decision Boundary — Figure_1.png

2. Accuracy vs Iteration — Figure_2.png

3. Train‑set Predictions — Figure_3.png

4. Test‑set Predictions — Figure_4.png

5. Generated Dataset — Figure_5.png

Question and Answers

1. What does the model do, and when should you choose it?

2. How did we validate that the implementation is correct and useful?

3. What hyper‑parameters can you tune, and how do they interact?

4. What input types still cause headaches, and what could be done?

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0