A machine learning project that classifies Hogwarts students into their houses using logistic regression. Built from scratch without using sklearn's logistic regression.
This project implements a one-vs-all multiclass logistic regression classifier to predict which Hogwarts house a student belongs to based on their course grades.
- Custom statistical analysis tool (replicating pandas `describe()`; see the statistics sketch after this list)
- Data visualization (histogram, scatter plot, pair plot)
- Logistic regression with gradient descent
- One-vs-all classification for 4 houses
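
For reference, the core of a `describe()`-style tool can be written with only the standard library. The sketch below is purely illustrative; the actual helpers in `utils/statistics.py` may be named and organized differently.

```python
import math

def mean(values):
    return sum(values) / len(values)

def std(values):
    # Sample standard deviation (ddof=1), matching what pandas reports.
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def percentile(values, p):
    # Linear interpolation between closest ranks, as numpy/pandas do by default.
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower, upper = math.floor(rank), math.ceil(rank)
    if lower == upper:
        return ordered[lower]
    weight = rank - lower
    return ordered[lower] * (1 - weight) + ordered[upper] * weight

def describe(values):
    return {
        "count": len(values),
        "mean": mean(values),
        "std": std(values),
        "min": min(values),
        "25%": percentile(values, 25),
        "50%": percentile(values, 50),
        "75%": percentile(values, 75),
        "max": max(values),
    }

print(describe([82.0, 91.5, 77.3, 88.8, 95.1]))
```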
```bash
make install

# Run full pipeline
make all

# Or step by step:
make describe   # Show dataset statistics
make train      # Train the model
make predict    # Predict on test data
make plots      # Generate visualizations
```

```
dslr/
├── describe.py              # Statistical analysis tool
├── logreg/
│   ├── logreg_train.py      # Training script
│   └── logreg_predict.py    # Prediction script
├── plots/
│   ├── histogram.py         # Histogram visualization
│   ├── scatter_plot.py      # Scatter plot visualization
│   └── pair_plot.py         # Pair plot visualization
├── utils/
│   ├── preprocessing.py     # Data cleaning, scaling, encoding
│   └── statistics.py        # Custom stats functions
├── datasets/
│   ├── dataset_train.csv    # Training data
│   └── dataset_test.csv     # Test data
├── output/                  # Generated outputs
│   ├── histogram.png
│   ├── scatter_plot.png
│   ├── pair_plot.png
│   └── houses.csv
└── weights.npy              # Trained model (generated)
```
- Histogram: shows the distribution of "Care of Magical Creatures" scores across all four Hogwarts houses. This feature is distributed homogeneously across houses, so it is not useful for distinguishing between them.
- Scatter plot: displays the relationship between Astronomy and Defense Against the Dark Arts. These two features are highly correlated (almost perfectly linear), indicating redundancy.
- Pair plot: a comprehensive view of the relationships between selected features (Astronomy, Charms, Potions, Flying), colored by house. It helps identify which features best separate the classes (a minimal plotting sketch follows this list).
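
As a rough idea of how such a plot can be produced, here is a minimal sketch using pandas and matplotlib. The column names are taken from the dataset headers; the actual `plots/histogram.py` may differ.

```python
import os
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("datasets/dataset_train.csv")
course = "Care of Magical Creatures"

os.makedirs("output", exist_ok=True)
plt.figure()
for house, group in df.groupby("Hogwarts House"):
    # Overlay one semi-transparent histogram per house.
    plt.hist(group[course].dropna(), bins=20, alpha=0.5, label=house)
plt.title(f"{course} by house")
plt.xlabel("Score")
plt.ylabel("Number of students")
plt.legend()
plt.savefig("output/histogram.png")
```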
- Select relevant numeric features
- Fill missing values with column mean
- Normalize features using z-score standardization
- Train 4 binary classifiers (one per house)
- Use sigmoid activation and gradient descent
- Save the learned weights for prediction (see the training sketch after this list)
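
A minimal training sketch under these assumptions is shown below; the feature subset, learning rate, and epoch count are illustrative and may differ from what `logreg/logreg_train.py` actually uses.

```python
import numpy as np
import pandas as pd

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_binary(X, y, lr=0.1, epochs=1000):
    # Batch gradient descent on the cross-entropy loss for one house vs. the rest.
    theta = np.zeros(X.shape[1])
    m = len(y)
    for _ in range(epochs):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        theta -= lr * grad
    return theta

df = pd.read_csv("datasets/dataset_train.csv")
features = ["Astronomy", "Charms", "Potions", "Flying"]  # assumed feature subset

X = df[features].to_numpy(dtype=float)
col_mean = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_mean, X)       # fill missing values with column mean
X = (X - X.mean(axis=0)) / X.std(axis=0)     # z-score standardization
X = np.hstack([np.ones((len(X), 1)), X])     # prepend a bias column

houses = sorted(df["Hogwarts House"].unique())
weights = np.array([
    train_binary(X, (df["Hogwarts House"] == house).to_numpy(dtype=float))
    for house in houses
])
np.save("weights.npy", weights)              # one parameter vector per house
```

Each binary classifier learns to separate one house from the other three; at prediction time the house with the highest sigmoid output wins.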
- Load trained weights
- Compute probability for each house
- Assign the house with the highest probability (see the prediction sketch after this list)
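
A matching prediction sketch, assuming the weight layout produced by the training sketch above; `logreg/logreg_predict.py` may store and reuse the training scaling parameters instead of recomputing them on the test set.

```python
import os
import numpy as np
import pandas as pd

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

df = pd.read_csv("datasets/dataset_test.csv")
features = ["Astronomy", "Charms", "Potions", "Flying"]           # same assumed subset as training
houses = ["Gryffindor", "Hufflepuff", "Ravenclaw", "Slytherin"]   # alphabetical, matching the sorted training order

X = df[features].to_numpy(dtype=float)
col_mean = np.nanmean(X, axis=0)
X = np.where(np.isnan(X), col_mean, X)
X = (X - X.mean(axis=0)) / X.std(axis=0)     # ideally reuse the training mean/std here
X = np.hstack([np.ones((len(X), 1)), X])

weights = np.load("weights.npy")             # shape: (n_houses, n_features + 1)
probs = sigmoid(X @ weights.T)               # one probability per house and student
predicted = [houses[i] for i in probs.argmax(axis=1)]

os.makedirs("output", exist_ok=True)
out = pd.DataFrame({"Index": range(len(predicted)), "Hogwarts House": predicted})
out.to_csv("output/houses.csv", index=False)
```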
After running `make predict`, the file `output/houses.csv` contains:
| Index | Hogwarts House |
|---|---|
| 0 | Ravenclaw |
| 1 | Slytherin |
| ... | ... |
```bash
make clean   # Remove venv and generated files
```

