This repository contains foundational learning materials to help you understand data before training machine learning models.
If we don't understand the data first, our models may become biased, unstable, or misleading, so statistics is the first step.
Statistics is the science of collecting, summarizing, analyzing, and interpreting data.
Example:
If we have students' exam scores:
- Mean → Average score
- Median → Middle score
- Standard Deviation → How spread out the scores are
Statistics helps us find meaning in data.
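A minimal sketch of these three summaries in Python, using the standard-library `statistics` module on a made-up list of exam scores:

```python
import statistics

# Hypothetical exam scores, for illustration only
scores = [55, 62, 67, 70, 71, 74, 78, 81, 85, 97]

mean = statistics.mean(scores)      # average score
median = statistics.median(scores)  # middle score when sorted
stdev = statistics.stdev(scores)    # sample standard deviation (spread)

print(f"Mean:   {mean:.1f}")
print(f"Median: {median:.1f}")
print(f"SD:     {stdev:.1f}")
```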
Machine learning models learn patterns from data.
If the data has:
- Outliers
- Skewed distribution
- Wrong scaling
- Missing values
Then the model will learn wrong patterns.
Statistics helps us:
- Understand center & spread
- Detect outliers
- Identify skewness and long tails
- Scale features & encode categories
- Evaluate model performance correctly
| Topic | Purpose |
|---|---|
| Mean, Median, Mode | Measure center of data |
| Variance & Standard Deviation | Measure spread |
| Percentiles & Quartiles | Understand rank within dataset |
| IQR (Interquartile Range) | Outlier detection |
| Z-Score | Standardization |
| Distribution Shapes | Symmetric vs Skewed vs Long-tail |
Use Median + IQR when the data is skewed or contains outliers.
Use Mean + SD when the data is roughly symmetric and free of extreme values.
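A small sketch of this rule of thumb (NumPy assumed available, data made up): computing both summary pairs on a right-skewed sample shows why Median + IQR is the safer choice when an outlier is present.

```python
import numpy as np

# Made-up right-skewed data: mostly modest values plus one extreme value
data = np.array([28, 30, 31, 33, 35, 36, 38, 40, 42, 250])

mean, sd = data.mean(), data.std(ddof=1)
median = np.median(data)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(f"Mean / SD   : {mean:.1f} / {sd:.1f}   (pulled up by the outlier)")
print(f"Median / IQR: {median:.1f} / {iqr:.1f}  (barely affected)")
```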
- Events, outcomes, sample space
- Conditional probability & independence
- Bayes' Theorem (the foundation for Naive Bayes); see the worked example after this list
- Sensitivity, specificity, false positives/negatives
- Class imbalance problems
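As a worked example tying together Bayes' theorem, sensitivity/specificity, and class imbalance, here is a sketch with made-up numbers: a rare condition (1% prevalence) and a test with 95% sensitivity and 90% specificity. The question is the probability of actually having the condition given a positive test.

```python
# Made-up numbers for illustration only
prevalence  = 0.01   # P(condition)
sensitivity = 0.95   # P(positive | condition)    = true positive rate
specificity = 0.90   # P(negative | no condition) = true negative rate

# P(positive) via the law of total probability
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(condition | positive)
p_condition_given_positive = sensitivity * prevalence / p_positive

print(f"P(positive)             = {p_positive:.4f}")
print(f"P(condition | positive) = {p_condition_given_positive:.4f}")  # ≈ 0.088
```

Because the condition is rare, most positive tests are false positives, so the posterior probability is far lower than the test's sensitivity suggests. This is the same intuition behind class imbalance problems.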
- Compute mean, median, SD, IQR, fences
- Z-score & outlier detection (manual + Python)
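A minimal sketch of both outlier rules on made-up data (NumPy assumed): the IQR fences at Q1 − 1.5·IQR and Q3 + 1.5·IQR, and the |z| > 3 rule.

```python
import numpy as np

values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 60])  # 60 looks suspicious

# IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower_fence) | (values > upper_fence)]

# Z-scores: how many standard deviations each point is from the mean
z = (values - values.mean()) / values.std(ddof=1)
z_outliers = values[np.abs(z) > 3]

print("Fences:      ", lower_fence, upper_fence)
print("IQR outliers:", iqr_outliers)
print("Z outliers:  ", z_outliers)
```

Note that in this sample the single extreme value inflates the SD enough that the |z| > 3 rule misses it, while the IQR fences flag it; this is one reason the IQR rule is preferred for skewed data.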
| Concept | Why It Matters |
|---|---|
| Missing Data Types (MCAR/MAR/MNAR) | Correct imputation |
| Min-Max, Standard & Robust Scaling | Prevents features with larger ranges from dominating |
| One-Hot & Ordinal Encoding | Proper handling of categorical data |
| Distance Metrics | Used in KNN, Clustering, Embeddings |
| Covariance & Correlation | Feature relationship understanding |
| PCA (concept intro) | Dimensionality reduction |
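As a sketch of the scaling and encoding rows in the table above (NumPy assumed; all values and category names are made up), here are the three rescaling formulas applied to one numeric feature, plus a hand-rolled one-hot encoding:

```python
import numpy as np

# One made-up numeric feature containing an outlier
x = np.array([5.0, 7.0, 8.0, 9.0, 10.0, 50.0])

# Min-Max scaling: squashes values into [0, 1]; sensitive to outliers
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling (z-score): zero mean, unit SD; still outlier-sensitive
x_standard = (x - x.mean()) / x.std(ddof=1)

# Robust scaling: centre on the median, divide by the IQR; outlier-resistant
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)

print(x_minmax.round(2))
print(x_standard.round(2))
print(x_robust.round(2))

# One-hot encoding of a made-up categorical feature
colours = ["red", "green", "blue", "green"]
categories = sorted(set(colours))
one_hot = np.array([[1 if c == cat else 0 for cat in categories] for c in colours])
print(categories)
print(one_hot)
```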
- Bayes rule problems
- Confusion matrix: Precision, Recall, F1-Score
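A minimal sketch of the confusion-matrix metrics, using made-up counts for a binary classifier:

```python
# Made-up confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 130

precision = tp / (tp + fp)                                 # of predicted positives, how many are correct
recall    = tp / (tp + fn)                                 # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.2f}")  # 0.80
print(f"Recall:    {recall:.2f}")     # 0.67
print(f"F1-Score:  {f1:.2f}")         # 0.73
```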
A conceptual wrap-up before applying ML algorithms: build intuition rather than memorizing formulas.
We learn to:
- Summarize data
- Detect outliers
- Understand real-world distributions
- Prepare data so that ML models are accurate, robust & explainable