
Statistics for Machine Learning

Overview

This repository began as a collection of notebooks covering statistics topics useful for understanding machine learning methods. It has since grown to span everything from the fundamentals of statistics, linear algebra and data science to building many of the most common machine learning models used in industry from scratch. The repository is split into chapters, each tackling a specific topic. Each chapter consists of several notebooks that dive into the theory behind different areas of machine learning, deriving the relevant equations using KaTeX and implementing the methods programmatically in Python. Every notebook ends with a Further Reading section pointing to useful resources for exploring the topic further.

 

Repository Highlights

k-Means Clustering algorithm written in Python, implementing k-Means++ intelligent centroid spacing. Agglomerative Hierarchical Clustering algorithm written in Python, offering 4 different linkage methods.
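The k-Means++ centroid seeding mentioned above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard k-Means++ algorithm, not the repository's implementation; the function and variable names are my own:

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-Means++ seeding: choose each new centroid with probability
    proportional to its squared distance from the nearest centroid
    already chosen, which spreads the initial centroids apart."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform random point
    for _ in range(k - 1):
        C = np.array(centroids)
        # squared distance from every point to its nearest chosen centroid
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1).min(axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

# Three well-separated blobs; the seeding tends to place one centroid per blob.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.1, size=(20, 2)) for m in (0.0, 5.0, 10.0)])
centroids = kmeans_pp_init(X, k=3, seed=0)
print(centroids)
```

Compared with picking all k centroids uniformly at random, this weighting makes it far less likely that two initial centroids land in the same cluster, which typically speeds convergence and improves the final clustering.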

 

Chapter 1 - Statistics Fundamentals

1.1 - Introduction to Statistics

1.2 - Basic Data Visualisation

1.3 - Probability & Bayes' Theorem

1.4 - Probability Distributions & Expected Values

1.5 - Distributions in Data (Including Log Normal Distributions)

1.6 - Sampling Distributions & Estimators

1.7 - Confidence Intervals & t-Distributions

1.8 - Hypothesis Testing & p-Values

1.9 - Covariance and the Covariance Matrix

1.10 - Pearson's Correlation Coefficient and R Squared

 

Chapter 2 - Machine Learning Fundamentals

2.1 - Introduction to Machine Learning

2.2 - Bias vs Variance Trade-off

2.3 - Model Evaluation Metrics

2.4 - Machine Learning Pipelines

 

Chapter 3 - Supervised Learning: Regression

3.1 - Simple Linear Regression

3.2 - Multiple Regression

3.3 - Regression Trees

3.4 - Random Forests

 

Chapter 4 - Supervised Learning: Classification

4.1 - Logistic Regression

4.2 - k-Nearest Neighbor Classifier

4.3 - Naive Bayes

4.4 - Support Vector Machines

4.5 - Classification Trees

 

Chapter 5 - Unsupervised Learning

5.1 - k-Means Clustering

5.2 - Hierarchical Agglomerative Clustering

5.3 - Association Learning & Market Basket Analysis

5.4 - Principal Component Analysis

 

Chapter 6 - Neural Networks and Deep Learning

6.1 - Multi-Layer Perceptrons

 

Chapter 7 - Natural Language Processing

7.1 - Introduction to Large Language Models

7.2 - The Tokenization Pipeline

 

Future Work

Introduction to Statistics

  • Update reference to Sampling a Distribution & Bessel's Correction

Basic Data Visualisation

  • Add Venn diagrams and time series plots

Sampling a Distribution & Bessel's Correction

  • Describe coefficient of variation
  • An explanation of how to sample data, and what design decisions to make

 


Further Reading

[1] An Introduction to the Science of Statistics: From Theory to Implementation - Preliminary Edition (Joseph C. Watkins)

[2] Introduction to Statistics and Data Analysis - 3rd Edition (Roxy Peck, Chris Olsen, Jay Devore)

[3] An Introduction to Probability and Simulation (Kevin Ross)

[4] The Elements of Statistical Learning - Data Mining, Inference and Prediction, Second Edition (Hastie et al.)

[5] Interpretable Machine Learning - A Guide for Making Black Box Models Explainable, Second Edition (Christoph Molnar)

[6] Introduction to Data Mining - Second Edition (Pang-Ning Tan et al.)
