This repository began as a collection of notebooks covering statistics topics that are useful for understanding machine learning methods. Over time, it has grown to cover everything from the fundamentals of statistics, linear algebra and data science to building many of the machine learning models most commonly used in industry from scratch. The repository is split into chapters, each tackling a specific topic. Each chapter is made up of several notebooks, which dive into the theory behind different areas of machine learning. The notebooks derive the relevant equations using KaTeX and implement the methods in Python. Every notebook ends with a Further Reading section, which points to resources for exploring each topic further.
1.1 - Introduction to Statistics
1.2 - Basic Data Visualisation
1.3 - Probability & Bayes' Theorem
1.4 - Probability Distributions & Expected Values
1.5 - Distributions in Data (Including Log-Normal Distributions)
1.6 - Sampling Distributions & Estimators
1.7 - Confidence Intervals & t-Distributions
1.8 - Hypothesis Testing & p-Values
1.9 - Covariance and the Covariance Matrix
1.10 - Pearson's Correlation Coefficient and R Squared
2.1 - Introduction to Machine Learning
2.2 - Bias vs Variance Trade-off
2.3 - Model Evaluation Metrics
2.4 - Machine Learning Pipelines
3.1 - Simple Linear Regression
3.2 - Multiple Regression
3.3 - Regression Trees
3.4 - Random Forests
4.1 - Logistic Regression
4.2 - k-Nearest Neighbor Classifier
4.3 - Naive Bayes
4.4 - Support Vector Machines
4.5 - Classification Trees
5.1 - K-Means Clustering
5.2 - Hierarchical Agglomerative Clustering
5.3 - Association Learning & Market Basket Analysis
5.4 - Principal Component Analysis
6.1 - Multi-Layer Perceptrons
7.1 - Introduction to Large Language Models
7.2 - The Tokenization Pipeline
- Update reference to Sampling a Distribution & Bessel's Correction
- Add Venn diagrams and time series plots
- Describe coefficient of variation
- An explanation of how to sample data, and what design decisions to make
[2] Introduction to Statistics and Data Analysis - 3rd Edition (Roxy Peck, Chris Olsen, Jay Devore)
[3] An Introduction to Probability and Simulation (Kevin Ross)
[6] Introduction to Data Mining - Second Edition (Pang-Ning Tan et al.)

