This repository contains foundational learning materials to help you understand data before training machine learning models.
If we don't understand the data first, our models may become biased, unstable, or misleading, so statistics is the first step.
Statistics is the science of collecting, summarizing, analyzing, and interpreting data.
Example:
If we have students' exam scores:
- Mean → Average score
- Median → Middle score
- Standard Deviation → How spread out the scores are
Statistics helps us find meaning in data.
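A minimal sketch of these three summaries in Python, using the standard-library `statistics` module on a made-up list of exam scores:

```python
import statistics

# Hypothetical exam scores, for illustration only
scores = [55, 62, 67, 70, 71, 74, 78, 81, 85, 97]

mean = statistics.mean(scores)      # average score
median = statistics.median(scores)  # middle score when sorted
stdev = statistics.stdev(scores)    # sample standard deviation (spread)

print(f"Mean:   {mean:.1f}")
print(f"Median: {median:.1f}")
print(f"SD:     {stdev:.1f}")
```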
Machine learning models learn patterns from data.
If the data has:
- Outliers
- Skewed distribution
- Wrong scaling
- Missing values
Then the model will learn wrong patterns.
Statistics helps us:
- Understand center & spread
- Detect outliers
- Identify skewness and long tails
- Scale features & encode categories
- Evaluate model performance correctly
| Topic | Purpose |
|---|---|
| Mean, Median, Mode | Measure center of data |
| Variance & Standard Deviation | Measure spread |
| Percentiles & Quartiles | Understand rank within dataset |
| IQR (Interquartile Range) | Outlier detection |
| Z-Score | Standardization |
| Distribution Shapes | Symmetric vs Skewed vs Long-tail |
Use Median + IQR when the data is skewed or contains outliers.
Use Mean + SD when the data is roughly symmetric and free of extreme values.
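A small sketch of this rule of thumb (NumPy assumed available, data made up): computing both summary pairs on a right-skewed sample shows why Median + IQR is the safer choice when an outlier is present.

```python
import numpy as np

# Made-up right-skewed data: mostly modest values plus one extreme value
data = np.array([28, 30, 31, 33, 35, 36, 38, 40, 42, 250])

mean, sd = data.mean(), data.std(ddof=1)
median = np.median(data)
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

print(f"Mean / SD   : {mean:.1f} / {sd:.1f}   (pulled up by the outlier)")
print(f"Median / IQR: {median:.1f} / {iqr:.1f}  (barely affected)")
```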
- Events, outcomes, sample space
- Conditional probability & independence
- Bayes' Theorem (the foundation for Naive Bayes); see the worked example after this list
- Sensitivity, specificity, false positives/negatives
- Class imbalance problems
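As a worked example tying together Bayes' theorem, sensitivity/specificity, and class imbalance, here is a sketch with made-up numbers: a rare condition (1% prevalence) and a test with 95% sensitivity and 90% specificity. The question is the probability of actually having the condition given a positive test.

```python
# Made-up numbers for illustration only
prevalence  = 0.01   # P(condition)
sensitivity = 0.95   # P(positive | condition)    = true positive rate
specificity = 0.90   # P(negative | no condition) = true negative rate

# P(positive) via the law of total probability
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)

# Bayes' theorem: P(condition | positive)
p_condition_given_positive = sensitivity * prevalence / p_positive

print(f"P(positive)             = {p_positive:.4f}")
print(f"P(condition | positive) = {p_condition_given_positive:.4f}")  # ≈ 0.088
```

Because the condition is rare, most positive tests are false positives, so the posterior probability is far lower than the test's sensitivity suggests. This is the same intuition behind class imbalance problems.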
- Compute mean, median, SD, IQR, fences
- Z-score & outlier detection (manual + Python)
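A minimal sketch of both outlier rules on made-up data (NumPy assumed): the IQR fences at Q1 − 1.5·IQR and Q3 + 1.5·IQR, and the |z| > 3 rule.

```python
import numpy as np

values = np.array([12, 14, 15, 15, 16, 17, 18, 19, 21, 60])  # 60 looks suspicious

# IQR fences
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower_fence) | (values > upper_fence)]

# Z-scores: how many standard deviations each point is from the mean
z = (values - values.mean()) / values.std(ddof=1)
z_outliers = values[np.abs(z) > 3]

print("Fences:      ", lower_fence, upper_fence)
print("IQR outliers:", iqr_outliers)
print("Z outliers:  ", z_outliers)
```

Note that in this sample the single extreme value inflates the SD enough that the |z| > 3 rule misses it, while the IQR fences flag it; this is one reason the IQR rule is preferred for skewed data.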
| Concept | Why It Matters |
|---|---|
| Missing Data Types (MCAR/MAR/MNAR) | Correct imputation |
| Min-Max, Standard & Robust Scaling | Prevents features with larger ranges from dominating |
| One-Hot & Ordinal Encoding | Proper handling of categorical data |
| Distance Metrics | Used in KNN, Clustering, Embeddings |
| Covariance & Correlation | Feature relationship understanding |
| PCA (concept intro) | Dimensionality reduction |
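As a sketch of the scaling and encoding rows in the table above (NumPy assumed; all values and category names are made up), here are the three rescaling formulas applied to one numeric feature, plus a hand-rolled one-hot encoding:

```python
import numpy as np

# One made-up numeric feature containing an outlier
x = np.array([5.0, 7.0, 8.0, 9.0, 10.0, 50.0])

# Min-Max scaling: squashes values into [0, 1]; sensitive to outliers
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standard scaling (z-score): zero mean, unit SD; still outlier-sensitive
x_standard = (x - x.mean()) / x.std(ddof=1)

# Robust scaling: centre on the median, divide by the IQR; outlier-resistant
q1, q3 = np.percentile(x, [25, 75])
x_robust = (x - np.median(x)) / (q3 - q1)

print(x_minmax.round(2))
print(x_standard.round(2))
print(x_robust.round(2))

# One-hot encoding of a made-up categorical feature
colours = ["red", "green", "blue", "green"]
categories = sorted(set(colours))
one_hot = np.array([[1 if c == cat else 0 for cat in categories] for c in colours])
print(categories)
print(one_hot)
```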
- Bayes rule problems
- Confusion matrix: Precision, Recall, F1-Score
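A minimal sketch of the confusion-matrix metrics, using made-up counts for a binary classifier:

```python
# Made-up confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 130

precision = tp / (tp + fp)                                 # of predicted positives, how many are correct
recall    = tp / (tp + fn)                                 # of actual positives, how many were found
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Precision: {precision:.2f}")  # 0.80
print(f"Recall:    {recall:.2f}")     # 0.67
print(f"F1-Score:  {f1:.2f}")         # 0.73
```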
A conceptual wrap-up before applying ML algorithms: build intuition rather than memorizing formulas.
We learn to:
- Summarize data
- Detect outliers
- Understand real-world distributions
- Prepare data so that ML models are accurate, robust & explainable