This repository contains my drafts, practice notes, and tutorial sheets for reviewing and understanding key concepts in statistics. These notes are a work in progress and are meant to help me (and hopefully others) solidify foundational concepts in statistics. The content is based on online educational courses and university programs I have completed.
Statistics is broadly divided into two main types:
Descriptive statistics are used to summarize, organize, and describe the main features of a dataset. They are often used in exploratory data analysis (EDA), where the focus is on presenting data clearly through summary measures and visualizations.
The goal is to provide an overall summary of the data without making predictions or inferences about the population.
Inferential statistics use sample data to make generalizations, predictions, or inferences about a larger population. This involves hypothesis testing, confidence intervals, regression analysis, and other statistical tests to draw conclusions from the observed data.
EDA belongs under Descriptive Statistics, which focuses on summarizing and exploring the data (e.g., mean, median, variance, and visualizations like Q-Q plots).
HT (Hypothesis Testing) falls under Inferential Statistics, as it involves testing assumptions about the population (e.g., normality tests, comparisons like K-S and KLD).
EDA often sets the foundation by summarizing and exploring the data. HT builds on EDA by using statistical tests to confirm or reject assumptions.
Descriptive: Summarizes, organizes, and describes the main features of a dataset without drawing conclusions about the population.
Inferential: Uses sample data to make generalizations, predictions, or conclusions about a larger population. Includes hypothesis testing and statistical modeling.
These are the foundational concepts and methods used across both Descriptive and Inferential Statistics.
Population: The entire group of individuals or items you want to study.
Target Population: A specific subset of the population you're focusing on.
Available Population: The part of the target population that is accessible for sampling.
Sample: The subset of the available population that is actually selected for your study.
Mean (Central Tendency): The average of the data points. Related measures include:
Median: The middle value of the sorted data.
Mode: The most frequent value.
Trimmed Mean: The average computed after removing a percentage of the highest and lowest values, which reduces the impact of outliers.
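As a small illustration of the measures above, here is a sketch using Python's standard `statistics` module. The dataset is made up, and `trimmed_mean` is a hypothetical helper written for these notes (the standard library does not provide one); it assumes the same proportion is cut from each tail.

```python
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end (hypothetical helper)."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each tail
    return statistics.mean(ordered[k:len(ordered) - k])


# Made-up sample with one large outlier (100).
data = [2, 3, 3, 4, 5, 6, 7, 8, 9, 100]

mean_value = statistics.mean(data)        # pulled upward by the outlier
median_value = statistics.median(data)    # middle of the sorted values
mode_value = statistics.mode(data)        # most frequent value
trimmed_value = trimmed_mean(data, 0.10)  # drops 2 and 100 before averaging

print(mean_value, median_value, mode_value, trimmed_value)
```

With this sample the mean sits far above the median and the trimmed mean, which is the outlier effect the trimmed mean is meant to dampen.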
Variance (Dispersion): Measures how spread out the data is. Related measures include:
Standard Deviation: The square root of the variance, showing data spread in the same units as the data.
Coefficient of Variation: Ratio of the standard deviation to the mean, useful for comparing variability between different datasets.
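The dispersion measures can be sketched with the standard `statistics` module as follows. The data is made up. Note one assumption worth stating: `variance` and `stdev` use the sample formula (dividing by n - 1), while `pvariance` uses the population formula (dividing by n).

```python
import statistics

# Made-up sample; its mean is 6.
data = [4, 8, 6, 5, 3, 7, 9, 6]

sample_variance = statistics.variance(data)       # divides by n - 1
sample_std = statistics.stdev(data)               # square root of the sample variance
population_variance = statistics.pvariance(data)  # divides by n

# Coefficient of variation: spread relative to the mean (unitless).
coefficient_of_variation = sample_std / statistics.mean(data)

print(sample_variance, sample_std, population_variance, coefficient_of_variation)
```

The coefficient of variation is only meaningful when the mean is not zero (and in practice when the data is on a ratio scale).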
Skewness: Measures the asymmetry of the data distribution.
Kurtosis: Measures the "tailedness" of the data distribution (how extreme the outliers are).
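A rough sketch of how these two shape measures can be computed from standardized moments. This assumes the simple population (biased) formulas and reports excess kurtosis (kurtosis minus 3, so a normal distribution scores about 0); statistical libraries may apply bias corrections or a different convention. The function names and data are illustrative only.

```python
import statistics


def skewness(values):
    """Third standardized moment (population formula)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((x - m) / s) ** 3 for x in values) / len(values)


def excess_kurtosis(values):
    """Fourth standardized moment minus 3 (population formula)."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum(((x - m) / s) ** 4 for x in values) / len(values) - 3


symmetric = [1, 2, 3, 4, 5]      # made-up symmetric sample
right_skewed = [1, 1, 1, 2, 10]  # made-up sample with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A positive skewness indicates a longer right tail, a negative one a longer left tail.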
Min: The smallest data point.
Q1: The first quartile (25th percentile).
Median: The middle value (50th percentile).
Q3: The third quartile (75th percentile).
Max: The largest data point.
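These five values can be collected in one step. The sketch below uses `statistics.quantiles` with `method="inclusive"`; that choice is an assumption, since several quartile conventions exist and they can give slightly different Q1 and Q3 for the same data. `five_number_summary` is a hypothetical helper name.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3, max using the 'inclusive' quartile convention."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # made-up sample
summary = five_number_summary(data)
print(summary)
```

The interquartile range (Q3 minus Q1) follows directly from the result and is what a box plot draws as its box.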
Normal (Gaussian): Symmetrical, bell-shaped distribution where most data points are near the mean.
Uniform: All outcomes have equal probability.
Bernoulli: Distribution of a single trial with two possible outcomes (failure/success).
Binomial: Distribution of the number of successes in a fixed number of independent trials, each with a success/failure outcome and the same success probability.
Multinomial: Generalization of the binomial to trials with more than two possible outcomes.
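One way to get a feel for these distributions is to simulate them. The sketch below uses only the standard `random` module; the helper functions, parameters, and seed are all illustrative choices, and the binomial is built as a sum of Bernoulli trials to make the relationship between the two explicit.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable


def bernoulli(p):
    """One trial: 1 (success) with probability p, else 0 (failure)."""
    return 1 if rng.random() < p else 0


def binomial(n, p):
    """Number of successes in n independent Bernoulli(p) trials."""
    return sum(bernoulli(p) for _ in range(n))


def multinomial(n, probs):
    """Counts per category after n independent trials with category probabilities probs."""
    counts = [0] * len(probs)
    for category in rng.choices(range(len(probs)), weights=probs, k=n):
        counts[category] += 1
    return counts


normal_draws = [rng.gauss(0, 1) for _ in range(10_000)]     # mean 0, std 1
uniform_draws = [rng.uniform(0, 1) for _ in range(10_000)]  # equal density on [0, 1]
binomial_draws = [binomial(10, 0.3) for _ in range(2_000)]  # expected value 10 * 0.3
multinomial_counts = multinomial(600, [0.2, 0.3, 0.5])

print(sum(normal_draws) / len(normal_draws))
print(sum(binomial_draws) / len(binomial_draws))
print(multinomial_counts)
```

The sample averages should land close to the theoretical means (0 for the normal draws, 3 for the binomial draws), with the gap shrinking as the number of draws grows.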
Q-Q Plot: A graphical tool to compare two distributions, typically the quantiles of a sample against those of a reference distribution. It visually shows how well the data fits the reference distribution, without providing numerical results.
Kullback-Leibler Divergence (KLD): Measures how one probability distribution differs from another. A value close to zero means the distributions are similar. It is not symmetric, and it is not used to test normality.
Kolmogorov-Smirnov (K-S) Test: Compares sample data to a reference distribution (e.g., normal) to test if the data fits. It provides a p-value to assess the goodness of fit.
Shapiro-Wilk Test: Tests whether a dataset follows a normal distribution. It provides a p-value and applies to a single dataset rather than comparing multiple datasets.
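To make two of these comparisons concrete, here is a minimal sketch of the KL divergence for discrete distributions and of the K-S statistic against a reference CDF. Both functions are hand-written for illustration: the KLD version assumes `q` is nonzero wherever `p` is nonzero, and the K-S version returns only the statistic D (the largest gap between the empirical and reference CDFs), not a p-value, which would normally come from a statistics library.

```python
import math
from statistics import NormalDist


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ks_statistic(sample, cdf):
    """Largest vertical distance between the sample's empirical CDF and `cdf`."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)  # differs from kl_pq: the measure is asymmetric

sample = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]  # made-up data
d_stat = ks_statistic(sample, NormalDist(0, 1).cdf)

print(kl_pq, kl_qp, d_stat)
```

A small D suggests the sample is close to the reference distribution; how small is "small enough" depends on the sample size, which is what the test's p-value accounts for.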
These are methods and tools specifically tied to EDA (Exploratory Data Analysis) or HT (Hypothesis Testing):
Simple Random Sampling: Every member of the population has an equal chance of being selected.
Systematic Sampling: Selecting every n-th item from a list, usually from a random starting point.
Stratified Sampling: Dividing the population into subgroups (strata) and sampling from each subgroup.
Cluster Sampling: Dividing the population into clusters, then randomly selecting entire clusters to study.
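The four sampling schemes above can be sketched with the standard `random` module. Everything here is illustrative: the population is a hypothetical list of 100 IDs, the strata and clusters are arbitrary slices of it, and the helper functions are names made up for these notes.

```python
import random

rng = random.Random(42)           # fixed seed so the run is repeatable
population = list(range(1, 101))  # hypothetical population of 100 IDs


def systematic_sample(items, step, rng):
    """Every step-th item, starting from a random offset within the first step."""
    start = rng.randrange(step)
    return items[start::step]


def stratified_sample(strata, fraction, rng):
    """Draw the same fraction at random from every stratum."""
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every member of the chosen ones."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


simple = rng.sample(population, 10)  # simple random sample, without replacement
systematic = systematic_sample(population, 10, rng)

strata = {"low": population[:60], "high": population[60:]}
stratified = stratified_sample(strata, 0.10, rng)

clusters = {f"c{i}": population[i * 10:(i + 1) * 10] for i in range(10)}
clustered = cluster_sample(clusters, 2, rng)

print(simple, systematic, stratified, clustered, sep="\n")
```

Stratified sampling draws from every subgroup, while cluster sampling keeps everyone from a few randomly chosen groups; the two are easy to confuse because both start by partitioning the population.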
10%: A common guideline for choosing sample size in some studies (can vary depending on context).
Confidence Level, Confidence Interval: A confidence interval is a range of plausible values for a population parameter, computed from the sample. The confidence level (e.g., 95%) is the proportion of such intervals, built the same way from repeated samples, that would contain the true parameter.
Subtopic: Placeholder for other methods related to sample size.
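As a rough sketch of how a confidence interval for a mean can be computed, the example below uses the normal approximation (a z critical value from `statistics.NormalDist`). That is an assumption made to stay within the standard library; for small samples a t-based interval is usually more appropriate and would be somewhat wider. The data and the helper name are made up.

```python
import math
import statistics


def mean_confidence_interval(values, confidence=0.95):
    """Approximate CI for the mean using a normal (z) critical value."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return mean - z * std_err, mean + z * std_err


# Made-up measurements.
data = [12.1, 11.8, 12.4, 12.0, 11.9, 12.3, 12.2, 11.7, 12.5, 12.1]

low_95, high_95 = mean_confidence_interval(data, 0.95)
low_99, high_99 = mean_confidence_interval(data, 0.99)

print((low_95, high_95), (low_99, high_99))
```

Raising the confidence level widens the interval, and increasing the sample size narrows it, because the standard error shrinks with the square root of n.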
Numerical: Data that represents measurable quantities.
Continuous: Data that can take any value in a range (e.g., height, weight).
Discrete: Data that can only take specific values (e.g., number of people).
Categorical: Data that represents categories.
Nominal: Categories without a specific order (e.g., colors).
Ordinal: Categories with a meaningful order but no fixed difference between them (e.g., ratings from 1 to 5).