Introduction to Statistics

This repository contains my drafts, practice notes, and tutorial sheets for reviewing and understanding key concepts in statistics. These notes are a work in progress and are meant to help me (and hopefully others) solidify foundational concepts in statistics. The content is based on online educational courses and university programs I have completed.

Types of Statistics

Statistics is broadly divided into two main types:

1. Descriptive Statistics

Descriptive statistics summarize, organize, and describe the main features of a dataset. They are the core of EDA, presenting data clearly through measures such as the mean, median, and variance, and through visualizations. The goal is to provide an overall summary of the data without making predictions or inferences about the population.

2. Inferential Statistics

Inferential statistics use sample data to make generalizations, predictions, or inferences about a larger population. This involves hypothesis testing, confidence intervals, regression analysis, and other statistical tests to draw conclusions from the observed data.

Types of Analysis:

1. EDA (Exploratory Data Analysis)

EDA belongs under Descriptive Statistics. It focuses on summarizing and exploring the data (e.g., mean, median, variance, and visualizations like Q-Q plots).

2. HT (Hypothesis Testing)

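HT falls under Inferential Statistics, as it involves testing assumptions about the population (e.g., normality tests such as Shapiro and K-S). Measures like KLD are often used alongside these tests to compare distributions, although KLD on its own is not a hypothesis test.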

General Comparison & Overlap

EDA vs. HT:

EDA often sets the foundation by summarizing and exploring the data.
HT builds on EDA by using statistical tests to confirm or reject assumptions.

Descriptive vs. Inferential:

Descriptive: Summarizes, organizes, and describes the main features of a dataset without drawing conclusions about the population.
Inferential: Uses sample data to make generalizations, predictions, or conclusions about a larger population. Includes hypothesis testing and statistical modeling.

1. General Concepts

These are the foundational concepts and methods used across both Descriptive and Inferential Statistics.

Population and Sample

Population: The entire group of individuals or items you want to study.
Target Population: A specific subset of the population you're focusing on.
Available Population: The part of the target population that is accessible for sampling.
Sample: The subset of the available population that is actually selected and measured in your study.

Moments:

Mean (Central Tendency): The first moment, the average of the data points. Related measures of central tendency:
1. Median: Middle value of the sorted data.
2. Mode: Most frequent value.
3. Trimmed Mean: The average after removing a fixed percentage of the highest and lowest values, which reduces the impact of outliers.
Variance (Dispersion): The second central moment, measuring how spread out the data is around the mean. Related measures:
1. Standard Deviation: The square root of the variance, expressing spread in the same units as the data.
2. Coefficient of Variation: Ratio of the standard deviation to the mean, useful for comparing variability between datasets with different units or scales.
Skewness: The third standardized moment, measuring the asymmetry of the data distribution.
Kurtosis: The fourth standardized moment, measuring the "tailedness" of the data distribution (how heavy the tails are, and so how prone the data is to extreme outliers).
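
As a rough sketch of how the measures above can be computed, the snippet below uses only Python's standard library. The sample data and the helper names `trimmed_mean` and `standardized_moment` are made up for illustration, and skewness and kurtosis are computed with the simple population-style formula (libraries such as SciPy offer their own versions with bias corrections).

```python
import statistics


def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the points from each end."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of points dropped per side
    return statistics.fmean(ordered[k:len(ordered) - k])


def standardized_moment(values, order):
    """Population-style standardized central moment of the given order."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return sum(((x - mu) / sigma) ** order for x in values) / len(values)


# Hypothetical sample with one large outlier (50)
data = [2, 3, 3, 4, 5, 6, 7, 8, 9, 50]

mean = statistics.fmean(data)
median = statistics.median(data)
mode = statistics.mode(data)
trimmed = trimmed_mean(data, 0.10)        # drops the smallest and largest value
variance = statistics.variance(data)      # sample variance (divides by n - 1)
std_dev = statistics.stdev(data)          # square root of the sample variance
cv = std_dev / mean                       # coefficient of variation
skewness = standardized_moment(data, 3)   # > 0 means a longer right tail
excess_kurtosis = standardized_moment(data, 4) - 3  # 0 for a normal distribution

print(f"mean={mean:.2f} median={median} mode={mode} trimmed={trimmed:.3f}")
print(f"variance={variance:.2f} std={std_dev:.2f} cv={cv:.2f}")
print(f"skewness={skewness:.2f} excess kurtosis={excess_kurtosis:.2f}")
```

With this data the outlier pulls the mean well above the median, while the trimmed mean stays close to the bulk of the values.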

Five-Number Summary

Min: The smallest data point.
Q1: The first quartile (25th percentile).
Median: The middle value (50th percentile).
Q3: The third quartile (75th percentile).
Max: The largest data point.
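
A minimal sketch of the five-number summary, assuming the "inclusive" quartile convention of Python's `statistics.quantiles`. Other tools use slightly different quartile conventions, so Q1 and Q3 can differ a little between libraries. The function name `five_number_summary` and the data are illustrative only.

```python
import statistics


def five_number_summary(values):
    """Return min, Q1, median, Q3 and max of a list of numbers."""
    # n=4 splits the data into quartiles; "inclusive" treats the data
    # as containing its own minimum and maximum.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": q2,
        "q3": q3,
        "max": max(values),
    }


data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical data
summary = five_number_summary(data)
iqr = summary["q3"] - summary["q1"]  # interquartile range, the box in a box plot

print(summary)
print("IQR:", iqr)
```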

Distributions

Normal (Gaussian): Symmetrical, bell-shaped distribution where most data points are near the mean.
Uniform: All outcomes have equal probability.
Bernoulli: Distribution of a single trial with two possible outcomes (failure/success).
Binomial: Distribution of the number of successes in a fixed number of independent trials, each with the same success probability.
Multinomial: Generalization of the binomial to trials with more than two possible outcomes.
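
To make the relationship between the discrete distributions concrete, here is a small sketch of their probability mass functions written from the textbook formulas. The function names and the parameters (`n = 10`, `p = 0.3`) are arbitrary choices for illustration.

```python
import math


def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def bernoulli_pmf(k, p):
    """A Bernoulli trial is a binomial with a single trial (k is 0 or 1)."""
    return binomial_pmf(k, 1, p)


def multinomial_pmf(counts, probs):
    """P(observing the given counts per category) for category probabilities probs."""
    coeff = math.factorial(sum(counts))
    for c in counts:
        coeff //= math.factorial(c)  # exact: multinomial coefficients are integers
    return coeff * math.prod(p**c for p, c in zip(probs, counts))


n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)                                     # probabilities sum to 1
expected = sum(k * pk for k, pk in enumerate(pmf))   # equals n * p

# With two categories the multinomial reduces to the binomial.
two_category = multinomial_pmf([3, 7], [0.3, 0.7])

print(f"total={total:.6f} expected={expected:.6f}")
print(f"P(X=3) binomial={pmf[3]:.6f} multinomial={two_category:.6f}")
```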

Statistical Comparison Metrics

Q-Q Plot: A graphical tool that plots the quantiles of one distribution against those of another (often sample data against a reference distribution). Points close to a straight line indicate a good fit. It is a visual check and provides no numerical result.
Kullback-Leibler Divergence (KLD): Measures how much one probability distribution differs from another. It is non-negative and asymmetric, and a value close to zero means the distributions are similar. It is not a hypothesis test and yields no p-value, so it is not used to test normality.
Kolmogorov-Smirnov (K-S) Test: Compares sample data to a reference distribution (e.g., normal), or two samples to each other, based on the largest gap between their cumulative distribution functions. It provides a p-value to assess the goodness of fit.
Shapiro Test (Shapiro-Wilk): Tests whether a single dataset follows a normal distribution. It provides a p-value and does not compare multiple datasets.
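
The two numerical comparisons can be sketched by hand with the standard library. This is only an illustration under simplifying assumptions: `kl_divergence` works on two discrete distributions over the same categories, and `ks_statistic` returns only the K-S distance D, not a p-value (a full test, such as `scipy.stats.kstest`, also converts D into a p-value). All names and data here are hypothetical.

```python
import math
from statistics import NormalDist


def kl_divergence(p, q):
    """D_KL(P || Q) in nats for two discrete distributions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def ks_statistic(sample, cdf):
    """One-sample K-S statistic: largest gap between empirical and reference CDF."""
    ordered = sorted(sample)
    n = len(ordered)
    d = 0.0
    for i, x in enumerate(ordered):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


# KLD: similar distributions give a small value, and the order of arguments matters.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

# K-S: compare two artificial samples against a standard normal reference.
reference = NormalDist(mu=0.0, sigma=1.0)
good_fit = [reference.inv_cdf((i + 0.5) / 50) for i in range(50)]  # normal quantiles
shifted = [x + 2.0 for x in good_fit]                              # same shape, wrong mean
d_good = ks_statistic(good_fit, reference.cdf)
d_shifted = ks_statistic(shifted, reference.cdf)

print(f"KL(P||Q)={kl_pq:.4f} KL(Q||P)={kl_qp:.4f}")
print(f"D (good fit)={d_good:.3f} D (shifted)={d_shifted:.3f}")
```

A small D means the empirical distribution stays close to the reference everywhere, while the shifted sample produces a large D.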

2. Specific Concepts

These are methods and tools specifically tied to EDA (Exploratory Data Analysis) or HT (Hypothesis Testing):

Sampling Methods

Simple Random Sampling: Every member of the population has an equal chance of being selected.
Systematic Sampling: Selecting every k-th item from a list, usually after a random starting point.
Stratified Sampling: Dividing the population into subgroups (strata) that share a characteristic and sampling randomly from each subgroup.
Cluster Sampling: Dividing the population into clusters, then randomly selecting entire clusters to study.
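
The four methods can be sketched with Python's `random` module. The population, strata, and clusters below are invented purely for illustration, and the function names are my own. A seeded `random.Random` instance keeps the run reproducible.

```python
import random


def simple_random_sample(population, size, rng):
    """Every member has the same chance of being picked."""
    return rng.sample(population, size)


def systematic_sample(population, step, rng):
    """Pick a random start, then take every `step`-th item."""
    start = rng.randrange(step)
    return population[start::step]


def stratified_sample(strata, fraction, rng):
    """Sample the same fraction from every stratum (dict: label -> members)."""
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, size))
    return sample


def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep all of their members."""
    chosen = rng.sample(list(clusters), n_clusters)
    return [member for name in chosen for member in clusters[name]]


rng = random.Random(42)
population = list(range(100))                                    # hypothetical IDs 0..99
strata = {"A": list(range(0, 60)), "B": list(range(60, 100))}   # two subgroups
clusters = {f"c{i}": list(range(i * 10, i * 10 + 10)) for i in range(10)}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(strata, 0.1, rng)   # 6 from A and 4 from B
clustered = cluster_sample(clusters, 2, rng)       # 2 whole clusters, 20 members

print(sorted(srs))
print(systematic)
print(sorted(stratified))
print(sorted(clustered))
```

Note the difference between the last two: stratified sampling takes a few members from every subgroup, while cluster sampling takes every member from a few subgroups.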

Sample Size

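10 %: A common rule of thumb related to sample size, often stated as keeping the sample at or below about 10% of the population when sampling without replacement, so that observations stay approximately independent (can vary depending on context).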
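Confidence Level, Confidence Interval: A confidence interval is a range of plausible values for a population parameter, computed from the sample. The confidence level (e.g., 95%) is the long-run proportion of such intervals that would contain the true parameter if the sampling were repeated many times.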
Subtopic: Placeholder for other methods related to sample size.
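
As a rough illustration of how the confidence level relates to sample size, the sketch below builds a normal-approximation confidence interval for a mean and estimates a required sample size from the usual formula n = (z * sigma / E)^2. It assumes a reasonably large sample (a t-based interval is more appropriate for small ones), and every name and number in it is hypothetical.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation confidence interval for the population mean."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * stdev(sample) / math.sqrt(len(sample))
    center = fmean(sample)
    return center - margin, center + margin


def required_sample_size(sigma, margin, confidence=0.95):
    """Smallest n so the interval half-width is at most `margin`, given sigma."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)


# Simulate repeated sampling to see how often the interval covers the true mean.
rng = random.Random(0)
true_mean, true_sigma = 50.0, 10.0
trials, hits = 1000, 0
for _ in range(trials):
    sample = [rng.gauss(true_mean, true_sigma) for _ in range(100)]
    low, high = mean_confidence_interval(sample)
    if low <= true_mean <= high:
        hits += 1
coverage = hits / trials  # should land near the 95% confidence level

n_needed = required_sample_size(sigma=10.0, margin=2.0)

print(f"coverage over {trials} repeated samples: {coverage:.3f}")
print(f"sample size for a margin of 2.0: {n_needed}")
```

The simulated coverage is what the confidence level describes: the share of intervals, over many repeated samples, that contain the true parameter.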

Data Types

Numerical: Data that represents measurable quantities.
1. Continuous: Data that can take any value in a range (e.g., height, weight).
2. Discrete: Data that can only take specific values (e.g., number of people).
Categorical: Data that represents categories.
1. Nominal: Categories without a specific order (e.g., colors).
2. Ordinal: Categories with a meaningful order but no fixed difference between them (e.g., ratings from 1 to 5).
