Code for the book Probability and Statistics for Data Science. A free preprint, videos, code, slides and solutions to exercises are available at https://www.ps4ds.net
- Voting by members of the United States House of Representatives Empirical probability, conditional probability, independence, conditional independence
- Random coin flips Sampling
- 3x3 basketball Olympics tournament Monte Carlo method
- Boxing championship Monte Carlo method
- Videogame Monte Carlo method
- Die rolls (real data) Empirical probability mass function
- Fair die rolls Empirical probability mass function
- Durant's free throw streaks Nonparametric and parametric models, geometric distribution, maximum likelihood
- Maximum likelihood estimation for simulated free throws Parametric model, geometric distribution, maximum likelihood
- Phone calls Nonparametric and parametric models, Poisson distribution, maximum likelihood
- Distribution of the empirical-probability estimator Empirical probability, binomial distribution
- Height Cumulative distribution function, quantiles, probability density function, histogram, kernel density estimation, box plot, Gaussian distribution, maximum likelihood estimation, parametric and nonparametric models
- Gross domestic product Cumulative distribution function, quantiles, probability density function, histogram, kernel density estimation, box plot
- Temperatures in Oxford Box plot, quartiles
- Interarrival times of phone calls Kernel density estimation, nonparametric and parametric models, exponential distribution, maximum likelihood
- Simulating an exponential Inverse-transform sampling
- Movie ratings Joint probability mass function, marginal distribution, conditional distribution
- Precipitation in Oregon (and Hawaii and Rhode Island) Joint probability mass function, marginal distribution, conditional distribution, independence, conditional independence
- Curse of dimensionality
- Precipitation time series Markov chains, stationarity
- Car rental Time-homogeneous Markov chains, stationary distribution
- Political affiliation Naive Bayes, classification
- Temperature in Manhattan and Versailles Joint probability density function, marginal distribution, conditional distribution
- More temperature data Joint probability density function, marginal distribution, conditional distribution, conditional independence
- Anthropometric data Joint probability density function, kernel density estimation, Gaussian random vectors, maximum likelihood, parametric and nonparametric models
- 2D kernel density estimation
- Movie duration and earnings Joint probability density function, conditional pdf, independence
- Conditional distribution of Gaussian random variables
- Eigendecomposition analysis of multivariate Gaussian distribution
- Exotic fruit Gaussian random vectors
- Simulating a lake Inverse-transform sampling, dependence between random variables
- Simulating a triangle
- Temperature and precipitation in Mauna Loa Joint distribution of discrete and continuous variables, marginal distributions, conditional distributions, kernel density estimation
- Height and sex Mixture model, Gaussian parametric model, joint distribution of discrete and continuous variables, marginal distributions, conditional distributions
- Height and handedness Joint distribution of discrete and continuous variables, independence, kernel density estimation
- Alzheimer's diagnostics Classification, Gaussian random vectors, Gaussian discriminant analysis, quadratic discriminant analysis, linear discriminant analysis, maximum likelihood, parametric models
- Clustering according to height Gaussian mixture model, expectation maximization algorithm, clustering, unsupervised learning
- Clustering NBA players Gaussian mixture model, expectation maximization algorithm, clustering, unsupervised learning
- Election poll Bayesian parametric modeling, beta distribution, prior and posterior distributions, conjugate prior
- How not to predict an election Bayesian parametric modeling, independence, conditional independence, Monte Carlo method
- NBA salaries Mean, median, outliers
- Movie ratings Sample conditional mean, conditional expectation
- Temperatures in the United States Sample mean, sample variance, sample standard deviation
- Temperature in Manhattan and Versailles Sample conditional mean, regression
- Do private classes improve grades? Causal inference, average treatment effect, confounding factor, adjusting for confounders
- Does title capitalization increase YouTube views? Causal inference, average treatment effect
- Height and NBA stats Correlation coefficient, standardization, explained variance, linear estimation, nonlinear estimation, uncorrelation, independence
- Gaussian random vector Correlation coefficient, explained variance
- Uncorrelation implies independence in Gaussian variables
- Feeding guinea pigs Correlation and causation, confounder, causal inference
- Unemployment in Spain Correlation and causation, confounder, causal inference, linear regression, adjusting for confounding factors
- Height Sample mean, random sampling, law of large numbers, bias, standard error, consistency, Chebyshev bound, convergence in probability, central limit theorem, convergence in distribution
- Gross domestic product Sample mean, random sampling, law of large numbers
- COVID-19 prevalence Sample proportion, random sampling, law of large numbers, bias, standard error, consistency, convergence in probability, central limit theorem, convergence in distribution
- Gambler's paradox Law of large numbers, sample mean
- The law of large numbers does not apply to the Cauchy distribution
- Local economic activity Law of large numbers, consistency of the sample mean, outliers
- Central limit theorem (discrete variables) Central limit theorem, convolution, sum of independent random variables
- Central limit theorem (continuous variables) Central limit theorem, convolution, sum of independent random variables
- Basketball strategy Central limit theorem, Gaussian approximation to the binomial, Monte Carlo method
- Financial crisis Central limit theorem, independence, Monte Carlo method
- Confidence intervals for the sample mean
- Confidence intervals for precipitation Confidence intervals, sample proportion, random sampling
- Bootstrap sample mean The bootstrap, bootstrap standard error, sample mean
- Bootstrap Gaussian confidence intervals
- Bootstrap percentile confidence intervals
- Height and foot length Correlation coefficient, sample correlation coefficient, Gaussian confidence intervals, the bootstrap, bootstrap percentile confidence intervals, Fisher's transformation
- Die rolls Null hypothesis, test statistic, p value, parametric testing, power function
- Antetokounmpo's free throws Null hypothesis, test statistic, p value, two-sample test, z test, one-tailed test, two-tailed test, parametric testing, power function, permutation test, nonparametric testing
- Burger prices Permutation test, nonparametric testing, p value
- Tom Brady and hurricanes Permutation test, nonparametric testing
- Comparing school grades Permutation test, nonparametric testing, median
- Clutch three-point shooting Multiple testing, Bonferroni's correction, p value
- Evaluating NBA players Multiple testing, Bonferroni's correction, p value, permutation test, average treatment effect
- Practical significance vs. statistical significance
- P-hacking P-hacking, publication bias
- Gaussian random vector Mean of a random vector, covariance matrix, directional variance, principal component analysis, spectral theorem
- Canadian cities Sample mean of a vector, sample covariance matrix, principal component analysis, spectral theorem
- Faces Principal component analysis, dimensionality reduction, sample mean of a vector
- Wheat seeds Sample covariance matrix, principal component analysis, dimensionality reduction
- Face classification Principal component analysis, dimensionality reduction, nearest neighbor
- Temperatures in the United States Sample covariance matrix, singular value decomposition, principal component analysis, low-rank model
- Prediction of movie ratings (cartoon example) Low-rank model, singular value decomposition, matrix completion, collaborative filtering, singular-value thresholding, imputation
- Prediction of movie ratings (real data) Low-rank model, singular value decomposition, matrix completion, collaborative filtering, singular-value thresholding, imputation
- Topic modeling Low-rank model, singular value decomposition, nonnegative matrix factorization
- Premature mortality in US counties Linear regression, ordinary least squares, coefficient of determination, explained variance
- Cartoon illustration of overfitting and generalization
- Noise amplification in linear regression 1 Ordinary least squares, ridge regression
- Noise amplification in linear regression 2 Ordinary least squares, ridge regression
- Temperature in Versailles Linear regression, ordinary least squares, ridge regression, sparse regression, lasso, regularization, overfitting
- Sparse regression 1 Sparsity, lasso, regularization
- Sparse regression 2 Sparsity, lasso, regularization
- Height and sex Logistic regression, maximum likelihood, log-likelihood, logistic function
- Alzheimer's diagnostics Classification, logistic regression, evaluation of classification models, calibration
- Estimating wheat varieties via softmax regression
- Digit classification Softmax regression, overfitting, regularization
- Temperature estimation via regression trees
- Temperature estimation via tree ensembles Bagging, random forests, boosting, overfitting
- Log-likelihood of classification tree
- Temperature estimation via neural networks Regression, neural networks, deep learning, overfitting, early stopping
- Estimating wheat varieties via a classification tree
- Estimating wheat varieties via neural networks