Welcome to Economics 524 (424): Prediction and machine-learning in econometrics, taught by Ed Rubin and Connor Lennon.
Lecture: Tuesday and Thursday, 10:00am–11:50am, 105 Peterson Hall
Lab: Friday, 12:00pm–12:50pm, 102 Peterson Hall
Office hours
- Ed Rubin (PLC 519): Thursday (2pm–3pm); Friday (1pm–2pm)
- Connor Lennon (PLC 430): Monday (1pm–2pm)
Books

- R for Data Science
- Introduction to Data Science (requires purchase)
- The Elements of Statistical Learning
Lecture slides

000 - Overview
- Why do we have a class on prediction?
- How is prediction (and how are its tools) different from causal inference?
- Motivating examples
001 - Statistical learning foundations
- Model accuracy
- Loss for regression and classification
- The bias-variance tradeoff
- The Bayes classifier
- KNN
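Not course code, but a minimal KNN sketch in R, assuming the class package and simulated data:

```r
# Classify simulated data with KNN (class package), then compute 0-1 test loss
library(class)
set.seed(524)
n = 100
x = matrix(rnorm(2 * n), ncol = 2)                                # two predictors
y = factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0, "A", "B"))
train = sample(n, 70)                                             # 70/30 split
y_hat = knn(train = x[train, ], test = x[-train, ], cl = y[train], k = 5)
mean(y_hat != y[-train])                                          # test error rate
```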
003 - Resampling methods
- Review
- The validation-set approach
- Leave-one-out cross validation
- k-fold cross validation
- The bootstrap
In-class: Validation-set exercise (Kaggle)
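For concreteness, a minimal k-fold cross-validation sketch in base R (simulated data; k = 5 is an arbitrary choice):

```r
# Minimal 5-fold cross validation for a linear model (base R, simulated data)
set.seed(524)
n = 100
df = data.frame(x = rnorm(n))
df$y = 1 + 2 * df$x + rnorm(n)
k = 5
fold = sample(rep(1:k, length.out = n))           # randomly assign folds
mse = sapply(1:k, function(i) {
  fit = lm(y ~ x, data = df[fold != i, ])         # train on k-1 folds
  pred = predict(fit, newdata = df[fold == i, ])  # predict the held-out fold
  mean((df$y[fold == i] - pred)^2)                # held-out MSE
})
mean(mse)  # CV estimate of test MSE
```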
004 - Linear regression strikes back
- Returning to linear regression
- Model performance and overfit
- Model selection—best subset and stepwise
- Selection criteria
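To illustrate stepwise selection, a minimal sketch using base R's step(), which selects by AIC (mtcars is a stand-in dataset):

```r
# Minimal backward stepwise selection with base R's step()
fit_full = lm(mpg ~ ., data = mtcars)   # start from the full model
fit_step = step(fit_full, direction = "backward", trace = FALSE)
summary(fit_step)                       # the selected model
```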
005 - Shrinkage methods
- Ridge regression
- Lasso
- Elasticnet
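A minimal shrinkage sketch with the glmnet package; the alpha argument moves between ridge and lasso (simulated data):

```r
# alpha = 0 is ridge, alpha = 1 is lasso, values in between are elasticnet
library(glmnet)
set.seed(524)
x = matrix(rnorm(100 * 20), ncol = 20)
y = drop(x[, 1:3] %*% c(3, -2, 1)) + rnorm(100)  # only 3 predictors matter
cv_lasso = cv.glmnet(x, y, alpha = 1)            # lambda chosen by k-fold CV
coef(cv_lasso, s = "lambda.min")                 # coefficients at the CV-best lambda
```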
006 - Classification
- Introduction to classification
- Why not regression?
- But also: Logistic regression
- Assessment: Confusion matrix, assessment criteria, ROC, and AUC
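A minimal logistic-regression-plus-confusion-matrix sketch in base R (simulated data; the 0.5 threshold is arbitrary):

```r
# Fit a logistic regression, classify, and tabulate the confusion matrix
set.seed(524)
n = 200
x = rnorm(n)
y = rbinom(n, 1, plogis(-0.5 + 1.5 * x))  # true model is logistic in x
fit = glm(y ~ x, family = binomial)
p_hat = predict(fit, type = "response")   # fitted probabilities
y_hat = as.integer(p_hat > 0.5)           # classify at a 0.5 threshold
table(predicted = y_hat, actual = y)      # confusion matrix
```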
007 - Decision trees
- Introduction to trees
- Regression trees
- Classification trees—including the Gini index, entropy, and error rate
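To make the tree topics concrete, a minimal rpart sketch (iris is a stand-in dataset):

```r
# Minimal classification tree with rpart (splits on the Gini index by default)
library(rpart)
fit = rpart(Species ~ ., data = iris, method = "class")
printcp(fit)                               # complexity-parameter table (for pruning)
predict(fit, iris[1:5, ], type = "class")  # predicted classes
```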
008 - Ensemble methods
- Introduction
- Bagging
- Random forests
- Boosting
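A minimal random-forest sketch with the randomForest package; bagging is the special case where mtry equals the number of predictors (iris is a stand-in):

```r
# Minimal random forest with the randomForest package
library(randomForest)
set.seed(524)
fit = randomForest(Species ~ ., data = iris, ntree = 500)
fit$confusion  # out-of-bag confusion matrix
```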
009 - Support vector machines
- Hyperplanes and classification
- The maximal margin hyperplane/classifier
- The support vector classifier
- Support vector machines
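A minimal SVM sketch with the e1071 package (simulated data with a nonlinear boundary):

```r
# Minimal SVM with e1071; 'cost' tunes how soft the margin is
library(e1071)
set.seed(524)
df = data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y = factor(ifelse(df$x1^2 + df$x2^2 > 1.5, "out", "in"))  # nonlinear boundary
fit = svm(y ~ x1 + x2, data = df, kernel = "radial", cost = 1)
mean(predict(fit, df) != df$y)  # training error rate
```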
Assignments and projects

Intro: Predicting sales price in housing data (Kaggle)
Help: Kaggle notebooks
001 KNN and loss (Kaggle notebook)
You will need to sign in to your Kaggle account and then hit "Copy and Edit" to add the notebook to your account.
Due 21 January 2020 before midnight.
002 Cross validation and linear regression (Kaggle notebook)
Due 04 February 2020 before midnight.
003 Model selection and shrinkage (Kaggle notebook)
Due 13 February 2020 before midnight.
004 Predicting heart disease (Kaggle competition) | Competition due 20 February 2020 before midnight.
005 Classifying customer churn (Kaggle competition) | Competition due in class, 27 February 2020.
Class project | Due 12 March 2020 before class.
Lab slides

- General "best practices" for coding
- Working with RStudio
- The pipe (%>%)
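For reference, a minimal dplyr sketch of the pipe; %>% passes the left-hand result as the first argument of the next call:

```r
library(dplyr)
# Each %>% passes the result on its left into the function on its right
mtcars %>%
  filter(cyl == 4) %>%              # keep four-cylinder cars
  group_by(gear) %>%                # group by number of gears
  summarize(mean_mpg = mean(mpg))   # average mpg within each group
```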
001 - dplyr and Kaggle notebooks
002 - Cross validation and simulation
- Cross-validation review
- CV and interdependence
- Writing functions
- Introduction to learning via simulation
- Simulation: CV and dependence
Additional: R script for simulation
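The lab's script isn't reproduced here; as a rough stand-in, a minimal base-R simulation sketch (write a function, then replicate it):

```r
# Write a function for one simulation run, then replicate it many times
set.seed(524)
one_run = function(n = 50) {
  x = rnorm(n)
  y = 1 + 2 * x + rnorm(n)
  coef(lm(y ~ x))[2]  # slope estimate from one simulated sample
}
slopes = replicate(1e3, one_run())
c(mean = mean(slopes), sd = sd(slopes))  # summary of the sampling distribution
```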
004 - Data cleaning and workflow with tidymodels
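As a rough illustration of the tidymodels workflow, a minimal recipes sketch on toy data (not the lab's own code):

```r
# Define preprocessing steps in a recipe, then prep and bake them
library(recipes)
df = data.frame(y = rnorm(10), x = rnorm(10), g = factor(rep(c("a", "b"), 5)))
rec = recipe(y ~ ., data = df) %>%
  step_dummy(all_nominal()) %>%                    # factors to dummies
  step_center(all_numeric(), -all_outcomes()) %>%  # center predictors
  step_scale(all_numeric(), -all_outcomes())       # scale predictors
bake(prep(rec, training = df), new_data = df)
```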
005 - Perceptrons and neural nets
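The lab's materials aren't shown here; as a stand-in, a minimal single-hidden-layer network via the nnet package:

```r
# Single-hidden-layer neural network with the nnet package
library(nnet)
set.seed(524)
fit = nnet(Species ~ ., data = iris, size = 4, maxit = 200, trace = FALSE)
mean(predict(fit, iris, type = "class") != iris$Species)  # training error rate
```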
Additional: Data cleaning in R (with caret)
- Converting numeric variables to categorical
- Converting categorical variables to dummies
- Imputing missing values
- Standardizing variables (centering and scaling)
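A sketch of those caret steps on a made-up data frame (preProcess and dummyVars handle the imputation, standardization, and dummying):

```r
# Impute, center, and scale with preProcess; bin a numeric; dummy out factors
library(caret)
df = data.frame(
  x1 = c(1, 2, NA, 4, 5),
  x2 = c(10, 20, 30, 40, 50),
  g  = factor(c("a", "b", "a", "b", "a"))
)
pp = preProcess(df, method = c("medianImpute", "center", "scale"))
df_clean = predict(pp, df)            # numeric columns imputed and standardized
df_clean$x2 = cut(df_clean$x2, 3)     # numeric to categorical (3 bins)
dv = dummyVars(~ ., data = df_clean)  # categorical variables to dummies
predict(dv, df_clean)
```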
Additional resources

- RStudio's recommendations for learning R, plus cheatsheets, books, and tutorials
- YaRrr! The Pirate’s Guide to R (free online)
- UO library resources/workshops
- Eugene R Users
- Workflow diagram: .pdf | .png | .ai
- Python Data Science Handbook by Jake VanderPlas
- Elements of AI
- Caltech professor Yaser Abu-Mostafa: Lectures about machine learning on YouTube
- Geocomputation with R (free online)
- Spatial Data Science (free online)
- Applied Spatial Data Analysis with R