This is a collection of notebooks related to baseball analysis. Some focus on techniques and others on interesting baseball questions. The first several notebooks deal with predicting on-base percentage (OBP). If one metric has held its value from the post-Moneyball era into today's Statcast era, it's OBP. The rest are ideas that came to mind and that I tried out. Some originate from interest in the baseball question, and some exist because I wanted to implement a specific model or technique.
NOTE: Some of these notebooks require very large Statcast files. The code is usually set to load the data from a local file by default, so if that step raises an error, change the corresponding setting to True
and the data will be gathered and saved locally.
Notebook | Description |
---|---|
01_PredictingOBP-ML.ipynb | Predicting end-of-season OBP given early-season data. Focuses on regression and simple ML techniques. |
02_PredictingOBP-EmpericalBayes.ipynb | This adapts code from my empirical Bayes repo to estimate OBP using the same technique used for batting average. Does a shrunken estimate that accounts for plate appearances approximate end-of-season OBP? |
03_PredictingOBP-ARIMA-Forecasting-basic.ipynb | Instead of a classic train/test split using ML, we use a running OBP to forecast out to the end of the season. This notebook uses only past OBP to predict future OBP, without any exogenous variables. |
04__PredictingOBP-ARIMA-Forecasting-addExog.ipynb | Similar to the last notebook but we introduce exogenous variables to facilitate the forecasting. |
05_OBP_to_SLG.ipynb | OPS is a common metric, but it is often criticized because on-base % and slugging % have different denominators, which makes adding them mathematically questionable. My question is: what is the relationship between the two? How much is 1 point of OBP worth compared to 1 point of slugging %? |
06_TheBook-Chapter1.ipynb | This notebook replicates some of the tables in Chapter 1 of "The Book" by Tom Tango et al. The data were mined from Baseball Savant and queried using pandas to build the RE24 table, compute wOBA, and produce other tables. |
07_CricketData.ipynb | A notebook that plays with some cricket data. |
08_EstimatingTrueExitVelocity.ipynb | Given some noisy data, how can we estimate a player's true average exit velocity, especially when the number of plate appearances varies significantly across players? We use linear models, empirical Bayes, and a hierarchical Bayesian model to address this. |
09_PredictingSwingAndMiss.ipynb | This is a simple notebook to see if we can predict a swing and miss from the data. In all honesty, it's not the greatest question to address, but it is still interesting. A more interesting target might be a continuous variable, like exit velocity, or something more fundamental like the "error", which may be a combination of several parameters. For now it is just an excuse to build a multinomial logistic regression model, use a neural network, and try to implement the same multinomial logistic regression model in a Bayesian framework. |
10_PredictingBallInPlay.ipynb | This is similar to the previous notebook, but instead of predicting swing and miss, we are interested in knowing if the ball is put in play. This is also a bit naive, but makes for a straightforward problem. It is also an excuse to build a logistic regression model in PyMC and also to try to build a BART model for the same problem. |
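
For readers unfamiliar with the shrunken estimate mentioned for notebook 02, here is a minimal sketch of beta-binomial empirical Bayes shrinkage. It is not the code from the notebook: the function names `fit_beta_prior` and `shrink_obp` are hypothetical, the player lines are made up, the prior is fit by a simple method of moments, and plate appearances are used as the OBP denominator, which is only an approximation.

```python
import numpy as np


def fit_beta_prior(rates):
    """Method-of-moments fit of a Beta(alpha, beta) prior to observed rates.

    Hypothetical helper for illustration; a real analysis would usually fit
    the prior only on players with a reasonable number of plate appearances.
    """
    mean = np.mean(rates)
    var = np.var(rates, ddof=1)
    # For a Beta distribution: var = mean * (1 - mean) / (alpha + beta + 1).
    total = mean * (1.0 - mean) / var - 1.0
    return mean * total, (1.0 - mean) * total


def shrink_obp(times_on_base, plate_appearances, alpha, beta):
    """Posterior-mean OBP under a Beta prior and a binomial likelihood."""
    times_on_base = np.asarray(times_on_base, dtype=float)
    plate_appearances = np.asarray(plate_appearances, dtype=float)
    return (times_on_base + alpha) / (plate_appearances + alpha + beta)


# Made-up early-season lines (times on base, plate appearances).
# Assumption: plate appearances stand in for the true OBP denominator.
on_base = np.array([12, 40, 25, 33, 8, 30])
pa = np.array([20, 110, 80, 100, 30, 95])

raw = on_base / pa
alpha, beta = fit_beta_prior(raw)
shrunk = shrink_obp(on_base, pa, alpha, beta)

for r, s, n in zip(raw, shrunk, pa):
    print(f"PA={n:3d}  raw OBP={r:.3f}  shrunken OBP={s:.3f}")
```

Players with few plate appearances are pulled strongly toward the prior mean, while players with many plate appearances barely move. That is the sense in which the estimate "accounts for plate appearances".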