Made by Team 5 ("Code Monkeys") for "Application Programming for Engineers" in the ME department of the University of Texas at Austin.
Our baseball statistics are pulled from Lahman's Baseball Database. The statistics are freely available to download and use; a data dictionary can be found here.
These files can be found in the baseball_data/ directory; our investigations primarily used files from the baseball_data/core/ subdirectory, which includes counting stats for batting and fielding, as well as metadata for each player and team, on a year-by-year basis.
The political data was pulled from Wikipedia and stored locally in the political_data/ directory. We used these files to determine the political climate in any given year.
As we began to analyze our data, we realized that the given data was not sufficient: there was no single value we could use to compare players in terms of overall performance. To bridge this gap, we chose to gather WAR and other advanced statistics from another source.
All code for web scraping can be found in the webscraper/ directory, where we crawled through all qualifying pitchers and batters to gather more advanced statistics like WAR (Wins Above Replacement) from Baseball Reference, the leading stats website for baseball data.
The Scraper.py file contains the underlying logic that locates the stat tables, which are hidden inside HTML comments on the web pages.
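Scraper.py is the source of truth for that logic; as a rough sketch of the comment-extraction idea (the function below is our own illustration, not the file's actual API):

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

def scrape_hidden_tables(url: str) -> list[pd.DataFrame]:
    """Collect every stat table on a Baseball Reference page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    tables = list(pd.read_html(StringIO(html)))  # tables in the visible markup
    # Many tables are wrapped in <!-- ... --> blocks, so search the HTML
    # comments too and parse any comment that contains a <table> element.
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        if "<table" in comment:
            tables.extend(pd.read_html(StringIO(str(comment))))
    return tables
```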
The configuration files, batting_ids.csv and pitching_ids.csv, were generated by taking all qualifying players who played after 1899 and grabbing the unique bbrefID values.
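For illustration, that generation step might have looked roughly like this (file and column names follow the stock Lahman layout and are assumptions here; the qualifying-player filter is elided):

```python
import pandas as pd

# In the stock Lahman data, bbrefID lives in the People table; the
# plate-appearance filter for "qualifying" players is omitted for brevity.
batting = pd.read_csv("baseball_data/core/Batting.csv")
people = pd.read_csv("baseball_data/core/People.csv")

qualified = batting.merge(people[["playerID", "bbrefID"]], on="playerID")
ids = (qualified.loc[qualified["yearID"] > 1899, "bbrefID"]
       .dropna()
       .drop_duplicates())
ids.to_csv("webscraper/batting_ids.csv", index=False, header=["bbrefID"])
```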
To run the scraper code, open a terminal in the webscraper/ directory and run both batting_scraper.py and pitcher_scraper.py. These files create a process pool to scrape the pages in parallel. We were concerned that this would trigger some form of anti-scraping or anti-DDoS protection from Baseball Reference, but we were able to scrape all our stats without issue.
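The pool itself is only a few lines; a minimal sketch (scrape_player and the output filename are hypothetical stand-ins for the logic in Scraper.py):

```python
from multiprocessing import Pool

import pandas as pd

from Scraper import scrape_player  # hypothetical per-player entry point

def main() -> None:
    ids = pd.read_csv("batting_ids.csv")["bbrefID"].tolist()
    # Fan the player IDs out across worker processes; each worker fetches
    # and parses one player's page independently.
    with Pool(processes=8) as pool:
        frames = pool.map(scrape_player, ids)  # assume one DataFrame per player
    pd.concat(frames).to_csv("batting_war.csv", index=False)

if __name__ == "__main__":
    main()
```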
The files for our Dash app are hosted in the dash_app/ directory. The app can be run locally by opening a terminal in that directory and running `python3 app.py` (or your OS equivalent).
The data pipeline can be found in the dash_app/data_processing/Dataset.py file. This file does several things:
- Pull in the simple counting stats for batting or pitching
- Compute the minimum plate appearances (batting) or outs pitched (pitching) needed to be noteworthy; these thresholds are based on MLB postseason award cutoffs
- Remove all data prior to 1899, as the first significant period of baseball, the "Dead Ball" era, began around this time
- Merge in player metadata from Players.csv to get each player's full name and some personal information
- Merge in political data for each year for each player
- Generate the "Political Score" for each player
- The political score is calculated by averaging a player's WAR over the years a Republican was in office, then doing the same for the years a Democrat was in office. Taking the difference of these two averages gives a relative performance increase or decrease under each party (see the sketch after this list)
- Scale and normalize numerical data in preparation for machine learning algorithms
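As a concrete sketch of the political-score calculation (the `party` and `WAR` column names are our own illustration of the merged frame, not the exact schema in Dataset.py):

```python
import pandas as pd

def political_score(player_years: pd.DataFrame) -> float:
    """Mean WAR with a Republican in office minus mean WAR with a Democrat.

    `player_years` holds one player's seasons after the political merge;
    the `WAR` and `party` ("R"/"D") column names are assumptions here.
    """
    rep_war = player_years.loc[player_years["party"] == "R", "WAR"].mean()
    dem_war = player_years.loc[player_years["party"] == "D", "WAR"].mean()
    return rep_war - dem_war
```

A positive score suggests the player performed better under Republican administrations; a negative score suggests the opposite.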
We used GMM and K-Means clustering to model our data and to see whether there were relationships between subgroups of players. Additionally, we used KNN to find the most similar players based on Minkowski distance.
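Roughly, the scikit-learn calls look like the sketch below (the cluster count and the random feature matrix are placeholders, not our tuned values):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 8)  # stand-in for the scaled feature matrix

# Hard cluster assignments from K-Means, model-based assignments from GMM.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)

# Nearest neighbors under the Minkowski metric (p=2 is plain Euclidean);
# n_neighbors=6 because a query point comes back as its own nearest neighbor.
knn = NearestNeighbors(n_neighbors=6, metric="minkowski", p=2).fit(X)
distances, indices = knn.kneighbors(X[:1])  # players most similar to player 0
```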
Back in the main dash_app/app.py file, we import the processed data and generate our plots, as seen on the website.
We create interactive widgets to get similar players as well as plot various features against one another.
Next, we generate correlation matrices as well as a comparison of Political Score vs. WAR.
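For example, a correlation heatmap over the numeric features takes only a few lines with Plotly (the input path below is a stand-in for the processed frame produced by Dataset.py):

```python
import pandas as pd
import plotly.express as px

# Hypothetical path; in the app the processed frame comes from Dataset.py.
batters = pd.read_csv("dash_app/data_processing/batters_processed.csv")

corr = batters.select_dtypes("number").corr()
fig = px.imshow(corr, color_continuous_scale="RdBu_r", zmin=-1, zmax=1,
                title="Correlation matrix of batting features")
fig.show()
```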
Finally, we process our scaled data by creating clusters for both pitchers and batters, using both K-Means and Gaussian Mixture Modelling to see if we can generate meaningful clusters. To visualize this data, we perform PCA on the full scaled dataset (with the cluster labels held out) and check whether the resulting groupings pass the "feel test".
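A minimal sketch of that visualization step, with random stand-ins for the scaled features and the cluster labels:

```python
import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA

X = np.random.rand(500, 8)              # stand-in for the scaled features
labels = np.random.randint(0, 4, 500)   # stand-in for the cluster labels

# Project to two principal components (labels are not part of the input),
# then color the projected points by cluster assignment.
coords = PCA(n_components=2).fit_transform(X)
fig = px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels.astype(str),
                 labels={"x": "PC1", "y": "PC2", "color": "cluster"})
fig.show()
```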
To prepare for the production environment, we had to generate several files, such as a Procfile and gunicorn_config.py, to configure the number of workers and the entrypoint of our web app.
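For reference, the gunicorn configuration can be as small as the sketch below (illustrative values, not our exact production settings); the Procfile then points gunicorn at the app's WSGI entrypoint, e.g. `web: gunicorn app:server --config gunicorn_config.py`, assuming app.py exposes `server = app.server` as Dash apps typically do:

```python
# gunicorn_config.py -- illustrative values, not the exact production settings
bind = "0.0.0.0:8085"  # matches the port exposed in the docker run command below
workers = 4            # number of worker processes serving requests
```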
Additionally, we have configured the app so that it can be containerized. If you already have Docker installed, simply open a terminal in the root of this project and run `docker build -t bbdash .`. Then, you can run the website locally in a Docker container by running `docker run -p 8085:8085 bbdash` and opening http://localhost:8085 in any web browser.