Made by Team 5 ("Code Monkeys") for "Application Programming for Engineers" in the ME department of the University of Texas at Austin.
Our baseball statistics are pulled from Lahman's Baseball Database. The statistics are freely available to download and use; a data dictionary can be found here.
These files can be found in the baseball_data/ directory; our investigations primarily used files from the baseball_data/core/ subdirectory, which includes counting stats for batting and fielding, as well as metadata for each player and team, on a year-by-year basis.
The political data was pulled from Wikipedia and stored locally in the political_data/ directory. We used these files to determine the political climate in any given year.
As we began to analyze our data, we realized that the given data was not sufficient: there was no single value we could use to compare players in terms of overall performance. To bridge this gap, we chose to gather WAR and other advanced statistics from another source.
All code for web scraping can be found in the webscraper/ directory, where we crawled through all qualifying pitchers and batters to gather more advanced statistics like WAR (Wins Above Replacement) from Baseball Reference, the leading stats website for baseball data.
The Scraper.py file contains the underlying logic that locates the stat tables, which are hidden inside HTML comments on the web pages.
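Scraper.py is the source of truth for that logic; as a rough sketch of the comment-extraction idea (the function below is our own illustration, not the file's actual API):

```python
from io import StringIO

import pandas as pd
import requests
from bs4 import BeautifulSoup, Comment

def scrape_hidden_tables(url: str) -> list[pd.DataFrame]:
    """Collect every stat table on a Baseball Reference page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    tables = list(pd.read_html(StringIO(html)))  # tables in the visible markup
    # Many tables are wrapped in <!-- ... --> blocks, so search the HTML
    # comments too and parse any comment that contains a <table> element.
    for comment in soup.find_all(string=lambda t: isinstance(t, Comment)):
        if "<table" in comment:
            tables.extend(pd.read_html(StringIO(str(comment))))
    return tables
```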
The configuration files, batting_ids.csv and pitching_ids.csv, were generated by taking all qualifying players who played after 1899 and grabbing the unique bbrefID values.
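For illustration, that generation step might have looked roughly like this (file and column names follow the stock Lahman layout and are assumptions here; the qualifying-player filter is elided):

```python
import pandas as pd

# In the stock Lahman data, bbrefID lives in the People table; the
# plate-appearance filter for "qualifying" players is omitted for brevity.
batting = pd.read_csv("baseball_data/core/Batting.csv")
people = pd.read_csv("baseball_data/core/People.csv")

qualified = batting.merge(people[["playerID", "bbrefID"]], on="playerID")
ids = (qualified.loc[qualified["yearID"] > 1899, "bbrefID"]
       .dropna()
       .drop_duplicates())
ids.to_csv("webscraper/batting_ids.csv", index=False, header=["bbrefID"])
```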
To run the scraper code, open a terminal in the webscraper/ directory and run both batting_scraper.py and pitcher_scraper.py. These files create a process pool to scrape the pages in parallel. We were concerned that this would trigger some form of anti-scraping or anti-DDoS protection from Baseball Reference, but we were able to scrape all our stats without issue.
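The pool itself is only a few lines; a minimal sketch (scrape_player and the output filename are hypothetical stand-ins for the logic in Scraper.py):

```python
from multiprocessing import Pool

import pandas as pd

from Scraper import scrape_player  # hypothetical per-player entry point

def main() -> None:
    ids = pd.read_csv("batting_ids.csv")["bbrefID"].tolist()
    # Fan the player IDs out across worker processes; each worker fetches
    # and parses one player's page independently.
    with Pool(processes=8) as pool:
        frames = pool.map(scrape_player, ids)  # assume one DataFrame per player
    pd.concat(frames).to_csv("batting_war.csv", index=False)

if __name__ == "__main__":
    main()
```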
The files for our Dash app are hosted in the dash_app/ directory. The app can be run locally by opening a terminal in that directory and running `python3 app.py` (or your OS equivalent).
The data pipeline can be found in the dash_app/data_processing/Dataset.py file. This file does several things:
- Pull in the simple counting stats for batting or pitching
- Compute the minimum plate appearances (batting) or outs pitched (pitching) needed to be noteworthy; these thresholds are based on MLB postseason award cutoffs
- Remove all data prior to 1899, as the first significant period of baseball, the "Dead Ball" era, began around this time
- Merge in player metadata from Players.csv to get each player's full name and some personal information
- Merge in political data for each year for each player
- Generate the "Political Score" for each player
- The political score is calculated by averaging a player's WAR over the years a Republican was in office, then doing the same for the years a Democrat was in office. Taking the difference of these two averages gives a relative performance increase or decrease under each party (see the sketch after this list)
- Scale and normalize numerical data in preparation for machine learning algorithms
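As a concrete sketch of the political-score calculation (the `party` and `WAR` column names are our own illustration of the merged frame, not the exact schema in Dataset.py):

```python
import pandas as pd

def political_score(player_years: pd.DataFrame) -> float:
    """Mean WAR with a Republican in office minus mean WAR with a Democrat.

    `player_years` holds one player's seasons after the political merge;
    the `WAR` and `party` ("R"/"D") column names are assumptions here.
    """
    rep_war = player_years.loc[player_years["party"] == "R", "WAR"].mean()
    dem_war = player_years.loc[player_years["party"] == "D", "WAR"].mean()
    return rep_war - dem_war
```

A positive score suggests the player performed better under Republican administrations; a negative score suggests the opposite.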
We used GMM and K-Means clustering to model our data and to see whether there were relationships between subgroups of players. Additionally, we used KNN to find the most similar players based on Minkowski distance.
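Roughly, the scikit-learn calls look like the sketch below (the cluster count and the random feature matrix are placeholders, not our tuned values):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import NearestNeighbors

X = np.random.rand(500, 8)  # stand-in for the scaled feature matrix

# Hard cluster assignments from K-Means, model-based assignments from GMM.
kmeans_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
gmm_labels = GaussianMixture(n_components=4, random_state=0).fit_predict(X)

# Nearest neighbors under the Minkowski metric (p=2 is plain Euclidean);
# n_neighbors=6 because a query point comes back as its own nearest neighbor.
knn = NearestNeighbors(n_neighbors=6, metric="minkowski", p=2).fit(X)
distances, indices = knn.kneighbors(X[:1])  # players most similar to player 0
```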
Back in the main dash_app/app.py file, we import the processed data and generate our plots, as seen on the website.
We create interactive widgets to get similar players as well as plot various features against one another.
Next, we generate correlation matrices as well as a comparison of Political Score vs. WAR.
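For example, a correlation heatmap over the numeric features takes only a few lines with Plotly (the input path below is a stand-in for the processed frame produced by Dataset.py):

```python
import pandas as pd
import plotly.express as px

# Hypothetical path; in the app the processed frame comes from Dataset.py.
batters = pd.read_csv("dash_app/data_processing/batters_processed.csv")

corr = batters.select_dtypes("number").corr()
fig = px.imshow(corr, color_continuous_scale="RdBu_r", zmin=-1, zmax=1,
                title="Correlation matrix of batting features")
fig.show()
```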
Finally, we process our scaled data by creating clusters for both pitchers and batters, using both K-Means and Gaussian Mixture Modelling to see if we can generate meaningful clusters. To visualize this data, we perform PCA on the full scaled dataset (with the cluster labels held out) and check whether the resulting groupings pass the "feel test".
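A minimal sketch of that visualization step, with random stand-ins for the scaled features and the cluster labels:

```python
import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA

X = np.random.rand(500, 8)              # stand-in for the scaled features
labels = np.random.randint(0, 4, 500)   # stand-in for the cluster labels

# Project to two principal components (labels are not part of the input),
# then color the projected points by cluster assignment.
coords = PCA(n_components=2).fit_transform(X)
fig = px.scatter(x=coords[:, 0], y=coords[:, 1], color=labels.astype(str),
                 labels={"x": "PC1", "y": "PC2", "color": "cluster"})
fig.show()
```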
To prepare for the production environment, we had to generate several files, such as a Procfile and gunicorn_config.py, to configure the number of workers and the entrypoint of our web app.
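For reference, the gunicorn configuration can be as small as the sketch below (illustrative values, not our exact production settings); the Procfile then points gunicorn at the app's WSGI entrypoint, e.g. `web: gunicorn app:server --config gunicorn_config.py`, assuming app.py exposes `server = app.server` as Dash apps typically do:

```python
# gunicorn_config.py -- illustrative values, not the exact production settings
bind = "0.0.0.0:8085"  # matches the port exposed in the docker run command below
workers = 4            # number of worker processes serving requests
```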
Additionally, we have configured the app so that it can be containerized. If you already have Docker installed, simply open a terminal in the root of this project and run `docker build -t bbdash .`. Then, you can run the website locally in a Docker container by running `docker run -p 8085:8085 bbdash` and opening http://localhost:8085 in any web browser.