label statcast data as likely to be imputed #71

Merged
merged 2 commits into from
May 28, 2018

Conversation

bdilday
Contributor

@bdilday bdilday commented May 3, 2018

Statcast releases data in which some launch angle and launch speed values have been imputed, but it does not include a flag indicating which ones. This code uses a set of heuristics to estimate which values are imputed: essentially, group by launch angle, launch speed, and event, and label any combination that occurs more than X times as likely imputed. It also provides a function that appends a new imputed column to a data frame. The likely-imputed values are read from a CSV file. A default based on my analysis is included in inst/extdata, but it is extensible if the estimates change or if the user has their own estimates.
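The counting heuristic described above can be sketched roughly as follows. This is a minimal, stdlib-only Python illustration, not the PR's R implementation: the function names, the column names (`launch_angle`, `launch_speed`, `events`), the threshold, and the toy rows are all assumptions made up for the example.

```python
from collections import Counter

def find_likely_imputed(rows, min_count):
    """Return the (launch_angle, launch_speed, events) combinations that
    occur more than min_count times. Hypothetical helper, not the PR's code."""
    counts = Counter(
        (r["launch_angle"], r["launch_speed"], r["events"]) for r in rows
    )
    return {key for key, n in counts.items() if n > min_count}

def label_imputed(rows, imputed_keys):
    """Return copies of the rows with an added boolean 'imputed' column."""
    return [
        {**r, "imputed": (r["launch_angle"], r["launch_speed"], r["events"]) in imputed_keys}
        for r in rows
    ]

# Toy rows invented for illustration; these are not real Statcast values.
repeated = {"launch_angle": 39.3, "launch_speed": 89.2, "events": "field_out"}
rows = [dict(repeated), dict(repeated), dict(repeated),
        {"launch_angle": 12.5, "launch_speed": 101.4, "events": "single"}]

keys = find_likely_imputed(rows, min_count=2)
labeled = label_imputed(rows, keys)
```

In this sketch the combination that repeats three times is flagged and the one-off row is not. The threshold plays the role of the "X" in the description.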

@BillPetti
Owner

BillPetti commented May 22, 2018

Hey Ben, this is really cool.

Can you give more detail on how you determined which combinations of la, lv, event, and bb_type likely contain imputed values? I was also unclear why some of your default values for la and lv are 6 digits.

Also, you are importing an inner_join, but in the function you are using left_join. The parameter for the data also is different in the YAML compared to the function.
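To illustrate why the join type matters when appending a column: a left join keeps every row of the input data, while an inner join drops rows that have no match in the lookup table. The sketch below is a simplified Python stand-in for the dplyr verbs, assuming a single join column and unique keys in the lookup; the helper and column names are hypothetical.

```python
def inner_join(left, right, key):
    """Keep only the left rows whose key appears in right."""
    right_by_key = {r[key]: r for r in right}
    return [{**l, **right_by_key[l[key]]} for l in left if l[key] in right_by_key]

def left_join(left, right, key):
    """Keep every left row; fill the right-hand columns with None when unmatched."""
    right_by_key = {r[key]: r for r in right}
    right_cols = {c for r in right for c in r if c != key}
    filler = {c: None for c in right_cols}
    return [{**l, **right_by_key.get(l[key], filler)} for l in left]

# Toy data: only bin 1 appears in the lookup of likely-imputed values.
data = [{"bin": 1}, {"bin": 2}, {"bin": 3}]
lookup = [{"bin": 1, "imputed": True}]

kept_all = left_join(data, lookup, "bin")
kept_matches = inner_join(data, lookup, "bin")
```

Here `kept_all` has the same number of rows as `data`, which is what a function that appends a column needs, while `kept_matches` retains only the matched row.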

* import left_join from dplyr instead of inner_join

* change parameter name in the label_statcast_imputed_data documentation
  to match the function argument
@bdilday bdilday force-pushed the label-statcast-imputed-data-pr branch from e7abfdc to 0f941e0 Compare May 27, 2018 13:24
@bdilday
Contributor Author

bdilday commented May 27, 2018

> Can you give more detail on how you determined which combinations of
> la, lv, event, and bb_type likely contain imputed values?

Sure. This script shows how I derived the likely-imputed data:
https://gist.github.com/bdilday/c6628735d8ae6e0c1e85bfee78b0e7cc

This is also a good reference; I largely implemented the ideas from it:
https://www.fangraphs.com/tht/43416-2/

> I was also unclear why some of your default values for la and lv are 6 digits.

The launch angle and launch speed are multiplied by 10000 (or whatever you use for the inverse_precision variable) and rounded to an integer, which is why some values have 6 digits. In other words, I'm binning the data in really narrow bins. The idea is that there will be no ambiguity about whether floating point numbers are equal when we do the counting and joining.
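As a rough illustration of that binning step, here is a small Python sketch (the PR itself is R). The `to_bin` helper is hypothetical; only the name `inverse_precision` and the factor 10000 come from the comment above.

```python
def to_bin(value, inverse_precision=10000):
    """Scale a float and round it to the nearest integer so that equality
    checks, counts, and joins operate on exact integer keys."""
    return round(value * inverse_precision)

# Two floats that print alike but are not equal as floating point numbers.
a = 0.1 + 0.2
b = 0.3

floats_equal = (a == b)
bins_equal = (to_bin(a) == to_bin(b))

# A two-digit speed with one decimal place becomes a 6-digit integer key.
speed_key = to_bin(89.2)
```

With the default factor, `a` and `b` compare unequal as floats but fall into the same integer bin, and a value such as 89.2 maps to the 6-digit key 892000.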

> Also, you are importing an inner_join, but in the function you are using left_join.
> The parameter for the data also is different in the YAML compared to the function.

Fixed.

@BillPetti
Owner

Awesome, thanks!
