Rework of FERC to EIA logistic regression model #2276
Conversation
Codecov Report
Base: 86.0% // Head: 85.8% // Decreases project coverage by 0.2%.
Additional details and impacted files:
@@ Coverage Diff @@
## ferc_eia_rl #2276 +/- ##
=============================================
- Coverage 86.0% 85.8% -0.2%
=============================================
Files 74 74
Lines 9334 9237 -97
=============================================
- Hits 8030 7931 -99
- Misses 1304 1306 +2
src/pudl/analysis/ferc1_eia.py
Outdated
so weight the feature vectors. With the sum of the weighted feature vectors for
each model match, this method selects the highest scoring match via
:meth:`calc_best_matches()`.
def remove_murky_matches(match_df, difference_threshold):
Previously, you were choosing a best match based on weighting the feature vectors by their coefficients and finding the interquartile range. Do you remember if there was a reason for using that method to select the best FERC to EIA match? Also, if I understood correctly, the murky matches were just being dropped even if that means a FERC record is left with no match?
Here, I thought I'd experiment a little with a simpler method. For each FERC record, I choose the highest-probability EIA match as long as the difference between its probability and the next-highest probability is over a set threshold. But I'm happy to go back to the interquartile range. It might perform slightly better.
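Roughly, the simpler method looks like the sketch below. This isn't the exact code in the PR; the column names (record_id_ferc1, match_prob) and the helper name are just illustrative.

import pandas as pd

def pick_best_matches(match_df: pd.DataFrame, difference_threshold: float) -> pd.DataFrame:
    # Sort so the most probable EIA candidate comes first within each FERC record.
    ranked = match_df.sort_values(
        ["record_id_ferc1", "match_prob"], ascending=[True, False]
    )
    probs = ranked.groupby("record_id_ferc1")["match_prob"]
    ranked = ranked.assign(
        count=probs.transform("count"),
        # Probability gap between the best and second-best candidate (1.0 if only one candidate).
        diff=probs.transform(lambda p: p.iloc[0] - p.iloc[1] if len(p) > 1 else 1.0),
    )
    # Keep the top candidate per FERC record, but only when it clearly beats the runner-up.
    best = ranked.drop_duplicates(subset="record_id_ferc1", keep="first")
    return best[(best["count"] == 1) | (best["diff"] >= difference_threshold)]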
I think the answer to this also depends on how certain you want matches to be. And that can also be dictated by the thresholds that are set.
@@ -1263,7 +809,7 @@ def _get_subtable(table_name):
    return r_eia_years[r_eia_years.record_id_ferc1.str.contains(f"{table_name}")]
def _get_match_pct(df):
-    return round(len(df[df["record_id_eia"].notna()]) / len(df))
+    return len(df[df["record_id_eia"].notna()]) / len(df)
This round was rounding to either 1 or 0, since the number it's rounding is between 0 and 1. You can get rid of the round entirely, since the logged coverage below is already rounded to .01 in the f-string.
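A toy illustration of what was happening (the numbers here are made up, not from an actual run):

match_pct = 7931 / 9237        # fraction of FERC records with an EIA match, about 0.86
round(match_pct)               # rounds to the nearest whole number: 1 here, 0 for anything below 0.5
f"coverage: {match_pct:.1%}"   # the f-string already rounds for display: 'coverage: 85.9%'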
Leaving this comment here since I can't comment on the actual line itself.
In the below coverage stats, the coverage for each of the FERC tables is printing the coverage percentage for all years of EIA data, not just the subset that was included in the pudl_out
object that generated the plant parts list for this model run. So when the CI runs, the coverage numbers are going to be low because they were only ever looking at 5 years of data. But I guess the numbers are accurate, because with 5 years of data the coverage across all EIA working years should be low. It's just not a very helpful stat, but maybe it doesn't really matter.
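If it ever does matter, one option would be to filter to the run's report years before computing coverage. A hypothetical tweak (it assumes a report_year column on the dataframe, which may not match the actual code):

def _get_match_pct(df, report_years=None):
    # Optionally restrict coverage to the EIA years actually included in this run.
    if report_years is not None:
        df = df[df["report_year"].isin(report_years)]
    return len(df[df["record_id_eia"].notna()]) / len(df)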
oh good! i noticed that these outputs were all 0% when i moved it over and it was wigging me out but i hadn't investigated why yet.
this looks awesome! It'd be great if you could add some arg docs but honestly besides that it looks great. i appreciate your actual knowledge of sklearn
🤣
src/pudl/analysis/ferc1_eia.py
Outdated
match_df = match_df[
    (match_df["count"] == 1) | (match_df["diff"] >= difference_threshold)
]
the thing that this doesn't do, which i was previously doing, is communicating how many of the best matches are outside of the difference_threshold. iirc I used the murk and ties very early on to identify the need to generate the true_gran flag to remove those records that were basically the same, that the model couldn't distinguish. So in retrospect I'm not totally sure this is needed anymore... but the last time I ran the model I was still getting ~6% of records that we effectively weren't choosing as matches because they didn't meet this difference threshold.
best match vs ferc: 65.46%
best match vs matches: 91.10%
murk vs matches: 5.52%
ties vs matches: 1.17%
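If we keep the threshold, one cheap way to preserve that visibility would be to log the share of best matches that fall below it before they get filtered out. A sketch, assuming the count and diff columns from the snippet above (names are illustrative):

import logging

logger = logging.getLogger(__name__)

def log_murky_share(match_df, difference_threshold):
    # Best matches whose probability gap to the runner-up is below the threshold.
    murky = match_df[(match_df["count"] > 1) & (match_df["diff"] < difference_threshold)]
    logger.info(
        f"murky matches dropped: {len(murky)} ({len(murky) / len(match_df):.1%} of best matches)"
    )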
Ah okay I didn't get what these stats meant before.
I added "Percent of predicted matches that have a best match" and "Percent of matches not meeting match probability difference threshold".
@cmgosnell I added docstrings, got the CI to pass, and put in stats about the number of murky matches that are thrown out. So I think this PR is pretty much ready to go. Two little lingering questions:
In the original model, "murky" matches were EIA records that were not distinct enough from the next best EIA match. In this variation, "murky" matches are EIA records whose probability of being a match is not a big enough distinction from the next best match, i.e. a 91% probability of a match versus a 90% probability. When we're looking at differences in probability, not differences in the actual feature vectors, does it still make sense to throw out "murky" matches? Should I go back to looking at the difference in the distinctness of the feature vectors, since that aligns more with the original intention of "murkiness"?
@katie-lamb ty for all of this!! I don't have straightforward answers to your questions 🙃
coverage
I thiiiink the coverage logging is okay(-ish)! The input to it is … I could very much be wrong about that, and I'm a little surprised that the steam table's matches are so low in this pr's action logs:
murk
I think you're right that this new version of murk is less like the old version of murk. I was led down the path of determining the distinctiveness of the #1 and #2 winning-iest match because it felt like the only way to know whether there were two records that were too similar in the model's eyes. I'm not sure if this is the right way to think about it at all. I could certainly be persuaded that we should just always take the #1 most probable result. In my last run of the rl before these changes we were dropping ~6% of matches bc they were murky wins, but it looks like your version of murk drops way more:
Yep, I should have been more clear about my question. To use an example, if …
I think when trying to understand how the model was thinking about similar matches and distinguishing between records, we run the risk of oversimplifying it. Certainly logistic regression is fairly intuitive, but I still think we should use a metric that the model provides. It seems more defensible, and ultimately the model is making a decision about what is a match and what is not, so we shouldn't add that many layers of our own decision making on top of that. I could be persuaded to get rid of the concept of "murky matches" entirely as well and just use the highest scoring match.
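Concretely, the metric I mean is the probability the classifier itself assigns to each candidate pair. A rough sketch of how that could drive the selection (the function, column, and feature names here are placeholders, not the actual PR code):

from sklearn.linear_model import LogisticRegression

def score_candidates(train_features, train_labels, candidate_df, feature_cols):
    # Fit on the labeled FERC-EIA training pairs, then score every candidate pair
    # with the model's own probability of being a true match.
    model = LogisticRegression(max_iter=1000).fit(train_features, train_labels)
    scored = candidate_df.copy()
    scored["match_prob"] = model.predict_proba(scored[feature_cols])[:, 1]
    # Highest-probability EIA candidate per FERC record.
    return scored.sort_values("match_prob", ascending=False).drop_duplicates("record_id_ferc1")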
This percentage of murky matches being dropped is really high because the CI is currently only running on 2 years of data (2020 and 2021; I left a comment about this in your original PR). I think this results in the model overfitting, and I don't think these numbers are a good representation because the data is so limited with only 2 years. When I run it on 2015 - 2020 it drops 10% of matches, and when I bump the score threshold down to …
Update update:
Stats on 5 years of data without the murky matches removed (seems pretty much the same):
These are the coverage stats on 5 years of data:
I think the only thing left to do is work out these merge conflicts so we can merge this into your branch. I think I left a comment or two on your original branch that we should take a look at before merging into …
heck yeah! This looks great and is very simplifying.
this totally was the case! but we fixed it in #2238. once you get the conflicts untangled, which hopefully won't be too bad, I'll give it another quick looksee and smash that approve button
wahoo!
PR Overview
There are still some fixes that need to be made (see the checklist below), but feel free to take a look since there are a bunch of changes.
The main benefit of these changes is the speed increase. On my machine it takes me 25 seconds to run all of execute. But almost all of this speed increase comes from replacing ModelTuner with GridSearchCV. The other stuff that I rewrote (the MatchManager functions to override best matches and remove murky matches) is maybe just a nicer way to do things. If it's too much change, then we can just use the GridSearchCV model change.
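For anyone skimming, the ModelTuner-to-GridSearchCV swap boils down to something like the sketch below. The parameter grid, scoring choice, and variable names are illustrative, not necessarily what this PR settles on:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def tune_model(train_features, train_labels):
    # Cross-validated grid search over a few logistic regression hyperparameters.
    param_grid = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid,
        scoring="f1",  # pick the combination that maximizes F-score on held-out folds
        cv=5,
    )
    search.fit(train_features, train_labels)
    return search.best_estimator_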
On 5 years of data (2015 - 2020) I get the following stats from the best model on the validation set:
Accuracy: 0.88
F-Score: 0.92
Precision: 0.9
Recall: 0.94
Percent of training data matches correctly predicted: 0.86
Percent of training data overwritten in matches: 0.083
When I tried to run the original model (in Christina's branch) for comparison, the recordlinkage model was going too crazy and crashed my computer LOL so I need to go back and run that. For some reason that model won't work for me and I can't get this logger to silence itself so it's hard to debug.
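In case it helps with that debugging: noisy third-party loggers can usually be quieted by name through the standard logging module. This assumes the chatty logger is registered under the recordlinkage package name, which I haven't verified:

import logging

# Suppress everything below WARNING from the recordlinkage package's loggers.
logging.getLogger("recordlinkage").setLevel(logging.WARNING)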
Places That Changed Significantly
I changed the way that one-to-many FERC to EIA matches are reduced to one match. See my comments around the remove_murky_matches functions for more specifics. I wasn't sure if there were very intentional reasons why you removed murky matches in the way that you did or if this could be done in a simpler way. I think this also depends on the degree of certainty required for the matches.
Remaining To Do
A MatchManager function might be relevant, but maybe it's simple enough to have as three separate functions.
Merge dev into ferc_eia_rl and then subsequently into ferc-eia-rl-katie-rework.
Get rid of the round.