
Rework of FERC to EIA logistic regression model #2276

Merged
merged 13 commits into from
Feb 11, 2023

Conversation

katie-lamb
Member

@katie-lamb katie-lamb commented Feb 3, 2023

PR Overview

There are still some fixes that need to be made (see below check list) but feel free to take a look since there are a bunch of changes.

The main benefit of these changes is the speed increase. On my machine it takes 25 seconds to run all of execute. Almost all of this speed increase comes from replacing ModelTuner with GridSearchCV. The other stuff that I rewrote (MatchManager functions to override best matches and remove murky matches) is maybe just a nicer way to do things. If it's too much change, then we can just use the GridSearchCV model change.
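For context, here is a minimal sketch of the kind of GridSearchCV setup being described. The feature data, parameter grid, and scoring choice below are placeholders for illustration, not the PR's actual configuration:

```python
# Hypothetical sketch of replacing a hand-rolled tuner with sklearn's GridSearchCV.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # stand-in for FERC-EIA feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in match/no-match labels

param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l2"]}
search = GridSearchCV(
    LogisticRegression(solver="lbfgs", max_iter=1000),
    param_grid,
    scoring="f1",  # the PR reports F-score, precision, and recall
    cv=5,
)
search.fit(X, y)
best_model = search.best_estimator_
```

GridSearchCV handles the cross-validated parameter sweep and refits the best estimator automatically, which is where the simplification over a custom tuner comes from.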

On 5 years of data (2015 - 2020) I get the following stats from the best model on the validation set:
Accuracy: 0.88
F-Score: 0.92
Precision: 0.9
Recall: 0.94

Percent of training data matches correctly predicted: 0.86
Percent of training data overwritten in matches: 0.083

When I tried to run the original model (in Christina's branch) for comparison the recordlinkage model was going too crazy and crashed my computer LOL so I need to go back and run that. For some reason that model won't work for me and I can't get this logger to silence itself so it's hard to debug.

Places That Changed Significantly

I changed the way that one-to-many FERC to EIA matches are reduced to one match. See my comments around the remove_murky_matches functions for more specifics. I wasn't sure if there were very intentional reasons why you removed murky matches in the way that you did or if this could be done in a simpler way. I think this also depends on the degree of certainty required for the matches.

Remaining To Do

  • If you think it's appropriate, there could probably be more modularization of the new functions I wrote to put them into nice classes. In particular, a MatchManager function might be relevant. But maybe it's simple enough to have as three separate functions.
  • The pre-commit isn't passing. But I think this is a known issue. Probably need to merge dev into ferc_eia_rl and then subsequently into ferc-eia-rl-katie-rework.
  • When I run on 5 years of data, the metrics on the coverage of data for each FERC plant type are all 0%. This is messed up. Or was it always messed up? Disregard: I fixed this by taking out the round.
  • Docstrings are basically non-existent. Lots more docstrings and comments to be added.

PR Checklist

  • Merge the most recent version of the branch you are merging into (probably dev).
  • All CI checks are passing. Run tests locally to debug failures
  • Make sure you've included good docstrings.
  • For major data coverage & analysis changes, run data validation tests
  • Include unit tests for new functions and classes.
  • Defensive data quality/sanity checks in analyses & data processing functions.
  • Update the release notes and reference the PR and related issues.
  • Do your own explanatory review of the PR to help the reviewer understand what's going on and identify issues preemptively.

@katie-lamb katie-lamb marked this pull request as draft February 3, 2023 23:59
@codecov

codecov bot commented Feb 7, 2023

Codecov Report

Base: 86.0% // Head: 85.8% // Decreases project coverage by -0.2% ⚠️

Coverage data is based on head (39ce143) compared to base (1badd49).
Patch coverage: 100.0% of modified lines in pull request are covered.

Additional details and impacted files
@@              Coverage Diff              @@
##           ferc_eia_rl   #2276     +/-   ##
=============================================
- Coverage         86.0%   85.8%   -0.2%     
=============================================
  Files               74      74             
  Lines             9334    9237     -97     
=============================================
- Hits              8030    7931     -99     
- Misses            1304    1306      +2     
Impacted Files Coverage Δ
src/pudl/analysis/ferc1_eia.py 92.1% <100.0%> (-3.1%) ⬇️



    so weight the feature vectors. With the sum of the weighted feature vectors for
    each model match, this method selects the highest scoring match via
    :meth:`calc_best_matches()`.
def remove_murky_matches(match_df, difference_threshold):
Member Author

@katie-lamb katie-lamb Feb 7, 2023


Previously, you were choosing a best match based on weighting the feature vectors by their coefficients and finding the interquartile range. Do you remember if there was a reason to do this method for selecting the best FERC to EIA match? Also, if I understood correctly, the murky matches were just being dropped even if it means that a FERC record is left with no match?

Here, I thought I'd experiment a little with a simpler method. For each FERC record, I choose the highest probability EIA record match as long as the difference between that record and the next highest probability is over a set threshold. But happy to go back to the interquartile range. It might perform slightly better.
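A rough sketch of the threshold approach described above (column names like `match_prob` are assumed for illustration, not the PR's actual schema):

```python
# Illustrative sketch, not the PR's exact code: keep, per FERC record, the
# highest-probability EIA candidate only when it beats the runner-up by a threshold.
import pandas as pd

def pick_best_matches(match_df: pd.DataFrame, difference_threshold: float) -> pd.DataFrame:
    """match_df has one row per candidate pair, with columns
    record_id_ferc1, record_id_eia, and match_prob (names assumed)."""
    df = match_df.sort_values("match_prob", ascending=False)
    grouped = df.groupby("record_id_ferc1")["match_prob"]
    df["count"] = grouped.transform("size")
    # probability gap between the best and second-best candidate per FERC record
    df["diff"] = grouped.transform(
        lambda s: s.iloc[0] - (s.iloc[1] if len(s) > 1 else 0.0)
    )
    # first row per group is the highest-probability candidate (frame is sorted)
    best = df.groupby("record_id_ferc1", as_index=False).first()
    return best[(best["count"] == 1) | (best["diff"] >= difference_threshold)]
```

Records with a single candidate are always kept; records whose top two candidates are too close get dropped as ambiguous.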

Member Author


I think the answer to this also depends on how certain you want matches to be. And that can also be dictated by the thresholds that are set.

@katie-lamb katie-lamb self-assigned this Feb 7, 2023
@katie-lamb katie-lamb marked this pull request as ready for review February 7, 2023 07:29
@@ -1263,7 +809,7 @@ def _get_subtable(table_name):
return r_eia_years[r_eia_years.record_id_ferc1.str.contains(f"{table_name}")]

def _get_match_pct(df):
return round(len(df[df["record_id_eia"].notna()]) / len(df))
return len(df[df["record_id_eia"].notna()]) / len(df)
Member Author


This round was rounding to either 1 or 0, since the number being rounded is between 0 and 1. You can get rid of this round entirely, since the logged coverage below is already rounded to .01 in the f-string.
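A minimal illustration of the bug being described, assuming the fraction is later formatted with an f-string:

```python
# round() on a 0-1 fraction collapses it to 0 or 1, losing the actual value;
# f-string percent formatting handles display rounding on its own.
coverage = 0.479  # e.g. 47.9% of records matched

old_value = round(coverage)          # -> 0, so the log reported 0% coverage
display = f"EIA matches: {coverage:.1%}"  # -> "EIA matches: 47.9%"
```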

Member Author

@katie-lamb katie-lamb Feb 8, 2023


Leaving this comment here since I can't comment on the actual line itself.

In the below coverage stats, the coverage for each of the FERC tables is printing the coverage percentage for all years of EIA data, not just the subset that was included in the pudl_out object that generated the plant parts list for this model run. So when the CI runs, the coverage numbers are going to be low because they were only ever looking at 5 years of data. But I guess the numbers are accurate, because with 5 years of data the coverage across all EIA working years should be low. It's just not a very helpful stat, but maybe it doesn't really matter.

Member


oh good! i noticed that these outputs were all 0% when i moved it over and it was wigging me out but i hadn't investigated why yet.

@katie-lamb katie-lamb changed the title WIP: rework of FERC to EIA logistic regression model Rework of FERC to EIA logistic regression model Feb 8, 2023
Member

@cmgosnell cmgosnell left a comment


this looks awesome! It'd be great if you could add some arg docs but honestly besides that it looks great. i appreciate your actual knowledge of sklearn 🤣

Comment on lines 515 to 517
match_df = match_df[
    (match_df["count"] == 1) | (match_df["diff"] >= difference_threshold)
]
Member


the thing that this doesn't do which i was previously doing is communicating how many of the best matches are outside of the difference_threshold. iirc I used the murk and ties very early on to identify the need to generate the true_gran flag to remove those records that were basically the same that the model couldn't distinguish. So in retrospect I'm not totally sure this is needed anymore... but the last time I ran the model I was still getting ~6% of records that we effectively weren't choosing as matches because they didn't meet this difference threshold.

    best match vs ferc:    65.46%
    best match vs matches: 91.10%
    murk vs matches:        5.52%
    ties vs matches:        1.17%

Member Author

@katie-lamb katie-lamb Feb 9, 2023


Ah okay I didn't get what these stats meant before.

I added "Percent of predicted matches that have a best match" and "Percent of matches not meeting match probability difference threshold".

@katie-lamb
Member Author

@cmgosnell I added docstrings, got the CI to pass, and put in stats about the number of murky matches that are thrown out. So I think this PR is pretty much ready to go. Two little lingering questions:

In the prettyify_best_matches function there are stats logged about the percent of coverage for each FERC plant table. Currently these percentages represent coverage from all years of data, not just the years of data that the model is being run on. So when the CI runs on fewer years of data, the coverage percentages are very low. Is this okay? These don't seem to be numbers that are looked at very often, but if someone did run the CI and see low coverage numbers, I wouldn't want them to be worried.

In the original model, "murky" matches were EIA records that were not distinct enough from the next best EIA match. In this variation, "murky" matches are EIA records whose probability of being a match is not sufficiently higher than that of the next best match, e.g. a 91% match probability versus 90%. When we're looking at differences in probability, not differences in the actual feature vectors, does it still make sense to throw out "murky" matches? Should I go back to looking at the difference in the distinctness of the feature vectors, since that aligns more with the original intention of "murkiness"?

@cmgosnell
Member

cmgosnell commented Feb 10, 2023

@katie-lamb ty for all of this!! I don't have straightforward answers to your questions 🙃

coverage

I thiiiink the coverage logging is okay(-ish)! The input to it is connects_ferc1_eia which is the matches found, which should only contain the years of data in the PudlTabl tables and/or the years of data in the db - which on the CI should only be two. It restricts the years based on working 860 years because FERC1 has wayyy more older years and if we didn't do that the coverage results would look greatly reduced when you are running it with all of the old ferc1 data.

I could very much be wrong about that and I'm a little surprised that the steam table's matches are so low in this pr's action logs:

Coverage for matches during EIA working years:
    Fuel type: 98.6%
    Tech type: 97.5%
Coverage for all steam table records during EIA working years:
    EIA matches: 47.9%
Coverage for all small gen table records during EIA working years:
    EIA matches: 30.1%
Coverage for all hydro table records during EIA working years:
    EIA matches: 75.9%
Coverage for all pumped storage table records during EIA working years:
    EIA matches: 4.8%

murk

I think you're right that this new version of murk is less like the old version of murk. I was led down the path of determining the distinctiveness of the # 1 and # 2 winning-iest match because it felt like the only way to know whether there were two records that were too similar in the model's eyes. I'm not sure if this is the right way to think about it at all. I could certainly be persuaded that we should just always take the # 1 most probable result.

In my last run of the rl before these changes we were dropping ~6% of matches bc they were murky wins, but it looks like your version of murk drops way more:

Percent of predicted matches that have a best match: 74.04%
Percent of matches not meeting match probability difference threshold: 25.96%

@katie-lamb
Member Author

It restricts the years based on working 860 years because FERC1 has wayyy more older years and if we didn't do that the coverage results would look greatly reduced when you are running it with all of the old ferc1 data.

Yep, I should have been clearer about my question. To use an example, if connects_ferc1_eia has EIA data from 2020 and 2021 because the CI is running, then the working 860 years are still 2000 - 2021 (or whatever it really is). So then the coverage will be quite low, because the count of not-null records in connects_ferc1_eia will be divided by all the EIA records from 2000 - 2021, not just the 2020 and 2021 records.

@katie-lamb
Member Author

I'm not sure if this is the right way to think about it at all. I could certainly be persuaded that we should just always take the # 1 most probable result.

I think when trying to understand how the model was thinking about similar matches and distinguishing between records, we run the risk of oversimplifying it. Certainly logistic regression is fairly intuitive, but I still think we should use a metric that the model provides. It seems more defensible, and ultimately the model is making a decision about what is a match and what is not, so we shouldn't add that many layers of our own decision making on top of that. I could be persuaded to get rid of the concept of "murky matches" entirely as well and just use the highest scoring match.

In my last run of the rl before these changes we were dropping ~6% of matches bc they were murky wins, but it looks like your version of murk drops way more:

Percent of predicted matches that have a best match: 74.04%
Percent of matches not meeting match probability difference threshold: 25.96%

This percentage of murky matches being dropped is really high because the CI is currently only running on 2 years of data (2020 and 2021, I left a comment about this in your original PR). I think this results in the model overfitting, and I don't think these numbers are a good representation because the data is so limited with only 2 years. When I run it on 2015 - 2020 it drops 10% of matches, and when I bump the score threshold down to .01 it drops 6% of matches.

@katie-lamb
Member Author

Update update:

  • I took out the murky match function, now it just chooses the highest scoring match for each FERC record

Stats on 5 years of data without the murky matches removed (seems pretty much the same):

Percent of training data matches correctly predicted: 0.86
Percent of training data overwritten in matches: 0.094
  • I looked back at the FERC table coverage logging and you're right about it already restricting the years @cmgosnell . For some reason I thought that the FERC records weren't restricted by the pudl_out start and end date years when I was looking at it before.
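The simplified selection described in the first bullet (no murky-match removal, just the top candidate per FERC record) can be sketched like this, with column names assumed for illustration:

```python
# Hypothetical one-liner version of the final approach: keep only the
# highest-probability EIA candidate for each FERC record.
import pandas as pd

candidates = pd.DataFrame({
    "record_id_ferc1": ["f1", "f1", "f2"],
    "record_id_eia": ["e1", "e2", "e3"],
    "match_prob": [0.91, 0.90, 0.60],
})
best = candidates.loc[candidates.groupby("record_id_ferc1")["match_prob"].idxmax()]
```

Even a 91% vs 90% near-tie resolves to the top candidate here, which is exactly the "always take the # 1 most probable result" behavior discussed above.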

These are the coverage stats on 5 years of data:

2023-02-10 10:04:10 [    INFO] catalystcoop.pudl.analysis.ferc1_eia:827 Coverage for matches during EIA working years:
    Fuel type: 97.0%
    Tech type: 96.0%
Coverage for all steam table records during EIA working years:
    EIA matches: 87.8%
Coverage for all small gen table records during EIA working years:
    EIA matches: 49.9%
Coverage for all hydro table records during EIA working years:
    EIA matches: 91.7%
Coverage for all pumped storage table records during EIA working years:
    EIA matches: 91.0%
2023-02-10 10:04:10 [    INFO] catalystcoop.pudl.analysis.ferc1_eia:892 Matches with consistency across years of all matches is 88.1%
2023-02-10 10:04:10 [    INFO] catalystcoop.pudl.analysis.ferc1_eia:905 Matches with completely consistent FERC capacity have a consistency of 94.6%
2023-02-10 10:04:10 [    INFO] catalystcoop.pudl.analysis.ferc1_eia:892 Matches with consistency across years of overrides matches is 78.0%
2023-02-10 10:04:10 [    INFO] catalystcoop.pudl.analysis.ferc1_eia:905 Matches with completely consistent FERC capacity have a consistency of 89.8%

I think the only thing left to do is work out these merge conflicts so we can merge this into your branch. I think I left a comment or two on your original branch that we should take a look at before merging into dev.

@cmgosnell
Member

cmgosnell commented Feb 10, 2023

heck yeah! This looks great and is very simplifying.

For some reason I thought that the FERC records weren't restricted by the pudl_out start and end date years when I was looking at it before.

this totally was the case! but we fixed it in #2238

once you get the conflicts untangled... which hopefully won't be too bad, I'll give it another quick looksee and smash that approve button

Member

@cmgosnell cmgosnell left a comment


wahoo!

@katie-lamb katie-lamb merged commit 40f0c4b into ferc_eia_rl Feb 11, 2023
@katie-lamb katie-lamb deleted the ferc-eia-rl-katie-rework branch February 11, 2023 03:32