Rework of FERC to EIA logistic regression model #2276
Conversation
Codecov Report
Base: 86.0% // Head: 85.8% // Decreases project coverage by 0.2%.
Additional details and impacted files:
@@ Coverage Diff @@
## ferc_eia_rl #2276 +/- ##
=============================================
- Coverage 86.0% 85.8% -0.2%
=============================================
Files 74 74
Lines 9334 9237 -97
=============================================
- Hits 8030 7931 -99
- Misses 1304 1306 +2
src/pudl/analysis/ferc1_eia.py
Outdated
so weight the feature vectors. With the sum of the weighted feature vectors for
each model match, this method selects the highest scoring match via
:meth:`calc_best_matches()`.
def remove_murky_matches(match_df, difference_threshold):
Previously, you were choosing a best match based on weighting the feature vectors by their coefficients and finding the interquartile range. Do you remember if there was a reason for using that method to select the best FERC to EIA match? Also, if I understood correctly, the murky matches were just being dropped even if that means a FERC record is left with no match?
Here, I thought I'd experiment a little with a simpler method. For each FERC record, I choose the highest-probability EIA match as long as the difference between its probability and the next-highest probability is over a set threshold. But I'm happy to go back to the interquartile range. It might perform slightly better.
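Roughly, the simpler method looks like the sketch below. This isn't the exact code in the PR; the column names (record_id_ferc1, match_prob) and the helper name are just illustrative.

import pandas as pd

def pick_best_matches(match_df: pd.DataFrame, difference_threshold: float) -> pd.DataFrame:
    # Sort so the most probable EIA candidate comes first within each FERC record.
    ranked = match_df.sort_values(
        ["record_id_ferc1", "match_prob"], ascending=[True, False]
    )
    probs = ranked.groupby("record_id_ferc1")["match_prob"]
    ranked = ranked.assign(
        count=probs.transform("count"),
        # Probability gap between the best and second-best candidate (1.0 if only one candidate).
        diff=probs.transform(lambda p: p.iloc[0] - p.iloc[1] if len(p) > 1 else 1.0),
    )
    # Keep the top candidate per FERC record, but only when it clearly beats the runner-up.
    best = ranked.drop_duplicates(subset="record_id_ferc1", keep="first")
    return best[(best["count"] == 1) | (best["diff"] >= difference_threshold)]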
I think the answer to this also depends on how certain you want matches to be. And that can also be dictated by the thresholds that are set.
@@ -1263,7 +809,7 @@ def _get_subtable(table_name):
    return r_eia_years[r_eia_years.record_id_ferc1.str.contains(f"{table_name}")]
def _get_match_pct(df):
-    return round(len(df[df["record_id_eia"].notna()]) / len(df))
+    return len(df[df["record_id_eia"].notna()]) / len(df)
This round was rounding to either 1 or 0, since the number it's rounding is between 0 and 1. You can get rid of the round entirely, since the logged coverage below is already rounded to .01 in the f-string.
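A toy illustration of what was happening (the numbers here are made up, not from an actual run):

match_pct = 7931 / 9237        # fraction of FERC records with an EIA match, about 0.86
round(match_pct)               # rounds to the nearest whole number: 1 here, 0 for anything below 0.5
f"coverage: {match_pct:.1%}"   # the f-string already rounds for display: 'coverage: 85.9%'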
Leaving this comment here since I can't comment on the actual line itself.
In the below coverage stats, the coverage for each of the FERC tables is printing the coverage percentage for all years of EIA data, not just the subset that was included in the pudl_out
object that generated the plant parts list for this model run. So when the CI runs, the coverage numbers are going to be low because they were only ever looking at 5 years of data. But I guess the numbers are accurate, because with 5 years of data the coverage across all EIA working years should be low. It's just not a very helpful stat, but maybe it doesn't really matter.
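If it ever does matter, one option would be to filter to the run's report years before computing coverage. A hypothetical tweak (it assumes a report_year column on the dataframe, which may not match the actual code):

def _get_match_pct(df, report_years=None):
    # Optionally restrict coverage to the EIA years actually included in this run.
    if report_years is not None:
        df = df[df["report_year"].isin(report_years)]
    return len(df[df["record_id_eia"].notna()]) / len(df)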
oh good! i noticed that these outputs were all 0% when i moved it over and it was wigging me out but i hadn't investigated why yet.
this looks awesome! It'd be great if you could add some arg docs but honestly besides that it looks great. i appreciate your actual knowledge of sklearn
🤣
src/pudl/analysis/ferc1_eia.py
Outdated
match_df = match_df[
    (match_df["count"] == 1) | (match_df["diff"] >= difference_threshold)
]
the thing that this doesn't do, which i was previously doing, is communicating how many of the best matches are outside of the difference_threshold. iirc I used the murk and ties very early on to identify the need to generate the true_gran flag to remove those records that were basically the same, that the model couldn't distinguish. So in retrospect I'm not totally sure this is needed anymore... but the last time I ran the model I was still getting ~6% of records that we effectively weren't choosing as matches because they didn't meet this difference threshold.
best match vs ferc: 65.46%
best match vs matches: 91.10%
murk vs matches: 5.52%
ties vs matches: 1.17%
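If we keep the threshold, one cheap way to preserve that visibility would be to log the share of best matches that fall below it before they get filtered out. A sketch, assuming the count and diff columns from the snippet above (names are illustrative):

import logging

logger = logging.getLogger(__name__)

def log_murky_share(match_df, difference_threshold):
    # Best matches whose probability gap to the runner-up is below the threshold.
    murky = match_df[(match_df["count"] > 1) & (match_df["diff"] < difference_threshold)]
    logger.info(
        f"murky matches dropped: {len(murky)} ({len(murky) / len(match_df):.1%} of best matches)"
    )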
Ah okay I didn't get what these stats meant before.
I added "Percent of predicted matches that have a best match" and "Percent of matches not meeting match probability difference threshold".
@cmgosnell I added docstrings, got the CI to pass, and put in stats about the number of murky matches that are thrown out. So I think this PR is pretty much ready to go. Two little lingering questions:
In the original model, "murky" matches were EIA records that were not distinct enough from the next best EIA match. In this variation, "murky" matches are EIA records whose probability of being a match is not a big enough distinction from the next best match, i.e. a 91% probability of a match versus a 90% probability. When we're looking at differences in probability, not differences in the actual feature vectors, does it still make sense to throw out "murky" matches? Should I go back to looking at the difference in the distinctness of the feature vectors, since that aligns more with the original intention of "murkiness"?
@katie-lamb ty for all of this!! I don't have straightforward answers to your questions 🙃
coverage
I thiiiink the coverage logging is okay(-ish)! The input to it is … I could very much be wrong about that, and I'm a little surprised that the steam table's matches are so low in this pr's action logs:
murk
I think you're right that this new version of murk is less like the old version of murk. I was led down the path of determining the distinctiveness of the #1 and #2 winning-iest match because it felt like the only way to know whether there were two records that were too similar in the model's eyes. I'm not sure if this is the right way to think about it at all. I could certainly be persuaded that we should just always take the #1 most probable result. In my last run of the rl before these changes we were dropping ~6% of matches bc they were murky wins, but it looks like your version of murk drops way more:
Yep, I should have been more clear about my question. To use an example, if …
I think when trying to understand how the model was thinking about similar matches and distinguishing between records, we run the risk of oversimplifying it. Certainly logistic regression is fairly intuitive, but I still think we should use a metric that the model provides. It seems more defensible, and ultimately the model is making a decision about what is a match and what is not, so we shouldn't add that many layers of our own decision making on top of that. I could be persuaded to get rid of the concept of "murky matches" entirely as well and just use the highest scoring match.
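Concretely, the metric I mean is the probability the classifier itself assigns to each candidate pair. A rough sketch of how that could drive the selection (the function, column, and feature names here are placeholders, not the actual PR code):

from sklearn.linear_model import LogisticRegression

def score_candidates(train_features, train_labels, candidate_df, feature_cols):
    # Fit on the labeled FERC-EIA training pairs, then score every candidate pair
    # with the model's own probability of being a true match.
    model = LogisticRegression(max_iter=1000).fit(train_features, train_labels)
    scored = candidate_df.copy()
    scored["match_prob"] = model.predict_proba(scored[feature_cols])[:, 1]
    # Highest-probability EIA candidate per FERC record.
    return scored.sort_values("match_prob", ascending=False).drop_duplicates("record_id_ferc1")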
This percentage of murky matches being dropped is really high because the CI is currently only running on 2 years of data (2020 and 2021; I left a comment about this in your original PR). I think this results in the model overfitting, and I don't think these numbers are a good representation because the data is so limited with only 2 years. When I run it on 2015 - 2020 it drops 10% of matches, and when I bump the score threshold down to …
Update update:
Stats on 5 years of data without the murky matches removed (seems pretty much the same):
These are the coverage stats on 5 years of data:
I think the only thing left to do is work out these merge conflicts so we can merge this into your branch. I think I left a comment or two on your original branch that we should take a look at before merging into …
heck yeah! This looks great and is very simplifying.
this totally was the case! but we fixed it in #2238. once you get the conflicts untangled, which hopefully won't be too bad, I'll give it another quick looksee and smash that approve button
wahoo!
PR Overview
There are still some fixes that need to be made (see the checklist below), but feel free to take a look since there are a bunch of changes.
The main benefit of these changes is the speed increase. On my machine it takes me 25 seconds to run all of execute. But almost all of this speed increase comes from replacing ModelTuner with GridSearchCV. The other stuff that I rewrote (the MatchManager functions to override best matches and remove murky matches) is maybe just a nicer way to do things. If it's too much change, then we can just use the GridSearchCV model change.
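For anyone skimming, the ModelTuner-to-GridSearchCV swap boils down to something like the sketch below. The parameter grid, scoring choice, and variable names are illustrative, not necessarily what this PR settles on:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

def tune_model(train_features, train_labels):
    # Cross-validated grid search over a few logistic regression hyperparameters.
    param_grid = {"C": [0.01, 0.1, 1, 10], "class_weight": [None, "balanced"]}
    search = GridSearchCV(
        LogisticRegression(max_iter=1000),
        param_grid,
        scoring="f1",  # pick the combination that maximizes F-score on held-out folds
        cv=5,
    )
    search.fit(train_features, train_labels)
    return search.best_estimator_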
On 5 years of data (2015 - 2020) I get the following stats from the best model on the validation set:
Accuracy: 0.88
F-Score: 0.92
Precision: 0.9
Recall: 0.94
Percent of training data matches correctly predicted: 0.86
Percent of training data overwritten in matches: 0.083
When I tried to run the original model (in Christina's branch) for comparison, the recordlinkage model was going too crazy and crashed my computer LOL so I need to go back and run that. For some reason that model won't work for me and I can't get this logger to silence itself so it's hard to debug.
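In case it helps with that debugging: noisy third-party loggers can usually be quieted by name through the standard logging module. This assumes the chatty logger is registered under the recordlinkage package name, which I haven't verified:

import logging

# Suppress everything below WARNING from the recordlinkage package's loggers.
logging.getLogger("recordlinkage").setLevel(logging.WARNING)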
Places That Changed Significantly
I changed the way that one-to-many FERC to EIA matches are reduced to one match. See my comments around the remove_murky_matches functions for more specifics. I wasn't sure if there were very intentional reasons why you removed murky matches in the way that you did or if this could be done in a simpler way. I think this also depends on the degree of certainty required for the matches.
Remaining To Do
A MatchManager function might be relevant, but maybe it's simple enough to have as three separate functions.
Merge dev into ferc_eia_rl and then subsequently into ferc-eia-rl-katie-rework.
Get rid of the round.