Operational data table eia861 #691
Conversation
…and O/I to True/False.
The one potential change/issue I see in here is with the duplication of records with the multiple NERC records.
```
total,63,58,30,30,30,30,30,29,29,29,29,30,30,30,30,30,30,30,30,30
data_type_o_=_observed_i_=_imputed,-1,-1,31,31,31,31,31,30,30,30,30,31,31,31,31,31,31,31,31,31
revenue_total,63,58,30,30,30,30,30,29,29,29,29,30,30,30,30,30,30,30,30,30
data_observed,-1,-1,31,31,31,31,31,30,30,30,30,31,31,31,31,31,31,31,31,31
```
👍
src/pudl/transform/eia861.py
Outdated
```python
df (pandas.DataFrame): A DataFrame with the column 'nerc_region' to be
    cleaned.
idx_cols (list): A list of the primary keys.
idx_no_nerc (list): A list of the primary keys, not including nerc_region.
```
Because the `idx_no_nerc` could be easily generated from `idx_cols`, I might generate `idx_no_nerc` inside this function so you don't need to pass it in. Something like:

```python
idx_no_nerc = idx_cols
if "nerc_region" in idx_cols:
    idx_no_nerc.remove("nerc_region")
```
This seems like the weirdest bug... when I run your code, `nerc_region` gets removed from both `idx_no_nerc` AND `idx_cols`...
Fixed by adding `idx_cols.copy()`, strange!
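The "weirdest bug" above is ordinary Python list aliasing: assignment copies the reference, not the list, so `remove()` mutates the caller's list too. A minimal sketch (with hypothetical column names) of why `.copy()` fixes it:

```python
# Without .copy(): both names point at the same list object.
idx_cols = ["utility_id_eia", "state", "nerc_region"]
idx_no_nerc = idx_cols
idx_no_nerc.remove("nerc_region")
# The caller's list was mutated too:
mutated = "nerc_region" not in idx_cols  # True

# With .copy(): the new list is independent of the original.
idx_cols = ["utility_id_eia", "state", "nerc_region"]
idx_no_nerc = idx_cols.copy()
idx_no_nerc.remove("nerc_region")
```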
src/pudl/transform/eia861.py
Outdated
```python
.fillna('UNK')
.str.findall(r'[A-Z]+'))),
multiple_nercs=(lambda x: (
    x.nerc_region.apply(lambda x: len(x) > 1)))
```
Is this trying to get a boolean column? It might be faster to use `np.where` instead of an apply. Something like:

```python
multiple_nercs = lambda x: np.where((len(x.nerc_region) > 1), True, False)
```
This one was complicated. In this case, `x` is actually a temporary list within the column cell. The function reads the moosh of data in the `nerc_region` column: if it's something like `SPP` then it becomes `['SPP']`, but if it's something like `SPP & ERCOT` it becomes `['SPP', 'ERCOT']`. The `len(x) > 1` is looking for instances where there is more than one NERC region in a given row (i.e., the list length is greater than 1) and then splits those into multiple rows so that there is only one NERC region per row. The single NERC values are then turned to strings and validated against recognized NERC regions. If you can think of a more efficient way to do this I would love to know! Right now it needs to be a list rather than a boolean so I can extract the region name later when the rows are separated. I'll keep poking at it.
But, looking at it again, I think your suggestion might still work? I'll have to take a closer look.
do not feel obligated to change this, because it definitely looks like it works as is!
It looks like the function still needs an apply to reach the actual cell value as opposed to the Series. I was able to change it to `df.nerc_region.apply(lambda x: np.where((len(x) > 1), True, False))`, but I'm not sure that makes a big difference.
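Since the column holds lists, one way to skip the Python-level apply entirely is `Series.str.len()`, which also works element-wise on list values. A sketch (hypothetical data):

```python
import pandas as pd

s = pd.Series([["SPP"], ["SPP", "ERCOT"], ["UNK"]])

# .str.len() works on lists as well as strings, so the comparison
# yields a boolean Series directly, with no per-row lambda.
multiple_nercs = s.str.len() > 1
```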
src/pudl/transform/eia861.py
Outdated
```python
@@ -1407,6 +1505,7 @@ def non_net_metering(tfr_dfs):
        keep_totals=True
    )

    tidy_nnm = tidy_nnm.drop_duplicates()
```
do we need another drop duplicates in here? maybe this is left over from the og data duplicate mess?
Yes, thanks for catching that! That was just my interim solution before merging with your sprint19 changes.
src/pudl/transform/eia861.py
Outdated
```python
.unstack(-2)
.reset_index(-1, drop=True)
.reset_index()
)
```
My one fear about this is that we are going to end up having multiple rows with different NERC regions but the same data. Although I'm not sure what the best thing would be here :-/
Do you know the scale of the problem here? Like how many `nerc_multiples` exist? How many unique bad strings/combos are there? If it is not too many we could flag, replace, or null them. We should at least think through how to deal with these duplicate records if we keep them.
Luckily, most instances of multiple entries in a single cell are composed of a legitimate NERC region paired with a non-legitimate NERC region. I don't have exact numbers, but essentially what this means is that we go through all the effort of splitting them apart only to delete the "fake" NERC row.
Instead of generating these new records it might be better to break the records apart like you are currently doing, drop or clean the non-recognized codes, and when there are still multiple recognizable codes squish them back together in a standard way: sort the multiple codes so they are all in the same order and squish them back into one string separated by `-` or `_`. This way you could break them apart easily and we would have a smaller number of these weird combo codes.
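The sort-then-join standardization suggested above is small enough to sketch directly (hypothetical helper name and separator choice):

```python
def standardize_combo(regions):
    """Collapse a list of recognized codes into one canonical string.

    Sorting first means ['SPP', 'ERCOT'] and ['ERCOT', 'SPP'] both
    become 'ERCOT_SPP', so the combo strings stay deterministic and
    are trivial to split back apart on '_' later.
    """
    return "_".join(sorted(regions))

combo_a = standardize_combo(["SPP", "ERCOT"])
combo_b = standardize_combo(["ERCOT", "SPP"])
```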
Will do!
One thing I was thinking about when I made this the first time around was whether I should keep at least a placeholder for the weird, non-nerc regions listed. In the first iteration I changed them to UNK. Do you think I should just drop them? I'll keep a copy of the old function in case we want to keep tweaking it.
src/pudl/transform/eia861.py
Outdated
```python
# Backfill the utility with strange numeric nerc region
nerc_df.loc[nerc_df.utility_id_eia == 55959, 'nerc_region'] = 'MRO'
```
This bit is throwing that weird loc statement error.
Ahh. I think your problem is actually the line above, `nerc_df = df[idx_cols]`. If you change this to a `loc` selection or make a copy it should stop the warning.
Also, I think you could probably add the numeric nerc region into the `NERC_SPELLCHECK`. Not necessary, but it might make this function slightly simpler.
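A sketch of the suggested fix (hypothetical columns): `df[idx_cols]` can return a view of `df`, so a later `.loc` assignment triggers `SettingWithCopyWarning`; an explicit `.copy()` makes ownership unambiguous.

```python
import pandas as pd

df = pd.DataFrame({
    "utility_id_eia": [55959, 123],
    "nerc_region": ["9", "MRO"],
    "extra": [1, 2],
})
idx_cols = ["utility_id_eia", "nerc_region"]

# .copy() detaches nerc_df from df, so the .loc assignment below
# writes into an independent frame and raises no warning.
nerc_df = df[idx_cols].copy()
nerc_df.loc[nerc_df.utility_id_eia == 55959, "nerc_region"] = "MRO"
```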
Good point! Thanks
```python
@@ -1572,7 +1579,7 @@ def operational_data(tfr_dfs):
    ###########################################################################

    transformed_od = (
```
I noticed here that there are some column headers that could potentially, but not necessarily, be normalized. They all start with 'revenue' (below). Do you think I ought to normalize them too?
revenue_from_delivery_customers
revenue_from_sales_for_resale
revenue_from_credits_or_adjustments
revenue_from_transmission
revenue_from_other
revenue_total
It looks like if we were to normalize these columns they'd need to be pulled out into a separate table, because the new idx_cols would be different for this set of columns than it would be for the rest. Is that true?
I suppose it would be pretty easy to grab just these revenue columns, use `_tidy_class_dfs`, and save two tables to `tfr_dfs`. And that way you'd only have to turn 1000s of dollars back into dollars on one column.
That seems like it might be nice, but you are more familiar with this table than I am.
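A simplified sketch of the idea, assuming hypothetical index columns and using a plain `melt` in place of the project's `_tidy_class_dfs` helper: pull only the revenue columns into their own tall table, then convert thousands of dollars to dollars once on the single value column.

```python
import pandas as pd

df = pd.DataFrame({
    "utility_id_eia": [1, 2],
    "report_date": ["2018-01-01", "2018-01-01"],
    "revenue_from_sales_for_resale": [10.0, 20.0],
    "revenue_total": [15.0, 25.0],
})

idx_cols = ["utility_id_eia", "report_date"]
rev_cols = [c for c in df.columns if c.startswith("revenue")]

# Melt only the revenue columns into a tall table, then do the
# thousands-of-dollars conversion on one column instead of six.
rev_df = (
    df.melt(id_vars=idx_cols, value_vars=rev_cols,
            var_name="revenue_class", value_name="revenue")
    .assign(revenue=lambda x: x.revenue * 1000.0)
)
```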
This is an important point because there are some other tables where normalization only applies to some of the columns. For instance, the `net_metering` and `non_net_metering` tables' use of `fuel_class` as a normalizer does not apply to every column. Is this something that should be corrected? Happy to hop on the phone to explain.
Regarding index values, I think they would be the same? But I might not be fully understanding it.
👍 minor comments about nerc region assignment. and lmk what you think about the normalization idea for the revenue.
```python
non_nerc_list = [
    nerc_entity for nerc_entity in nerc_list
    if nerc_entity not in RECOGNIZED_NERC_REGIONS + list(NERC_SPELLCHECK.keys())]
logger.info(
    f'The following reported NERC regions are not currently recognized and become UNK values: {non_nerc_list}')
```
I really like what you are doing here! I know you said it feels clunky but I actually think it is very clear for the problem you are trying to solve.
```python
nerc_df['nerc_region'] = (
    nerc_df['nerc_region']
    .apply(lambda x: x if x in RECOGNIZED_NERC_REGIONS else 'UNK')
    .apply(lambda x: [i if i not in NERC_SPELLCHECK.keys() else NERC_SPELLCHECK[i] for i in x])
```
This one might be cleaner and faster using replace. It looks like your `NERC_SPELLCHECK` is already perfectly set up for it.
Hmm that'd be great! I'm not sure I see how to incorporate replace, though. It's a little tricky because it's a column of lists.
oooh! I forgot about the column of lists thing! ignore this then :-)
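Since `Series.replace` can't reach inside list-valued cells, one compact option is `dict.get` inside the existing apply, which collapses the conditional comprehension into a single lookup. A sketch with a hypothetical, abbreviated spellcheck mapping (the real `NERC_SPELLCHECK` lives in pudl's constants):

```python
import pandas as pd

# Hypothetical subset of a typo -> canonical-code mapping.
NERC_SPELLCHECK = {"ERCO": "ERCOT", "WSCC": "WECC"}

s = pd.Series([["SPP"], ["ERCO", "SPP"], ["WSCC"]])

# dict.get(i, i) returns the corrected code when i is a known typo
# and passes i through unchanged otherwise.
s = s.apply(lambda region_list: [NERC_SPELLCHECK.get(i, i) for i in region_list])
```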
… dictionary; add .copy() statement to avoid slicing errors.
…utput for operational_data function.
this looks awesome!
src/pudl/transform/eia861.py
Outdated
```python
retail_sales_revenue=lambda x: x.retail_sales_revenue * 1000.0,
sales_for_resale_revenue=lambda x: x.sales_for_resale_revenue * 1000.0,
transmission_revenue=lambda x: x.transmission_revenue * 1000.0,
total_revenue=lambda x: x.total_revenue * 1000.0,
```
If all of these columns will be squished into one via `_tidy_class_dfs`, could we run this assign on just one column once they are squished?
That's a great point!
src/pudl/transform/eia861.py
Outdated
```python
tfr_dfs["operational_data_eia861"] = transformed_od
tfr_dfs["operational_data_revenue_eia861"] = tidy_od_rev
tfr_dfs["operational_data_other_eia861"] = transformed_od_other
```
this whole process looks great!
```python
# underscores. Final string looks like: '(?<=customer_test)_|(?<=unbundled)_'
# This ensures that the underscore AFTER the desired string (that can also include underscores)
# is where the column headers are split, not just the first underscore.
class_list_regex = '|'.join(['(?<=' + col + ')_' for col in class_list])
```
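A tiny demo of why the lookbehind matters, using the two class names from the comment above: the underscore inside `customer_test` is not preceded by the full class name, so only the underscore after it splits.

```python
import re

class_list = ["customer_test", "unbundled"]

# Builds '(?<=customer_test)_|(?<=unbundled)_': an underscore matches
# only when the characters immediately before it are a whole class name.
class_list_regex = "|".join(["(?<=" + col + ")_" for col in class_list])

# Splits after 'customer_test', not at the first underscore.
parts = re.split(class_list_regex, "customer_test_revenue")
```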
this is great! just one line for so much extra simplicity in the column names 👍
Great! I'm going to change some stuff in the net_metering/non_net_metering files.
…erscores in their names.
…ables and Operational Data Table; change some column headers to accommodate wide-to-tall formatting and class delineation.
Codecov Report
```
@@            Coverage Diff             @@
##           sprint20     #691      +/-   ##
============================================
+ Coverage     73.47%   73.60%   +0.13%
============================================
  Files            39       39
  Lines          4490     4610     +120
============================================
+ Hits           3299     3393      +94
- Misses         1191     1217      +26
```
Continue to review full report at Codecov.
This looks very good! I have no code specific comments, really.
This is mostly a next step question: do you have a sense of how we should handle these new misc tables. Or like, some way of keeping track of the ones that could be squished together. I am just forgetting where we came down on these little stragglers.
It didn't appear that this table needed any normalization, but let me know if you think otherwise! The biggest changes in this PR are:

- `_clean_nerc_add_rows()` function
- `operational_data` data
- `column_dtype` dictionary in constants.