
Operational data table eia861 #691

Merged: 16 commits into sprint20, Aug 6, 2020
Conversation

aesharpe (Member, Author):
It didn't appear that this table needed any normalization, but let me know if you think otherwise! The biggest changes in this PR are:

  • Addition of _clean_nerc_add_rows() function
  • Transformation of additional operational_data data
  • Addition of operational_data and non_net_metering column dtypes to column_dtype dictionary in constants.

@aesharpe aesharpe added the eia861 Anything having to do with EIA Form 861 label Jul 16, 2020
@aesharpe aesharpe added this to the PUDL Sprint 19 milestone Jul 16, 2020
@aesharpe aesharpe requested a review from cmgosnell July 16, 2020 20:49
cmgosnell (Member) left a comment:

The one potential change/issue I see in here is the duplication of records caused by the multiple NERC regions.

total,63,58,30,30,30,30,30,29,29,29,29,30,30,30,30,30,30,30,30,30
data_type_o_=_observed_i_=_imputed,-1,-1,31,31,31,31,31,30,30,30,30,31,31,31,31,31,31,31,31,31
revenue_total,63,58,30,30,30,30,30,29,29,29,29,30,30,30,30,30,30,30,30,30
data_observed,-1,-1,31,31,31,31,31,30,30,30,30,31,31,31,31,31,31,31,31,31
cmgosnell (Member):

👍

df (pandas.DataFrame): A DataFrame with the column 'nerc_region' to be
cleaned.
idx_cols (list): A list of the primary keys.
idx_no_nerc (list): A list of the primary keys, not including nerc_region.
cmgosnell (Member):

Because the idx_no_nerc could be easily generated from idx_cols, I might generate idx_no_nerc inside this function so you don't need to pass it in.

something like:

idx_no_nerc = idx_cols
if "nerc_region" in idx_cols:
    idx_no_nerc.remove("nerc_region")

aesharpe (Member, Author) commented Jul 17, 2020:

This seems like the weirdest bug....when I run your code, nerc_region gets removed from both idx_no_nerc AND idx_cols....

aesharpe (Member, Author):

fixed by adding idx_cols.copy(), strange!
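The bug here is plain Python list aliasing, not pandas. A minimal sketch (column names are illustrative, not the real idx_cols):

```python
# Plain-Python illustration of the aliasing bug above: assigning a list
# to a new name does not copy it, so .remove() mutates both names.
idx_cols = ["report_date", "utility_id_eia", "nerc_region"]
idx_no_nerc = idx_cols               # both names point at the SAME list
idx_no_nerc.remove("nerc_region")
print("nerc_region" in idx_cols)     # False -- it vanished from idx_cols too

# The fix: copy first, then mutate only the copy.
idx_cols = ["report_date", "utility_id_eia", "nerc_region"]
idx_no_nerc = idx_cols.copy()
if "nerc_region" in idx_no_nerc:
    idx_no_nerc.remove("nerc_region")
print("nerc_region" in idx_cols)     # True -- idx_cols is untouched
```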

.fillna('UNK')
.str.findall(r'[A-Z]+'))),
multiple_nercs=(lambda x: (
x.nerc_region.apply(lambda x: len(x) > 1)))
cmgosnell (Member):

Is this trying to get a boolean column? It might be faster to use np.where instead of an apply. Something like:

multiple_nercs = lambda x: np.where((len(x.nerc_region) > 1), True, False)

aesharpe (Member, Author):

This one was complicated. In this case, x is actually a temporary list within the column cell. The function reads the moosh of data in the nerc_region column: if it's something like SPP then it becomes ['SPP'], but if it's something like SPP & ERCOT it becomes ['SPP', 'ERCOT']. The len(x) > 1 is looking for instances where there is more than one NERC region in a given row (i.e., the list length is greater than 1) and then splits those into multiple rows so that there is only one NERC region per row. The single NERC values are then turned into strings and validated against recognized NERC regions. If you can think of a more efficient way to do this I would love to know! Right now it needs to be a list rather than a boolean so I can extract the region name later when the rows are separated. I'll keep poking at it.
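The flow described here can be sketched roughly like this, with illustrative data and pandas' DataFrame.explode standing in for however the PR actually splits the rows apart:

```python
import pandas as pd

# Messy nerc_region strings become lists of candidate codes, rows with
# more than one code are flagged, then each code gets its own row.
# (Illustrative data, not the real EIA-861 table.)
df = pd.DataFrame({
    "utility_id_eia": [1, 2, 3],
    "nerc_region": ["SPP", "SPP & ERCOT", None],
})

df["nerc_region"] = (
    df["nerc_region"]
    .fillna("UNK")
    .str.findall(r"[A-Z]+")   # 'SPP & ERCOT' -> ['SPP', 'ERCOT']
)
df["multiple_nercs"] = df["nerc_region"].apply(lambda x: len(x) > 1)

df = df.explode("nerc_region")   # one NERC region per row
print(df["nerc_region"].tolist())   # ['SPP', 'SPP', 'ERCOT', 'UNK']
```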

aesharpe (Member, Author):

But, looking at it again, I think your suggestion might still work? I'll have to take a closer look.

cmgosnell (Member):

do not feel obligated to change this, because it definitely looks like it works as is!

aesharpe (Member, Author):

It looks like the function still needs an apply to reach the actual cell value as opposed to the series. I was able to change it to df.nerc_region.apply(lambda x: np.where((len(x) > 1), True, False)) but I'm not sure that makes a big difference.
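For what it's worth, pandas can compute the flag with no apply at all: Series.str.len() returns the length of each element, including list cells. A small sketch with made-up data:

```python
import pandas as pd

# Vectorized alternative to the .apply() discussed above: on a column
# whose cells are lists, Series.str.len() returns each list's length,
# so the boolean flag needs no Python-level lambda.
s = pd.Series([["SPP"], ["SPP", "ERCOT"], ["UNK"]], name="nerc_region")

multiple_nercs = s.str.len() > 1
print(multiple_nercs.tolist())   # [False, True, False]
```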

@@ -1407,6 +1505,7 @@ def non_net_metering(tfr_dfs):
keep_totals=True
)

tidy_nnm = tidy_nnm.drop_duplicates()
cmgosnell (Member):

Do we need another drop_duplicates in here? Maybe this is left over from the og data duplicate mess?

aesharpe (Member, Author):

Yes, thanks for catching that! That was just my interim solution before merging with your sprint19 changes.

.unstack(-2)
.reset_index(-1, drop=True)
.reset_index()
)
cmgosnell (Member):

My one fear about this is that we are going to end up having multiple rows with different nerc regions but the same data. Although I'm not sure what the best thing would be here :-/

Do you know the scale of the problem here? like how many nerc_multiples exist? How many unique bad strings/combos are there? If it is not too many we could flag, replace or null them. We should at least think through how to deal with these duplicate records if we keep them.

aesharpe (Member, Author):

Luckily, most instances of multiple entries in a single cell consist of a legitimate NERC region paired with a non-legitimate one. I don't have exact numbers, but essentially what this means is that we go through all the effort of splitting them apart only to delete the "fake" NERC row.

cmgosnell (Member):

Instead of generating these new records, it might be better to break the records apart like you are currently, drop or clean the non-recognized codes, and, when there are still multiple recognizable codes, squish them back together in a standard way: sort the multiple codes so they are all in the same order and squish them back into one string separated by - or _. This way you could break them apart easily and we would have a smaller number of these weird combo codes.
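A rough sketch of this standardize-and-rejoin idea. The region list is an illustrative subset and standardize_nerc_combo is a hypothetical helper name, not code from this PR:

```python
# Drop unrecognized codes, sort the survivors, and rejoin them so that
# equivalent combos always share one canonical spelling.
RECOGNIZED_NERC_REGIONS = ["SPP", "ERCOT", "MRO"]  # illustrative subset

def standardize_nerc_combo(codes):
    """Drop unrecognized codes, sort the rest, and rejoin into one string."""
    good = sorted(c for c in codes if c in RECOGNIZED_NERC_REGIONS)
    return "_".join(good) if good else "UNK"

print(standardize_nerc_combo(["SPP", "ERCOT"]))   # ERCOT_SPP
print(standardize_nerc_combo(["ERCOT", "SPP"]))   # ERCOT_SPP (same spelling)
print(standardize_nerc_combo(["FAKE"]))           # UNK
```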

aesharpe (Member, Author):

Will do!

aesharpe (Member, Author):

One thing I was thinking about when I made this the first time around was whether I should keep at least a placeholder for the weird, non-nerc regions listed. In the first iteration I changed them to UNK. Do you think I should just drop them? I'll keep a copy of the old function in case we want to keep tweaking it.

Comment on lines 728 to 730
# Backfill the utility with strange numeric nerc region
nerc_df.loc[nerc_df.utility_id_eia == 55959, 'nerc_region'] = 'MRO'

aesharpe (Member, Author):

This bit is throwing that weird loc statement error.

cmgosnell (Member):

Ahh. I think your problem is actually the line above: nerc_df = df[idx_cols]. If you change this to a loc selection or make a copy it should stop the warning.

Also, I think you could probably add the numeric nerc region into the NERC_SPELLCHECK. Not necessary, but it might make this function slightly simpler.
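A minimal sketch of the suggested fix: df[idx_cols] can hand back a view, and a later .loc assignment into it triggers SettingWithCopyWarning; an explicit copy makes ownership clear. The data here is illustrative, and only the 55959 → MRO backfill comes from the diff above:

```python
import pandas as pd

# Copy the column selection before assigning into it, so pandas does not
# warn about chained assignment on a possible view.
df = pd.DataFrame({
    "utility_id_eia": [55959, 123],
    "nerc_region": ["9904", "SPP"],   # illustrative values
})
idx_cols = ["utility_id_eia", "nerc_region"]

nerc_df = df[idx_cols].copy()   # explicit copy: no SettingWithCopyWarning
nerc_df.loc[nerc_df.utility_id_eia == 55959, "nerc_region"] = "MRO"
print(nerc_df.nerc_region.tolist())   # ['MRO', 'SPP']
```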

aesharpe (Member, Author):

Good point! Thanks

@@ -1572,7 +1579,7 @@ def operational_data(tfr_dfs):
###########################################################################

transformed_od = (
aesharpe (Member, Author) commented Jul 21, 2020:

I noticed here that there are some column headers that could potentially but not necessarily be normalized. They all start with 'revenue' (below). Do you think I ought to normalize them too?

revenue_from_delivery_customers
revenue_from_sales_for_resale
revenue_from_credits_or_adjustments
revenue_from_transmission
revenue_from_other
revenue_total

cmgosnell (Member):

it looks like if we were to normalize these columns they'd need to be pulled out into a separate table. because the new idx_cols would be different for this set of columns than it would be for the rest. Is that true?

I suppose it would be pretty easy to grab just these revenue columns, use _tidy_class_dfs and save two tables to tfr_dfs. And that way you'd only have to turn 1000s of dollars back into dollars on one column.

That seems like it might be nice, but you are more familiar with this table than I am.
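One way this two-table split might look, sketched with pandas melt standing in for the real _tidy_class_dfs helper (column names and values are illustrative):

```python
import pandas as pd

# Pull the revenue_* columns into their own tidy table, then convert
# 1000s of dollars to dollars on exactly one column.
df = pd.DataFrame({
    "report_date": ["2019-01-01"],
    "utility_id_eia": [1],
    "revenue_from_delivery_customers": [10.0],   # in 1000s of dollars
    "revenue_from_transmission": [2.5],
})
idx_cols = ["report_date", "utility_id_eia"]
revenue_cols = [c for c in df.columns if c.startswith("revenue")]

tidy_rev = (
    df.melt(id_vars=idx_cols, value_vars=revenue_cols,
            var_name="revenue_class", value_name="revenue")
    # the 1000s-of-dollars conversion now happens on one column, once
    .assign(revenue=lambda x: x.revenue * 1000.0)
)
print(sorted(tidy_rev.revenue))   # [2500.0, 10000.0]
```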

aesharpe (Member, Author) commented Jul 21, 2020:

This is an important point because there are some other tables where normalization only applies to some of the columns. For instance, the net_metering and non_net_metering tables' use of fuel_class as a normalizer does not apply to every column. Is this something that should be corrected? Happy to hop on the phone to explain.
Regarding index values, I think they would be the same? But I might not be fully understanding it.

cmgosnell (Member) left a comment:

👍 minor comments about nerc region assignment. and lmk what you think about the normalization idea for the revenue.


non_nerc_list = [
nerc_entity for nerc_entity in nerc_list if nerc_entity not in RECOGNIZED_NERC_REGIONS + list(NERC_SPELLCHECK.keys())]
logger.info(
f'The following reported NERC regions are not currently recognized and become UNK values: {non_nerc_list}')
cmgosnell (Member):

I really like what you are doing here! I know you said it feels clunky but I actually think it is very clear for the problem you are trying to solve.

nerc_df['nerc_region'] = (
nerc_df['nerc_region']
.apply(lambda x: x if x in RECOGNIZED_NERC_REGIONS else 'UNK')
.apply(lambda x: [i if i not in NERC_SPELLCHECK.keys() else NERC_SPELLCHECK[i] for i in x])
cmgosnell (Member):

This one might be cleaner and faster using replace. It looks like your NERC_SPELLCHECK is already perfectly set up for it.

aesharpe (Member, Author):

Hmm that'd be great! I'm not sure I see how to incorporate replace, though. It's a little tricky because it's a column of lists.

cmgosnell (Member):

oooh! I forgot about the column of lists thing! ignore this then :-)


Austen Sharpe added 2 commits July 21, 2020 17:41
… dictionary; add .copy() statement to avoid slicing errors.
cmgosnell (Member) left a comment:

this looks awesome!

retail_sales_revenue=lambda x: x.retail_sales_revenue * 1000.0,
sales_for_resale_revenue=lambda x: x.sales_for_resale_revenue * 1000.0,
transmission_revenue=lambda x: x.transmission_revenue * 1000.0,
total_revenue=lambda x: x.total_revenue * 1000.0,
cmgosnell (Member):

If all of these columns will be squished in one via _tidy_class_dfs, could we run this assign on just one column once these are squished?

aesharpe (Member, Author):

That's a great point!


tfr_dfs["operational_data_eia861"] = transformed_od
tfr_dfs["operational_data_revenue_eia861"] = tidy_od_rev
tfr_dfs["operational_data_other_eia861"] = transformed_od_other
cmgosnell (Member):

this whole process looks great!

# underscores. Final string looks like: '(?<=customer_test)_|(?<=unbundled)_'
# This ensures that the underscore AFTER the desired string (that can also include underscores)
# is where the column headers are split, not just the first underscore.
class_list_regex = '|'.join(['(?<=' + col + ')_' for col in class_list])
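The lookbehind trick in that line can be seen in isolation (class_list here is illustrative, matching the comment's own example):

```python
import re

# A lookbehind per class name means the split happens at the underscore
# AFTER the class token, even when the token itself contains underscores.
class_list = ["customer_test", "unbundled"]
class_list_regex = "|".join(["(?<=" + col + ")_" for col in class_list])
# -> '(?<=customer_test)_|(?<=unbundled)_'

print(re.split(class_list_regex, "customer_test_sales_mwh"))  # ['customer_test', 'sales_mwh']
print(re.split(class_list_regex, "unbundled_revenue"))        # ['unbundled', 'revenue']
```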
cmgosnell (Member):

this is great! just one line for so much extra simplicity in the column names 👍

aesharpe (Member, Author):

Great! I'm going to change some stuff in the net_metering/non_net_metering files.

@aesharpe aesharpe modified the milestones: PUDL Sprint 19, PUDL Sprint 20 Jul 24, 2020
@zaneselvans zaneselvans changed the base branch from sprint19 to sprint20 July 29, 2020 14:26
Austen Sharpe added 3 commits July 31, 2020 10:06
…ables and Operational Data Table; change some column headers to accommodate wide-to-tall formatting and class delineation.
codecov bot commented Aug 3, 2020:

Codecov Report

Merging #691 into sprint20 will increase coverage by 0.13%.
The diff coverage is 100.00%.


@@             Coverage Diff              @@
##           sprint20     #691      +/-   ##
============================================
+ Coverage     73.47%   73.60%   +0.13%     
============================================
  Files            39       39              
  Lines          4490     4610     +120     
============================================
+ Hits           3299     3393      +94     
- Misses         1191     1217      +26     
Impacted Files Coverage Δ
src/pudl/constants.py 100.00% <100.00%> (ø)
src/pudl/transform/eia861.py 96.28% <100.00%> (+1.24%) ⬆️
src/pudl/workspace/datastore.py 59.82% <0.00%> (+0.47%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update efa909c...6240cad. Read the comment docs.

cmgosnell (Member) left a comment:

This looks very good! I have no code specific comments, really.

This is mostly a next-step question: do you have a sense of how we should handle these new misc tables? Or some way of keeping track of the ones that could be squished together? I am just forgetting where we came down on these little stragglers.

@aesharpe aesharpe merged commit 55d0327 into sprint20 Aug 6, 2020
@aesharpe aesharpe deleted the operational_data_eia861 branch August 18, 2020 16:39