Operational data table eia861 #691
Conversation
…and O/I to True/False.
The one potential change/issue I see in here is with the duplication of records with the multiple NERC records.
```
total,63,58,30,30,30,30,30,29,29,29,29,30,30,30,30,30,30,30,30,30
data_type_o_=_observed_i_=_imputed,-1,-1,31,31,31,31,31,30,30,30,30,31,31,31,31,31,31,31,31,31
revenue_total,63,58,30,30,30,30,30,29,29,29,29,30,30,30,30,30,30,30,30,30
data_observed,-1,-1,31,31,31,31,31,30,30,30,30,31,31,31,31,31,31,31,31,31
```
👍
src/pudl/transform/eia861.py
Outdated
```python
df (pandas.DataFrame): A DataFrame with the column 'nerc_region' to be
    cleaned.
idx_cols (list): A list of the primary keys.
idx_no_nerc (list): A list of the primary keys, not including nerc_region.
```
Because the `idx_no_nerc` could be easily generated from `idx_cols`, I might generate `idx_no_nerc` inside this function so you don't need to pass it in. Something like:

```python
idx_no_nerc = idx_cols
if "nerc_region" in idx_cols:
    idx_no_nerc.remove("nerc_region")
```
This seems like the weirdest bug... when I run your code, `nerc_region` gets removed from both `idx_no_nerc` AND `idx_cols`...
Fixed by adding `idx_cols.copy()`, strange!
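The "weirdest bug" above is ordinary Python list aliasing: assignment copies the reference, not the list, so `remove()` mutates the caller's list too. A minimal sketch (with hypothetical column names) of why `.copy()` fixes it:

```python
# Without .copy(): both names point at the same list object.
idx_cols = ["utility_id_eia", "state", "nerc_region"]
idx_no_nerc = idx_cols
idx_no_nerc.remove("nerc_region")
# The caller's list was mutated too:
mutated = "nerc_region" not in idx_cols  # True

# With .copy(): the new list is independent of the original.
idx_cols = ["utility_id_eia", "state", "nerc_region"]
idx_no_nerc = idx_cols.copy()
idx_no_nerc.remove("nerc_region")
```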
src/pudl/transform/eia861.py
Outdated
```python
.fillna('UNK')
.str.findall(r'[A-Z]+'))),
multiple_nercs=(lambda x: (
    x.nerc_region.apply(lambda x: len(x) > 1)))
```
Is this trying to get a boolean column? It might be faster to use `np.where` instead of an apply. Something like:

```python
multiple_nercs = lambda x: np.where((len(x.nerc_region) > 1), True, False)
```
This one was complicated. In this case, `x` is actually a temporary list within the column cell. The function reads the moosh of data in the `nerc_region` column: if it's something like `SPP` then it becomes `['SPP']`, but if it's something like `SPP & ERCOT` it becomes `['SPP', 'ERCOT']`. The `len(x) > 1` is looking for instances where there is more than one NERC region in a given row (i.e., the list length is greater than 1) and then splits those into multiple rows so that there is only one NERC region per row. The single NERC values are then turned to strings and validated against recognized NERC regions. If you can think of a more efficient way to do this I would love to know! Right now it needs to be a list rather than a boolean so I can extract the region name later when the rows are separated. I'll keep poking at it.
But, looking at it again, I think your suggestion might still work? I'll have to take a closer look.
do not feel obligated to change this, because it definitely looks like it works as is!
It looks like the function still needs an apply to reach the actual cell value as opposed to the Series. I was able to change it to `df.nerc_region.apply(lambda x: np.where((len(x) > 1), True, False))`, but I'm not sure that makes a big difference.
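Since the column holds lists, one way to skip the Python-level apply entirely is `Series.str.len()`, which also works element-wise on list values. A sketch (hypothetical data):

```python
import pandas as pd

s = pd.Series([["SPP"], ["SPP", "ERCOT"], ["UNK"]])

# .str.len() works on lists as well as strings, so the comparison
# yields a boolean Series directly, with no per-row lambda.
multiple_nercs = s.str.len() > 1
```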
src/pudl/transform/eia861.py
Outdated
```python
@@ -1407,6 +1505,7 @@ def non_net_metering(tfr_dfs):
        keep_totals=True
    )

    tidy_nnm = tidy_nnm.drop_duplicates()
```
do we need another drop duplicates in here? maybe this is left over from the og data duplicate mess?
Yes, thanks for catching that! That was just my interim solution before merging with your sprint19 changes.
src/pudl/transform/eia861.py
Outdated
```python
.unstack(-2)
.reset_index(-1, drop=True)
.reset_index()
)
```
My one fear about this is that we are going to end up having multiple rows with different NERC regions but the same data. Although I'm not sure what the best thing would be here :-/
Do you know the scale of the problem here? Like how many `nerc_multiples` exist? How many unique bad strings/combos are there? If it is not too many we could flag, replace, or null them. We should at least think through how to deal with these duplicate records if we keep them.
Luckily, most instances of multiple entries in a single cell are composed of a legitimate NERC region paired with a non-legitimate NERC region. I don't have exact numbers, but essentially what this means is that we go through all the effort of splitting them apart only to delete the "fake" NERC row.
Instead of generating these new records it might be better to break the records apart like you are currently doing, drop or clean the non-recognized codes, and when there are still multiple recognizable codes squish them back together in a standard way: sort the multiple codes so they are all in the same order and squish them back into one string separated by `-` or `_`. This way you could break them apart easily and we would have a smaller number of these weird combo codes.
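The sort-then-join standardization suggested above is small enough to sketch directly (hypothetical helper name and separator choice):

```python
def standardize_combo(regions):
    """Collapse a list of recognized codes into one canonical string.

    Sorting first means ['SPP', 'ERCOT'] and ['ERCOT', 'SPP'] both
    become 'ERCOT_SPP', so the combo strings stay deterministic and
    are trivial to split back apart on '_' later.
    """
    return "_".join(sorted(regions))

combo_a = standardize_combo(["SPP", "ERCOT"])
combo_b = standardize_combo(["ERCOT", "SPP"])
```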
Will do!
One thing I was thinking about when I made this the first time around was whether I should keep at least a placeholder for the weird, non-nerc regions listed. In the first iteration I changed them to UNK. Do you think I should just drop them? I'll keep a copy of the old function in case we want to keep tweaking it.
src/pudl/transform/eia861.py
Outdated
```python
# Backfill the utility with strange numeric nerc region
nerc_df.loc[nerc_df.utility_id_eia == 55959, 'nerc_region'] = 'MRO'
```
This bit is throwing that weird loc statement error.
Ahh. I think your problem is actually the line above, `nerc_df = df[idx_cols]`. If you change this to a `loc` selection or make a copy it should stop the warning.
Also, I think you could probably add the numeric nerc region into the `NERC_SPELLCHECK`. Not necessary, but it might make this function slightly simpler.
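A sketch of the suggested fix (hypothetical columns): `df[idx_cols]` can return a view of `df`, so a later `.loc` assignment triggers `SettingWithCopyWarning`; an explicit `.copy()` makes ownership unambiguous.

```python
import pandas as pd

df = pd.DataFrame({
    "utility_id_eia": [55959, 123],
    "nerc_region": ["9", "MRO"],
    "extra": [1, 2],
})
idx_cols = ["utility_id_eia", "nerc_region"]

# .copy() detaches nerc_df from df, so the .loc assignment below
# writes into an independent frame and raises no warning.
nerc_df = df[idx_cols].copy()
nerc_df.loc[nerc_df.utility_id_eia == 55959, "nerc_region"] = "MRO"
```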
Good point! Thanks
```python
@@ -1572,7 +1579,7 @@ def operational_data(tfr_dfs):
    ###########################################################################

    transformed_od = (
```
I noticed here that there are some column headers that could potentially, but not necessarily, be normalized. They all start with 'revenue' (below). Do you think I ought to normalize them too?
revenue_from_delivery_customers
revenue_from_sales_for_resale
revenue_from_credits_or_adjustments
revenue_from_transmission
revenue_from_other
revenue_total
It looks like if we were to normalize these columns they'd need to be pulled out into a separate table, because the new idx_cols would be different for this set of columns than it would be for the rest. Is that true?
I suppose it would be pretty easy to grab just these revenue columns, use `_tidy_class_dfs`, and save two tables to `tfr_dfs`. And that way you'd only have to turn 1000s of dollars back into dollars on one column.
That seems like it might be nice, but you are more familiar with this table than I am.
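A simplified sketch of the idea, assuming hypothetical index columns and using a plain `melt` in place of the project's `_tidy_class_dfs` helper: pull only the revenue columns into their own tall table, then convert thousands of dollars to dollars once on the single value column.

```python
import pandas as pd

df = pd.DataFrame({
    "utility_id_eia": [1, 2],
    "report_date": ["2018-01-01", "2018-01-01"],
    "revenue_from_sales_for_resale": [10.0, 20.0],
    "revenue_total": [15.0, 25.0],
})

idx_cols = ["utility_id_eia", "report_date"]
rev_cols = [c for c in df.columns if c.startswith("revenue")]

# Melt only the revenue columns into a tall table, then do the
# thousands-of-dollars conversion on one column instead of six.
rev_df = (
    df.melt(id_vars=idx_cols, value_vars=rev_cols,
            var_name="revenue_class", value_name="revenue")
    .assign(revenue=lambda x: x.revenue * 1000.0)
)
```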
This is an important point because there are some other tables where normalization only applies to some of the columns. For instance, the `net_metering` and `non_net_metering` tables' use of `fuel_class` as a normalizer does not apply to every column. Is this something that should be corrected? Happy to hop on the phone to explain.
Regarding index values, I think they would be the same? But I might not be fully understanding it.
👍 minor comments about nerc region assignment. and lmk what you think about the normalization idea for the revenue.
```python
non_nerc_list = [
    nerc_entity for nerc_entity in nerc_list
    if nerc_entity not in RECOGNIZED_NERC_REGIONS + list(NERC_SPELLCHECK.keys())]
logger.info(
    f'The following reported NERC regions are not currently recognized and become UNK values: {non_nerc_list}')
```
I really like what you are doing here! I know you said it feels clunky but I actually think it is very clear for the problem you are trying to solve.
```python
nerc_df['nerc_region'] = (
    nerc_df['nerc_region']
    .apply(lambda x: x if x in RECOGNIZED_NERC_REGIONS else 'UNK')
    .apply(lambda x: [i if i not in NERC_SPELLCHECK.keys() else NERC_SPELLCHECK[i] for i in x])
```
This one might be cleaner and faster using replace. It looks like your `NERC_SPELLCHECK` is already perfectly set up for it.
Hmm that'd be great! I'm not sure I see how to incorporate replace, though. It's a little tricky because it's a column of lists.
oooh! I forgot about the column of lists thing! ignore this then :-)
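Since `Series.replace` can't reach inside list-valued cells, one compact option is `dict.get` inside the existing apply, which collapses the conditional comprehension into a single lookup. A sketch with a hypothetical, abbreviated spellcheck mapping (the real `NERC_SPELLCHECK` lives in pudl's constants):

```python
import pandas as pd

# Hypothetical subset of a typo -> canonical-code mapping.
NERC_SPELLCHECK = {"ERCO": "ERCOT", "WSCC": "WECC"}

s = pd.Series([["SPP"], ["ERCO", "SPP"], ["WSCC"]])

# dict.get(i, i) returns the corrected code when i is a known typo
# and passes i through unchanged otherwise.
s = s.apply(lambda region_list: [NERC_SPELLCHECK.get(i, i) for i in region_list])
```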
… dictionary; add .copy() statement to avoid slicing errors.
…utput for operational_data function.
this looks awesome!
src/pudl/transform/eia861.py
Outdated
```python
retail_sales_revenue=lambda x: x.retail_sales_revenue * 1000.0,
sales_for_resale_revenue=lambda x: x.sales_for_resale_revenue * 1000.0,
transmission_revenue=lambda x: x.transmission_revenue * 1000.0,
total_revenue=lambda x: x.total_revenue * 1000.0,
```
If all of these columns will be squished into one via `_tidy_class_dfs`, could we run this assign on just one column once they are squished?
That's a great point!
src/pudl/transform/eia861.py
Outdated
```python
tfr_dfs["operational_data_eia861"] = transformed_od
tfr_dfs["operational_data_revenue_eia861"] = tidy_od_rev
tfr_dfs["operational_data_other_eia861"] = transformed_od_other
```
this whole process looks great!
```python
# underscores. Final string looks like: '(?<=customer_test)_|(?<=unbundled)_'
# This ensures that the underscore AFTER the desired string (that can also include underscores)
# is where the column headers are split, not just the first underscore.
class_list_regex = '|'.join(['(?<=' + col + ')_' for col in class_list])
```
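A tiny demo of why the lookbehind matters, using the two class names from the comment above: the underscore inside `customer_test` is not preceded by the full class name, so only the underscore after it splits.

```python
import re

class_list = ["customer_test", "unbundled"]

# Builds '(?<=customer_test)_|(?<=unbundled)_': an underscore matches
# only when the characters immediately before it are a whole class name.
class_list_regex = "|".join(["(?<=" + col + ")_" for col in class_list])

# Splits after 'customer_test', not at the first underscore.
parts = re.split(class_list_regex, "customer_test_revenue")
```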
this is great! just one line for so much extra simplicity in the column names 👍
Great! I'm going to change some stuff in the net_metering/non_net_metering files.
…erscores in their names.
…ables and Operational Data Table; change some column headers to accommodate wide-to-tall formatting and class delineation.
Codecov Report
```
@@            Coverage Diff             @@
##           sprint20     #691      +/-   ##
============================================
+ Coverage     73.47%   73.60%   +0.13%
============================================
  Files            39       39
  Lines          4490     4610     +120
============================================
+ Hits           3299     3393      +94
- Misses         1191     1217      +26
```
Continue to review full report at Codecov.
This looks very good! I have no code specific comments, really.
This is mostly a next step question: do you have a sense of how we should handle these new misc tables. Or like, some way of keeping track of the ones that could be squished together. I am just forgetting where we came down on these little stragglers.
It didn't appear that this table needed any normalization, but let me know if you think otherwise! The biggest changes in this PR are:

- `_clean_nerc_add_rows()` function
- `operational_data` data
- `column_dtype` dictionary in constants.