Implement drop_invalid_rows() for fuel_ferc1 table #1903

zaneselvans · 2022-09-04T20:39:18Z

Create a fuel_ferc1 table-specific drop_invalid_rows() method, which both makes use of the parameterized method inherited from the AbstractTableTransformer, and adds a separate method for identifying rows which we believe to be plant totals, rather than rows that pertain to individual fuels.

This results in dropping more than 2/3 of all the records in the fuel_ferc1 table (around 100,000 rows).

I also made some name changes, trying to be a little more standard and (hopefully) informative:

* remove_invalid_rows() => drop_invalid_rows(), since there were already several methods defined in the FERC 1 table transformers that do similar things in more specific circumstances, and they were named drop_*_cols() or drop_*_rows().
* cols_to_check => required_valid_cols and cols_to_not_check => allowed_invalid_cols to provide some indication as to what is being checked for (validity / invalidity) in relation to the other transform parameter (invalid_values). These could still be improved though, I think.
* DropInvalidRows => InvalidRows to follow the convention of the other TransformParams which are nouns describing their contents, while the names of the methods & functions that they parameterize describe the actions they take with verbs.
* Moved the documentation of the InvalidRows parameters into the class definition.

Closes #1853

cmgosnell

This looks good overall! I really like the move to compile the transform_params inside of the classes instead of having them be a global variable hanging around.

There's nothing to change code wise from this comment... but I do sometimes feel frustrated when I see you rewriting things I just finished writing/you just reviewed. I try not to take it personally and assume this is partially a part of your learning process but it does feel frustrating and it sometimes seems arbitrary and controlling. Trying not to be a downer, but the feeling is real.

cmgosnell · 2022-09-06T00:01:17Z

src/pudl/transform/classes.py

+    def one_filter_argument(cls, values):
+        """Validate that only one argument is specified for :meth:`pd.filter`."""
+        num_args = sum(
+            int(bool(val))


maybe you think this is more clear this way, but this was previously doing the same thing without the int(bool()) here.

I found the double-negatives both in the variable name and the logic used to figure out how many arguments had been provided confusing.

This would work with just the bool(val) but for readability I wanted to make it explicit that the (potentially non-boolean) values were getting converted to bools, and then we were going to treat the bools as numbers for counting.

This arrangement also works correctly with inputs that show up as boolean False, even if they aren't None. E.g. if someone gives an empty list for one of the lists of columns, and a non-empty list for the other, the empty list counts as a False (so a zero) and that set of inputs is actually fine for what the function is doing.
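
A small standalone illustration of the counting behavior described above. The helper name `count_provided` is hypothetical and is not the validator in the diff; it only shows how truthiness-based counting differs from counting non-None values when one input is an empty list.

```python
def count_provided(*args) -> int:
    """Count arguments that are truthy (hypothetical helper for illustration)."""
    # bool() converts each value to True/False; int() makes the 1/0 explicit.
    return sum(int(bool(arg)) for arg in args)


def count_not_none(*args) -> int:
    """Alternative that counts anything other than None, for comparison."""
    return sum(arg is not None for arg in args)


# An empty list is falsy, so only the non-empty list is counted here.
truthy_count = count_provided([], ["fuel_units"])
# Counting non-None values would treat the empty list as a second argument.
not_none_count = count_not_none([], ["fuel_units"])
```

With truthiness-based counting, an empty list plus a non-empty list passes an "exactly one argument" check, which matches the behavior described in the comment above.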

cmgosnell · 2022-09-06T00:07:35Z

src/pudl/transform/ferc1.py


        In the case of the fuel_ferc1 table, we drop any row where all the data columns
        are null AND there's a non-null value in the ``fuel_mmbtu_per_mwh`` column, as
        it typically indicates a "total" row for a plant. We also require a null value
        for the fuel_units and an "other" value for the fuel type.

-        Right now this is extremely stringent and almost all rows are retained.


why remove this? is this no longer correct? I think this is a helpful lil tidbit.

I made this comment in relation to the idea that this was supposed to be dropping null / invalid records in addition to dropping total rows, but that wasn't really what the function was doing before -- it was only dropping total rows, and keeping about 100,000 functionally null records around.

Now that this method is just a minor add-on / refinement that comes after the much more sweeping drop_invalid_rows() it didn't seem as weird/noteworthy that it only drops a small number of rows.

(and it also now logs the number of rows that it drops)
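
As a rough sketch of the kind of total-row filter the docstring describes, with logging of how many rows are dropped. This is not the method in the PR: the function name, the `data_cols` argument, and the column names `fuel_type`, `fuel_consumed_units` and `fuel_cost_per_unit` are assumptions for illustration. Only `fuel_mmbtu_per_mwh`, `fuel_units` and the "other" fuel type are taken from the docstring.

```python
import logging

import pandas as pd

logger = logging.getLogger(__name__)


def drop_total_rows_sketch(df: pd.DataFrame, data_cols: list[str]) -> pd.DataFrame:
    """Hypothetical illustration of dropping rows that look like plant totals."""
    is_total = (
        df[data_cols].isna().all(axis="columns")  # all data columns are null
        & df["fuel_mmbtu_per_mwh"].notna()  # but a heat rate is reported
        & df["fuel_units"].isna()  # no fuel units given
        & (df["fuel_type"] == "other")  # fuel type column name is assumed
    )
    logger.info(
        "Dropping %s of %s rows that look like plant totals.",
        is_total.sum(),
        len(df),
    )
    return df.loc[~is_total].copy()


fuel = pd.DataFrame(
    {
        "fuel_consumed_units": [None, 120.0, None],
        "fuel_cost_per_unit": [None, 35.5, None],
        "fuel_mmbtu_per_mwh": [10.5, 9.0, None],
        "fuel_units": [None, "ton", None],
        "fuel_type": ["other", "coal", "other"],
    }
)
without_totals = drop_total_rows_sketch(
    fuel, data_cols=["fuel_consumed_units", "fuel_cost_per_unit"]
)
```

In this example only the first row meets every condition. The third row has no heat rate, so it is not treated as a total and would be left for the more general invalid-row filter to handle.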

codecov · 2022-09-06T01:06:51Z

Codecov Report

❗ No coverage uploaded for pull request base (ferc1-transform-phases@9a94f26).
The diff coverage is n/a.

❗ Current head 77eaf9b differs from pull request most recent head fd20677. Consider uploading reports for the commit fd20677 to get more accurate results

@@                   Coverage Diff                    @@
##             ferc1-transform-phases   #1903   +/-   ##
========================================================
  Coverage                          ?   83.3%           
========================================================
  Files                             ?      65           
  Lines                             ?    7430           
  Branches                          ?       0           
========================================================
  Hits                              ?    6192           
  Misses                            ?    1238           
  Partials                          ?       0

zaneselvans added the ferc1 and data-cleaning labels Sep 4, 2022

zaneselvans requested a review from cmgosnell September 4, 2022 20:39

zaneselvans self-assigned this Sep 4, 2022

zaneselvans linked an issue Sep 4, 2022 that may be closed by this pull request

Refine generic table transform architecture #1853

Closed

16 tasks

zaneselvans added 5 commits September 4, 2022 16:49

Merge branch 'ferc1-transform-phases' into drop-invalid-fuel-rows

83f9b5a

Integrate compilation of transform params into AbstractTabletransformer

e2e867a

Integrate compilation of transform params into AbstractTabletransformer

85241ef

Expand docstring to explain what modules in the subpackage need to do.

d2354fc

Merge branch 'ferc1-transform-phases' into drop-invalid-fuel-rows

fd20677

cmgosnell approved these changes Sep 6, 2022

zaneselvans added 3 commits September 5, 2022 20:48

Merge branch 'ferc1-transform-phases' into drop-invalid-fuel-rows

2db046c

Clean up a couple of docstrings.

39d4853

Merge branch 'ferc1-transform-phases' into drop-invalid-fuel-rows

a78cfba

zaneselvans merged commit b1f18b5 into ferc1-transform-phases Sep 6, 2022

zaneselvans deleted the drop-invalid-fuel-rows branch September 6, 2022 18:32

zaneselvans mentioned this pull request Sep 6, 2022

Refactor fuel_ferc1 transform for XBRL + DBF inputs #1722

Closed

36 tasks
