Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement drop_invalid_rows() for fuel_ferc1 table #1903

Merged
merged 9 commits into from
Sep 6, 2022

Conversation

zaneselvans
Copy link
Member

@zaneselvans zaneselvans commented Sep 4, 2022

Create a fuel_ferc1 table-specific drop_invalid_rows() method, which both makes use of the parameterized method inherited from the AbstractTableTransformer, and adds a separate method for identifying rows which we believe to be plant totals, rather than rows that pertain to individual fuels.

This results in dropping more than 2/3 of all the records in the fuel_ferc1 table (around 100,000 rows).

I also made some name changes, trying to be a little more standard and (hopefully) informative:

  • remove_invalid_rows() => drop_invalid_rows(), since there were already several methods defined in the FERC 1 table transformers that do similar things in more specific circumstances, and they were named drop_*_cols() or drop_*_rows().
  • cols_to_check => required_valid_cols and cols_to_not_check => allowed_invalid_cols to provide some indication as to what is being checked for (validity / invalidity) in relation to the other transform parameter (invalid_values). These could still be improved though I think.
  • DropInvalidRows => InvalidRows to follow the convention of the other TransformParams which are nouns describing their contents, while the names of the methods & functions that they parameterize describe the actions they take with verbs.
  • Moved the documentation of the InvalidRows parameters into the class definition.

Closes #1853

Create a fuel_ferc1 table specific drop_invalid_rows method, which both makes use of the
parameterized method inherited from the AbstractTableTransformer, and adds a separate
method for identifying rows which we believe to be plant totals, rather than rows that
pertain to individual fuels.

This results in dropping more than 2/3 of all the records in the fuel_ferc1 table.

I also made some name changes, trying to be a little more standard and (hopefully)
informative:

* remove_invalid_rows => drop_invalid_rows, since there were already several methods
  defined in the FERC 1 table transformers that do similar things in more specific
  circumstances, and they were named drop_*_cols or drop_*_rows.
* cols_to_check => required_valid_cols and cols_to_not_check => allowed_invalid_cols to
  provide some indication as to what is being checked for (validity / invalidity) in
  relation to the other transform parameter (invalid_values)
* DropInvalidRows => InvalidRows to follow the convention of the other TransformParams
  which are nouns describing their contents, while the methods & functions that they
  parameterize are verbs/actions.
* Moved the documentation of the InvalidRows parameters into the class definition.
@zaneselvans zaneselvans added ferc1 Anything having to do with FERC Form 1 data-cleaning Tasks related to cleaning & regularizing data during ETL. labels Sep 4, 2022
@zaneselvans zaneselvans self-assigned this Sep 4, 2022
@zaneselvans zaneselvans linked an issue Sep 4, 2022 that may be closed by this pull request
16 tasks
Copy link
Member

@cmgosnell cmgosnell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall! I really like the move to compile the transform_params inside of the classes instead of having them be a global variable hanging around.

There's nothing to change code wise from this comment... but I do sometimes feel frustrated when I see you rewriting things I just finished writing/you just reviewed. I try not to take it personally and assume this is partially a part of your learning process but it does feel frustrating and it sometimes seems arbitrary and controlling. Trying not to be a downer, but the feeling is real.

def one_filter_argument(cls, values):
"""Validate that only one argument is specified for :meth:`pd.filter`."""
num_args = sum(
int(bool(val))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe you think this is more clear this way, but this was previously doing the same thing without the int(bool()) here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the double-negatives both in the variable name and the logic used to figure out how many arguments had been provided confusing.

This would work with just the bool(val) but for readability I wanted to make it explicit that the (potentially non-boolean) values were getting converted to bools, and than we were going to treat the bools as numbers for counting.

This arrangement also works correctly with inputs that show up as boolean False, even if they aren't None. E.g. if someone gives an empty list for one of the lists of columns, and a non-empty list for the other, the empty list counts as a False (so a zero) and that set of inputs is actually fine for what the function is doing.


In the case of the fuel_ferc1 table, we drop any row where all the data columns
are null AND there's a non-null value in the ``fuel_mmbtu_per_mwh`` column, as
it typically indicates a "total" row for a plant. We also require a null value
for the fuel_units and an "other" value for the fuel type.

Right now this is extremely stringent and almost all rows are retained.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why remove this? is this no longer correct? I think this is a helpful lil tidbit.

Copy link
Member Author

@zaneselvans zaneselvans Sep 6, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this comment in relation to the idea that this was supposed to be dropping null / invalid records in addition to dropping total rows, but that wasn't really what the function was doing before -- it was only dropping total rows, and keeping about 100,000 functionally null records around.

Now that this is method is just a minor add-on / refinement that comes after the much more sweeping drop_invalid_rows() it didn't seem as weird/noteworthy that it only drops a small number of rows.

(and it also now logs the number of rows that it drops)

src/pudl/transform/classes.py Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Sep 6, 2022

Codecov Report

❗ No coverage uploaded for pull request base (ferc1-transform-phases@9a94f26). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head 77eaf9b differs from pull request most recent head fd20677. Consider uploading reports for the commit fd20677 to get more accurate results

@@                   Coverage Diff                    @@
##             ferc1-transform-phases   #1903   +/-   ##
========================================================
  Coverage                          ?   83.3%           
========================================================
  Files                             ?      65           
  Lines                             ?    7430           
  Branches                          ?       0           
========================================================
  Hits                              ?    6192           
  Misses                            ?    1238           
  Partials                          ?       0           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@zaneselvans zaneselvans merged commit b1f18b5 into ferc1-transform-phases Sep 6, 2022
@zaneselvans zaneselvans deleted the drop-invalid-fuel-rows branch September 6, 2022 18:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
data-cleaning Tasks related to cleaning & regularizing data during ETL. ferc1 Anything having to do with FERC Form 1
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Refine generic table transform architecture
2 participants