-
-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement drop_invalid_rows() for fuel_ferc1 table #1903
Conversation
Create a fuel_ferc1 table specific drop_invalid_rows method, which both makes use of the parameterized method inherited from the AbstractTableTransformer, and adds a separate method for identifying rows which we believe to be plant totals, rather than rows that pertain to individual fuels. This results in dropping more than 2/3 of all the records in the fuel_ferc1 table. I also made some name changes, trying to be a little more standard and (hopefully) informative: * remove_invalid_rows => drop_invalid_rows, since there were already several methods defined in the FERC 1 table transformers that do similar things in more specific circumstances, and they were named drop_*_cols or drop_*_rows. * cols_to_check => required_valid_cols and cols_to_not_check => allowed_invalid_cols to provide some indication as to what is being checked for (validity / invalidity) in relation to the other transform parameter (invalid_values) * DropInvalidRows => InvalidRows to follow the convention of the other TransformParams which are nouns describing their contents, while the methods & functions that they parameterize are verbs/actions. * Moved the documentation of the InvalidRows parameters into the class definition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good overall! I really like the move to compile the transform_params
inside of the classes instead of having them be a global variable hanging around.
There's nothing to change code wise from this comment... but I do sometimes feel frustrated when I see you rewriting things I just finished writing/you just reviewed. I try not to take it personally and assume this is partially a part of your learning process but it does feel frustrating and it sometimes seems arbitrary and controlling. Trying not to be a downer, but the feeling is real.
def one_filter_argument(cls, values): | ||
"""Validate that only one argument is specified for :meth:`pd.filter`.""" | ||
num_args = sum( | ||
int(bool(val)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe you think this is more clear this way, but this was previously doing the same thing without the int(bool())
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the double-negatives both in the variable name and the logic used to figure out how many arguments had been provided confusing.
This would work with just the bool(val)
but for readability I wanted to make it explicit that the (potentially non-boolean) values were getting converted to bools, and than we were going to treat the bools as numbers for counting.
This arrangement also works correctly with inputs that show up as boolean False
, even if they aren't None
. E.g. if someone gives an empty list for one of the lists of columns, and a non-empty list for the other, the empty list counts as a False (so a zero) and that set of inputs is actually fine for what the function is doing.
|
||
In the case of the fuel_ferc1 table, we drop any row where all the data columns | ||
are null AND there's a non-null value in the ``fuel_mmbtu_per_mwh`` column, as | ||
it typically indicates a "total" row for a plant. We also require a null value | ||
for the fuel_units and an "other" value for the fuel type. | ||
|
||
Right now this is extremely stringent and almost all rows are retained. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why remove this? is this no longer correct? I think this is a helpful lil tidbit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made this comment in relation to the idea that this was supposed to be dropping null / invalid records in addition to dropping total rows, but that wasn't really what the function was doing before -- it was only dropping total rows, and keeping about 100,000 functionally null records around.
Now that this is method is just a minor add-on / refinement that comes after the much more sweeping drop_invalid_rows()
it didn't seem as weird/noteworthy that it only drops a small number of rows.
(and it also now logs the number of rows that it drops)
Codecov Report
@@ Coverage Diff @@
## ferc1-transform-phases #1903 +/- ##
========================================================
Coverage ? 83.3%
========================================================
Files ? 65
Lines ? 7430
Branches ? 0
========================================================
Hits ? 6192
Misses ? 1238
Partials ? 0 Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. |
Create a
fuel_ferc1
table-specificdrop_invalid_rows()
method, which both makes use of the parameterized method inherited from theAbstractTableTransformer
, and adds a separate method for identifying rows which we believe to be plant totals, rather than rows that pertain to individual fuels.This results in dropping more than 2/3 of all the records in the
fuel_ferc1
table (around 100,000 rows).I also made some name changes, trying to be a little more standard and (hopefully) informative:
remove_invalid_rows()
=>drop_invalid_rows()
, since there were already several methods defined in the FERC 1 table transformers that do similar things in more specific circumstances, and they were nameddrop_*_cols()
ordrop_*_rows()
.cols_to_check
=>required_valid_cols
andcols_to_not_check
=>allowed_invalid_cols
to provide some indication as to what is being checked for (validity / invalidity) in relation to the other transform parameter (invalid_values
). These could still be improved though I think.DropInvalidRows
=>InvalidRows
to follow the convention of the otherTransformParams
which are nouns describing their contents, while the names of the methods & functions that they parameterize describe the actions they take with verbs.InvalidRows
parameters into the class definition.Closes #1853