-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update NullTransformer to make it user friendly #384
Update NullTransformer to make it user friendly #384
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. I left a few comments
rdt/transformers/null.py
Outdated
|
||
def fit(self, data): | ||
def fit(self, data, column): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the only reason we are passing this is for the warning message then I'm not sure we need it. Couldn't we get the column name by doing data.name
?
rdt/transformers/null.py
Outdated
|
||
def creates_null_column(self): | ||
def creates_model_missing_values(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: I'd call this models_missing_values
rdt/transformers/boolean.py
Outdated
@@ -16,10 +16,10 @@ class BooleanTransformer(BaseTransformer): | |||
Null values are replaced using a ``NullTransformer``. | |||
|
|||
Args: | |||
nan (int or None): | |||
missing_value_replacement (int or None): | |||
Replace null values with the given value. If ``None``, do not replace them. | |||
Defaults to ``-1``. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't true anymore right?
rdt/transformers/boolean.py
Outdated
@@ -16,10 +16,10 @@ class BooleanTransformer(BaseTransformer): | |||
Null values are replaced using a ``NullTransformer``. | |||
|
|||
Args: | |||
nan (int or None): | |||
missing_value_replacement (int or None): | |||
Replace null values with the given value. If ``None``, do not replace them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd update this to match what's in the NullTransformer
as well (ie. explain mode
and so on)
Codecov Report
@@ Coverage Diff @@
## v1.0.0.dev #384 +/- ##
============================================
Coverage 100.00% 100.00%
============================================
Files 12 12
Lines 942 935 -7
============================================
- Hits 942 935 -7
Continue to review full report at Codecov.
|
5901e9d
to
1fd685b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR itself looks good, but I think you need to do another pass through the docstrings. I commented on the errors I saw, but you should probably double check the rest of the docstrings as well.
rdt/transformers/datetime.py
Outdated
@@ -180,18 +182,19 @@ class DatetimeRoundedTransformer(DatetimeTransformer): | |||
This class behaves exactly as the ``DatetimeTransformer`` with ``strip_constant=True``. | |||
|
|||
Args: | |||
nan (int, str or None): | |||
missing_value_replacement (int, str or None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should put object everywhere.
rdt/transformers/datetime.py
Outdated
Indicate what to do with the null values. If an integer is given, replace them | ||
with the given value. If the strings ``'mean'`` or ``'mode'`` are given, replace | ||
them with the corresponding aggregation. If ``None`` is given, do not replace them. | ||
Defaults to ``'mean'``. | ||
null_column (bool): | ||
model_missing_values (bool): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete the None
docstring.
rdt/transformers/null.py
Outdated
self.null_column = null_column | ||
self.copy = copy | ||
def __init__(self, missing_value_replacement=None, model_missing_values=False): | ||
self.missing_value_replacement = missing_value_replacement |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have an underscore in front(since you use self._missing_value_replacement
around the code).
- First column of the input array rounded, replacing the indicated value with a Nan, | ||
and kept as float values. | ||
- First column of the input array rounded, replacing the indicated value with a | ||
missing_value_replacement, and kept as float values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dont change this.
"""Test the ``_transform`` method. | ||
|
||
Validate that ``_transform`` produces the correct values when ``null_column`` is None. | ||
Validate that ``_transform`` produces the correct values when ``model_missing_values`` | ||
is None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not None, False.
@@ -1152,7 +1249,7 @@ def test__transform_null_column_none(self): | |||
ct = GaussianCopulaTransformer() | |||
ct._univariate = Mock() | |||
ct._univariate.cdf.return_value = np.array([0.25, 0.5, 0.75, 0.5]) | |||
ct.null_transformer = NullTransformer(None, null_column=None) | |||
ct.null_transformer = NullTransformer('mean', model_missing_values=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not None.
@@ -1228,7 +1328,7 @@ def test__reverse_transform_null_column_none(self): | |||
ct = GaussianCopulaTransformer() | |||
ct._univariate = Mock() | |||
ct._univariate.ppf.return_value = np.array([0.0, 1.0, 2.0, 1.0]) | |||
ct.null_transformer = NullTransformer(None, null_column=None) | |||
ct.null_transformer = NullTransformer(None, model_missing_values=None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not None
rdt/transformers/null.py
Outdated
_null_column = None | ||
_fill_value = None | ||
_model_missing_values = None | ||
_missing_value_replacement = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought all variables should be set to None
here? I'm not sure though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor comment but once you finish addressing that and the other comments it should be good to go!
rdt/transformers/boolean.py
Outdated
If ``False``, do not create the new column. | ||
missing_value_replacement (object or None): | ||
Indicate what to do with the null values. If an object is given, replace them | ||
with the given value. If the strings ``'mean'`` or ``'mode'`` are given, replace |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess we should remove mean
since that doesn't really make sense for booleans
@@ -129,7 +133,8 @@ def test__learn_rounding_digits_less_than_15_decimals(self): | |||
should be returned. | |||
|
|||
Input: | |||
- An array that contains floats with a maximum of 3 decimals and a NaN. | |||
- An array that contains floats with a maximum of 3 decimals and a | |||
missing_value_replacement. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After you put the NaN
's back I think it's good to go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few more NaN's to change back.
@@ -594,8 +648,8 @@ def test__reverse_transform_rounding_none_with_nulls_dtype_int(self): | |||
Input: | |||
- 2d Array of multiple float values with decimals and a column setting at least 1 null. | |||
Output: | |||
- First column of the input array rounded, replacing the indicated value with a Nan, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here.
@@ -177,13 +182,13 @@ def test__learn_rounding_digits_negative_decimals_integer(self): | |||
|
|||
assert output == -1 | |||
|
|||
def test__learn_rounding_digits_all_nans(self): | |||
"""Test the _learn_rounding_digits method with data that is all NaNs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
|
||
If the data is all NaNs, expect that the output is None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
|
||
Input: | ||
- An array of NaNs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
NullTransformer
With the current
PR
theNullTransformer
contains the following changes:fill_value
ornan
becomes ->missing_value_replacement
.null_column
becomes ->model_missing_values
.missing_value_replacement
By default this value will be
None
which will not replace the values. From python perspective this will returnnp.nan
which should be the representation ofNull
.The second strategy is the user specifying a string
mode
ormean
, which will use themode
of the data or themean
value (if possible) as a replacement of thenull
values that are in the data.The third strategy is the user specifying a value for it which will be used to fill the null values with.
model_missing_values
By default this value will be
False
. If set toTrue
it will generate a new column representing if there was a missing value or not. In the case that the data had no missing values this column will not be generated.Resolves #372 , #362 , #376