Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update NullTransformer to make it user friendly #384

Merged
merged 21 commits into from
Feb 10, 2022

Conversation

pvk-developer
Copy link
Member

NullTransformer

With the current PR the NullTransformer contains the following changes:

  • Always works with a copy of the data (no changes occur to the original data).
  • Adapt unit tests and integration tests.
  • New user friendly names for the fill value / nan and null column:
    • fill_value or nan becomes -> missing_value_replacement.
    • null_column becomes -> model_missing_values.

missing_value_replacement

By default this value will be None which will not replace the values. From python perspective this will return np.nan which should be the representation of Null.

The second strategy is the user specifying a string mode or mean , which will use the mode of the data or the mean value (if possible) as a replacement of the null values that are in the data.

The third strategy is the user specifying a value for it which will be used to fill the null values with.

model_missing_values

By default this value will be False. If set to True it will generate a new column representing if there was a missing value or not. In the case that the data had no missing values this column will not be generated.

Resolves #372 , #362 , #376

@pvk-developer pvk-developer requested a review from a team as a code owner February 7, 2022 12:56
@pvk-developer pvk-developer requested review from fealho and amontanez24 and removed request for a team February 7, 2022 12:56
Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. I left a few comments


def fit(self, data):
def fit(self, data, column):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the only reason we are passing this is for the warning message then I'm not sure we need it. Couldn't we get the column name by doing data.name?


def creates_null_column(self):
def creates_model_missing_values(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I'd call this models_missing_values

tests/unit/transformers/test_base.py Show resolved Hide resolved
@@ -16,10 +16,10 @@ class BooleanTransformer(BaseTransformer):
Null values are replaced using a ``NullTransformer``.

Args:
nan (int or None):
missing_value_replacement (int or None):
Replace null values with the given value. If ``None``, do not replace them.
Defaults to ``-1``.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't true anymore right?

rdt/performance/performance.py Show resolved Hide resolved
@@ -16,10 +16,10 @@ class BooleanTransformer(BaseTransformer):
Null values are replaced using a ``NullTransformer``.

Args:
nan (int or None):
missing_value_replacement (int or None):
Replace null values with the given value. If ``None``, do not replace them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd update this to match what's in the NullTransformer as well (ie. explain mode and so on)

@codecov-commenter
Copy link

codecov-commenter commented Feb 8, 2022

Codecov Report

Merging #384 (ec32d26) into v1.0.0.dev (9507870) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##           v1.0.0.dev      #384   +/-   ##
============================================
  Coverage      100.00%   100.00%           
============================================
  Files              12        12           
  Lines             942       935    -7     
============================================
- Hits              942       935    -7     
Impacted Files Coverage Δ
rdt/transformers/base.py 100.00% <100.00%> (ø)
rdt/transformers/boolean.py 100.00% <100.00%> (ø)
rdt/transformers/datetime.py 100.00% <100.00%> (ø)
rdt/transformers/null.py 100.00% <100.00%> (ø)
rdt/transformers/numerical.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9507870...ec32d26. Read the comment docs.

@pvk-developer pvk-developer force-pushed the issue_372_update_NullTransformer_to_new_api branch from 5901e9d to 1fd685b Compare February 8, 2022 20:03
Copy link
Member

@fealho fealho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR itself looks good, but I think you need to do another pass through the docstrings. I commented on the errors I saw, but you should probably double check the rest of the docstrings as well.

rdt/transformers/boolean.py Outdated Show resolved Hide resolved
rdt/transformers/datetime.py Show resolved Hide resolved
@@ -180,18 +182,19 @@ class DatetimeRoundedTransformer(DatetimeTransformer):
This class behaves exactly as the ``DatetimeTransformer`` with ``strip_constant=True``.

Args:
nan (int, str or None):
missing_value_replacement (int, str or None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should put object everywhere.

Indicate what to do with the null values. If an integer is given, replace them
with the given value. If the strings ``'mean'`` or ``'mode'`` are given, replace
them with the corresponding aggregation. If ``None`` is given, do not replace them.
Defaults to ``'mean'``.
null_column (bool):
model_missing_values (bool):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete the None docstring.

self.null_column = null_column
self.copy = copy
def __init__(self, missing_value_replacement=None, model_missing_values=False):
self.missing_value_replacement = missing_value_replacement
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should have an underscore in front(since you use self._missing_value_replacement around the code).

- First column of the input array rounded, replacing the indicated value with a Nan,
and kept as float values.
- First column of the input array rounded, replacing the indicated value with a
missing_value_replacement, and kept as float values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dont change this.

"""Test the ``_transform`` method.

Validate that ``_transform`` produces the correct values when ``null_column`` is None.
Validate that ``_transform`` produces the correct values when ``model_missing_values``
is None.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not None, False.

@@ -1152,7 +1249,7 @@ def test__transform_null_column_none(self):
ct = GaussianCopulaTransformer()
ct._univariate = Mock()
ct._univariate.cdf.return_value = np.array([0.25, 0.5, 0.75, 0.5])
ct.null_transformer = NullTransformer(None, null_column=None)
ct.null_transformer = NullTransformer('mean', model_missing_values=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not None.

@@ -1228,7 +1328,7 @@ def test__reverse_transform_null_column_none(self):
ct = GaussianCopulaTransformer()
ct._univariate = Mock()
ct._univariate.ppf.return_value = np.array([0.0, 1.0, 2.0, 1.0])
ct.null_transformer = NullTransformer(None, null_column=None)
ct.null_transformer = NullTransformer(None, model_missing_values=None)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not None

_null_column = None
_fill_value = None
_model_missing_values = None
_missing_value_replacement = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought all variables should be set to None here? I'm not sure though.

Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One minor comment but once you finish addressing that and the other comments it should be good to go!

If ``False``, do not create the new column.
missing_value_replacement (object or None):
Indicate what to do with the null values. If an object is given, replace them
with the given value. If the strings ``'mean'`` or ``'mode'`` are given, replace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should remove mean since that doesn't really make sense for booleans

@@ -129,7 +133,8 @@ def test__learn_rounding_digits_less_than_15_decimals(self):
should be returned.

Input:
- An array that contains floats with a maximum of 3 decimals and a NaN.
- An array that contains floats with a maximum of 3 decimals and a
missing_value_replacement.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After you put the NaN's back I think it's good to go.

Copy link
Member

@fealho fealho left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few more NaN's to change back.

@@ -594,8 +648,8 @@ def test__reverse_transform_rounding_none_with_nulls_dtype_int(self):
Input:
- 2d Array of multiple float values with decimals and a column setting at least 1 null.
Output:
- First column of the input array rounded, replacing the indicated value with a Nan,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here.

@@ -177,13 +182,13 @@ def test__learn_rounding_digits_negative_decimals_integer(self):

assert output == -1

def test__learn_rounding_digits_all_nans(self):
"""Test the _learn_rounding_digits method with data that is all NaNs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here


If the data is all NaNs, expect that the output is None.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here


Input:
- An array of NaNs.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants