-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add UniformEncoder (and its ordered version) #681
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, a few comments/questions.
from rdt.transformers.numerical import ClusterBasedNormalizer | ||
|
||
SANDBOX_TRANSFORMERS = [ClusterBasedNormalizer, OrderedLabelEncoder, CustomLabelEncoder] | ||
SANDBOX_TRANSFORMERS = [ | ||
ClusterBasedNormalizer, OrderedLabelEncoder, CustomLabelEncoder, OrderedUniformEncoder |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does OrderedUniformEncoder
need to be sandboxed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's because otherwise the perfomance workflow is crashing with this type error:
FAILED tests/performance/test_performance.py::test_performance[OrderedUniformEncoder-UniqueStringNaNsGenerator] - TypeError: __init__() missing 1 required positional argument: 'order'
I think this is the same reason why the OrderedLabelEncoder
is sandboxed
transformer = OrderedUniformEncoder(order=[2, 1]) | ||
|
||
# Run / Assert | ||
transformer._fit(data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to move the _fit/_transform into their own test and leave this one as it was, otherwise this one gets confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I agree, reverted the change in df0ad68.
# Setup | ||
data = pd.Series([1, 2, 3, 2, np.nan, 1, 1]) | ||
transformer = OrderedUniformEncoder(order=[2, 3, np.nan, 1]) | ||
data = pd.Series([1, 2, 3, 2, None, 1, 1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you change this to None instead of np.nan?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes that was not necessary at the end haha good catch, reverted in df0ad68
@@ -264,7 +264,10 @@ def _fit(self, data): | |||
else: | |||
freq = data.value_counts(normalize=True, dropna=False) | |||
|
|||
nan_value = freq[np.nan] if np.nan in freq.index else None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this works with other types of nans, like float('nan')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe but this should not happen because freq is defined by data.value_counts()
, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was concerned with the freq.index having nans in there which are not np.nan. I'm not sure if it's impossible for that to happen.
Codecov ReportPatch coverage:
❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more. Additional details and impacted files@@ Coverage Diff @@
## master #681 +/- ##
==========================================
Coverage 100.00% 100.00%
==========================================
Files 17 17
Lines 1660 1774 +114
==========================================
+ Hits 1660 1774 +114
☔ View full report in Codecov by Sentry. |
Thanks for your review @fealho ;) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing 👍
Resolve #678
Compared to the other version, a few changes were necessary in order to:
pd.category
dtypeIn this PR, the
UniformEncoder
is set to be the default transformer forcategorical
andboolean
data.The 1st commit is only moving the files, the other ones made the fixes to make it work on
RDT