Add UniformEncoder (and its ordered version) #681

R-Palazzo · 2023-08-10T15:14:31Z

Resolve #678
Compared to the other version, a few changes were necessary in order to:

get 100% coverage (2 lines were missing)
make the minimum version work
make the UniformEncoder work with the pandas category dtype

In this PR, the UniformEncoder is set to be the default transformer for categorical and boolean data.
The first commit only moves the files; the later commits contain the fixes needed to make it work in RDT.

fealho

Looking good, a few comments/questions.

fealho · 2023-08-10T15:43:34Z

tests/performance/test_performance.py

 from rdt.transformers.numerical import ClusterBasedNormalizer

-SANDBOX_TRANSFORMERS = [ClusterBasedNormalizer, OrderedLabelEncoder, CustomLabelEncoder]
+SANDBOX_TRANSFORMERS = [
+    ClusterBasedNormalizer, OrderedLabelEncoder, CustomLabelEncoder, OrderedUniformEncoder


Why does OrderedUniformEncoder need to be sandboxed?

Because otherwise the performance workflow crashes with this TypeError:
FAILED tests/performance/test_performance.py::test_performance[OrderedUniformEncoder-UniqueStringNaNsGenerator] - TypeError: __init__() missing 1 required positional argument: 'order'
I think this is the same reason the OrderedLabelEncoder is sandboxed.
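
For readers following the thread, here is a minimal sketch of the failure mode described above. It assumes the performance tests build each transformer with no arguments, which is what the error message suggests; the classes `LabelEncoderSketch` and `OrderedEncoderSketch` and the helper `instantiate_all` are hypothetical stand-ins, not RDT code.

```python
class LabelEncoderSketch:
    """Hypothetical transformer whose constructor takes no required arguments."""

    def __init__(self):
        self.order = None


class OrderedEncoderSketch:
    """Hypothetical transformer that, like the ordered encoders, requires `order`."""

    def __init__(self, order):
        self.order = list(order)


def instantiate_all(transformer_classes):
    """Try to build every class with no arguments and record what happens."""
    results = {}
    for cls in transformer_classes:
        try:
            cls()
            results[cls.__name__] = 'ok'
        except TypeError as error:
            # A required positional argument makes the bare call fail.
            results[cls.__name__] = f'TypeError: {error}'
    return results


results = instantiate_all([LabelEncoderSketch, OrderedEncoderSketch])
for name, outcome in results.items():
    print(name, '->', outcome)
```

Under that assumption, any transformer with a required constructor argument has to be excluded from the generic run or be given its arguments explicitly.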

fealho · 2023-08-10T15:46:58Z

tests/unit/transformers/test_categorical.py

        transformer = OrderedUniformEncoder(order=[2, 1])

        # Run / Assert
+        transformer._fit(data)


I think it makes sense to move the _fit/_transform into their own test and leave this one as it was, otherwise this one gets confusing.

Yes I agree, reverted the change in df0ad68.

fealho · 2023-08-10T15:48:05Z

tests/unit/transformers/test_categorical.py

        # Setup
-        data = pd.Series([1, 2, 3, 2, np.nan, 1, 1])
-        transformer = OrderedUniformEncoder(order=[2, 3, np.nan, 1])
+        data = pd.Series([1, 2, 3, 2, None, 1, 1])


Why did you change this to None instead of np.nan?

Yes, that turned out not to be necessary, good catch. Reverted in df0ad68.

fealho · 2023-08-10T15:51:48Z

rdt/transformers/categorical.py

@@ -264,7 +264,10 @@ def _fit(self, data):
        else:
            freq = data.value_counts(normalize=True, dropna=False)

+        nan_value = freq[np.nan] if np.nan in freq.index else None


I'm not sure this works with other types of nans, like float('nan')

Maybe, but this should not happen because freq is defined by data.value_counts(), no?

I was concerned about freq.index containing NaNs that are not np.nan. I'm not sure whether that can actually happen.
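
The general concern can be illustrated in plain Python, without pandas. This is only a sketch: a pandas Index may treat NaN lookups differently from a dict, so it shows why the question is reasonable, not how `freq.index` behaves. The helper `nan_frequency` is hypothetical and is not the fix used in this PR.

```python
import math


def nan_frequency(freq):
    """Return the summed frequency of all missing-value keys, or None if absent."""
    total = None
    for key, value in freq.items():
        is_missing = key is None or (isinstance(key, float) and math.isnan(key))
        if is_missing:
            total = value if total is None else total + value
    return total


nan_a = float('nan')
nan_b = float('nan')  # a second, distinct NaN object

# Frequencies keyed by category, with one NaN key.
freq = {1: 0.5, 2: 0.25, nan_a: 0.25}

# Membership succeeds only for the very same NaN object, because NaN != NaN
# and the container falls back on identity.
found_same_object = nan_a in freq
found_other_object = nan_b in freq

print(found_same_object, found_other_object, nan_frequency(freq))
```

Checking with `math.isnan` (or an `isna`-style test) avoids depending on which NaN object ended up as the key.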

codecov-commenter · 2023-08-10T16:34:22Z

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (74f20ac) 100.00% compared to head (712fe62) 100.00%.

Additional details and impacted files

@@            Coverage Diff             @@
##            master      #681    +/-   ##
==========================================
  Coverage   100.00%   100.00%            
==========================================
  Files           17        17            
  Lines         1660      1774   +114     
==========================================
+ Hits          1660      1774   +114

Files Changed	Coverage Δ
rdt/transformers/__init__.py	`100.00% <ø> (ø)`
rdt/transformers/categorical.py	`100.00% <100.00%> (ø)`
rdt/transformers/utils.py	`100.00% <100.00%> (ø)`

R-Palazzo · 2023-08-10T16:39:21Z

Thanks for your review @fealho ;)

rdt/transformers/categorical.py

amontanez24

LGTM!

fealho

Thanks for addressing 👍

R-Palazzo added 7 commits August 3, 2023 10:40

move unifrom and ordered uniform encoder

1b410c7

performance

4a917a8

100% coverage

2d81f68

test minimum version

21e29a9

freq and nans for minimum version

1a0e24b

make UniformEncoder the default for cat and boolea

83f4112

test minimum version

f95d925

R-Palazzo requested review from amontanez24 and fealho August 10, 2023 15:14

R-Palazzo requested a review from a team as a code owner August 10, 2023 15:14

R-Palazzo removed the request for review from a team August 10, 2023 15:14

fealho reviewed Aug 10, 2023

address comments

df0ad68

amontanez24 reviewed Aug 11, 2023

rdt/transformers/categorical.py (outdated, resolved)

R-Palazzo added 2 commits August 11, 2023 12:29

add test for coverage

754b8b2

use fill_value

712fe62

amontanez24 approved these changes Aug 11, 2023

fealho self-requested a review August 14, 2023 16:50

fealho approved these changes Aug 14, 2023

R-Palazzo merged commit d7dccc9 into master Aug 14, 2023
46 checks passed

R-Palazzo deleted the issue-678-add-uniform-encoder branch August 14, 2023 16:56
