Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support multi-column transformers #692

Merged

Conversation

R-Palazzo
Copy link
Contributor

Resolve #683

@R-Palazzo R-Palazzo requested a review from a team as a code owner August 22, 2023 12:54
@R-Palazzo R-Palazzo removed the request for review from a team August 22, 2023 12:54
data (pandas.DataFrame):
The entire table.
"""
column_names = columns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the point of this assignment? Why not just change the name of the variable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes first I was not sure if we really wanted to modify the column parameter, but I removed my change, it was not necessary

for field in self._input_columns:
data = self._fit_field_transformer(data, field, self.field_transformers[field])
for field_column, field_transformer in self.field_transformers.items():
data = self._fit_field_transformer(data, field_column, field_transformer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is going to work alone. The _fit_field_transformer method doesn't know to pass the columns as a dict in the multi column case

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the multi column case we pass the columns as a tuple or as a dict? I supposed it was a tuple and so it was good for me based on the docstring of _fit_field_transformer(self, data, field, transformer):

field (str or tuple):
                Name of column or tuple of columns in data that will be transformed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the base class you made fit takes in data and a dictionary. It needs to be a dict otherwise the values will always have to be passed in a specific order

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@R-Palazzo I think this is still the case. The issue states that the HyperTransformer should run with a multi-column transformer too.

@R-Palazzo R-Palazzo force-pushed the issue-683-support-multi-column-transformers branch from 1afbe42 to 5a1feb8 Compare August 24, 2023 13:32
@codecov-commenter
Copy link

codecov-commenter commented Sep 4, 2023

Codecov Report

Patch coverage: 100.00% and no project coverage change.

Comparison is base (5d7f8b7) 100.00% compared to head (c0d6294) 100.00%.
Report is 3 commits behind head on multi-column-transformer.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@                    Coverage Diff                     @@
##           multi-column-transformer      #692   +/-   ##
==========================================================
  Coverage                    100.00%   100.00%           
==========================================================
  Files                            17        17           
  Lines                          1799      1862   +63     
==========================================================
+ Hits                           1799      1862   +63     
Files Changed Coverage Δ
rdt/hyper_transformer.py 100.00% <100.00%> (ø)
rdt/transformers/__init__.py 100.00% <100.00%> (ø)
rdt/transformers/base.py 100.00% <100.00%> (ø)
rdt/transformers/utils.py 100.00% <100.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@R-Palazzo R-Palazzo force-pushed the issue-683-support-multi-column-transformers branch from ce0e2a2 to 054d9eb Compare September 7, 2023 07:42
@R-Palazzo R-Palazzo requested review from pvk-developer and removed request for frances-h September 8, 2023 08:12
for field in self._input_columns:
data = self._fit_field_transformer(data, field, self.field_transformers[field])
for field_column, field_transformer in self.field_transformers.items():
data = self._fit_field_transformer(data, field_column, field_transformer)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@R-Palazzo I think this is still the case. The issue states that the HyperTransformer should run with a multi-column transformer too.

def test_multi_column_transformer_same_number_of_columns_input_output():
"""Test a multi-column transformer when the same of input and output columns."""
# Setup
class AdditionTransformer(BaseMultiColumnTransformer):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should actually test this example from the HyperTransformer and make sure it really works end to end

Comment on lines 540 to 543
if is_prefix_dict:
prefix = self.prefixes[output_column]
else:
prefix = self.prefixes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this allowing multiple types for the prefixes? I think it could be complicated if developers have to set custom prefixes and output names in 2 places. Maybe it would be simpler to say that if you want to have custom prefixes for each output name, then just use custom output names and set the prefix to None.

In other words, _get_prefix always returns None or a string. And then you can just do the lines below

missing = set(columns_to_sdtypes.keys()) - set(data.columns)
if missing:
missing_to_print = ', '.join(missing)
raise KeyError(f'Columns ({missing_to_print}) are not present in the data.')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be a KeyError. I think that's really only for if the accessed object doesn't have the key. I think maybe a ValueError makes sense since columns_to_sdtypes has the wrong values

transformer._get_prefix.assert_called_once_with()

def test__validate_columns_to_sdtypes(self):
"""Test the ``_validate_columns_to_sdtypes`` method."""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick: for these descriptions I think it's more useful to say what's being tested. eg

Test that an error with the missing column names is raised.

@R-Palazzo R-Palazzo changed the base branch from master to multi-column-transformer September 11, 2023 14:54
Copy link
Member

@pvk-developer pvk-developer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

columns_to_sdtypes = {}
for column in field:
columns_to_sdtypes[column] = self.field_sdtypes[column]
return columns_to_sdtypes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Break line above would be appreciated.

Copy link
Contributor

@amontanez24 amontanez24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for addressing

@R-Palazzo R-Palazzo merged commit 89d871d into multi-column-transformer Sep 12, 2023
46 checks passed
@R-Palazzo R-Palazzo deleted the issue-683-support-multi-column-transformers branch September 12, 2023 15:55
R-Palazzo added a commit that referenced this pull request Sep 14, 2023
* def + store_columns

* hypertransformer modif 1

* unit test

* docstring

* drop _store_columns change

* add integration_test + _get_output_to_property

* ordered_columns + prefix

* 'int' in integration test

* change prefix + columns_to_sdtype

* docstring

* columns_to_sdtypes

* validate column_to_sdtype ValueError + docstring

* fix _get_prefix

* move PR, hypertransformer with multi column

* break line
R-Palazzo added a commit that referenced this pull request Sep 20, 2023
* def + store_columns

* hypertransformer modif 1

* unit test

* docstring

* drop _store_columns change

* add integration_test + _get_output_to_property

* ordered_columns + prefix

* 'int' in integration test

* change prefix + columns_to_sdtype

* docstring

* columns_to_sdtypes

* validate column_to_sdtype ValueError + docstring

* fix _get_prefix

* move PR, hypertransformer with multi column

* break line
R-Palazzo added a commit that referenced this pull request Oct 27, 2023
* def + store_columns

* hypertransformer modif 1

* unit test

* docstring

* drop _store_columns change

* add integration_test + _get_output_to_property

* ordered_columns + prefix

* 'int' in integration test

* change prefix + columns_to_sdtype

* docstring

* columns_to_sdtypes

* validate column_to_sdtype ValueError + docstring

* fix _get_prefix

* move PR, hypertransformer with multi column

* break line
R-Palazzo added a commit that referenced this pull request Oct 31, 2023
* def + store_columns

* hypertransformer modif 1

* unit test

* docstring

* drop _store_columns change

* add integration_test + _get_output_to_property

* ordered_columns + prefix

* 'int' in integration test

* change prefix + columns_to_sdtype

* docstring

* columns_to_sdtypes

* validate column_to_sdtype ValueError + docstring

* fix _get_prefix

* move PR, hypertransformer with multi column

* break line
R-Palazzo added a commit that referenced this pull request Oct 31, 2023
* def + store_columns

* hypertransformer modif 1

* unit test

* docstring

* drop _store_columns change

* add integration_test + _get_output_to_property

* ordered_columns + prefix

* 'int' in integration test

* change prefix + columns_to_sdtype

* docstring

* columns_to_sdtypes

* validate column_to_sdtype ValueError + docstring

* fix _get_prefix

* move PR, hypertransformer with multi column

* break line
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support multi-column transformers
4 participants