-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support multi-column transformers #692
Support multi-column transformers #692
Conversation
rdt/transformers/base.py
Outdated
data (pandas.DataFrame): | ||
The entire table. | ||
""" | ||
column_names = columns |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the point of this assignment? Why not just change the name of the variable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes first I was not sure if we really wanted to modify the column parameter, but I removed my change, it was not necessary
for field in self._input_columns: | ||
data = self._fit_field_transformer(data, field, self.field_transformers[field]) | ||
for field_column, field_transformer in self.field_transformers.items(): | ||
data = self._fit_field_transformer(data, field_column, field_transformer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is going to work alone. The _fit_field_transformer
method doesn't know to pass the columns as a dict in the multi column case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the multi column case we pass the columns as a tuple or as a dict? I supposed it was a tuple and so it was good for me based on the docstring of _fit_field_transformer(self, data, field, transformer)
:
field (str or tuple):
Name of column or tuple of columns in data that will be transformed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the base class you made fit takes in data and a dictionary. It needs to be a dict otherwise the values will always have to be passed in a specific order
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@R-Palazzo I think this is still the case. The issue states that the HyperTransformer
should run with a multi-column transformer too.
1afbe42
to
5a1feb8
Compare
Codecov ReportPatch coverage:
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## multi-column-transformer #692 +/- ##
==========================================================
Coverage 100.00% 100.00%
==========================================================
Files 17 17
Lines 1799 1862 +63
==========================================================
+ Hits 1799 1862 +63
☔ View full report in Codecov by Sentry. |
ce0e2a2
to
054d9eb
Compare
for field in self._input_columns: | ||
data = self._fit_field_transformer(data, field, self.field_transformers[field]) | ||
for field_column, field_transformer in self.field_transformers.items(): | ||
data = self._fit_field_transformer(data, field_column, field_transformer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@R-Palazzo I think this is still the case. The issue states that the HyperTransformer
should run with a multi-column transformer too.
def test_multi_column_transformer_same_number_of_columns_input_output(): | ||
"""Test a multi-column transformer when the same of input and output columns.""" | ||
# Setup | ||
class AdditionTransformer(BaseMultiColumnTransformer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should actually test this example from the HyperTransformer
and make sure it really works end to end
rdt/transformers/base.py
Outdated
if is_prefix_dict: | ||
prefix = self.prefixes[output_column] | ||
else: | ||
prefix = self.prefixes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this allowing multiple types for the prefixes? I think it could be complicated if developers have to set custom prefixes and output names in 2 places. Maybe it would be simpler to say that if you want to have custom prefixes for each output name, then just use custom output names and set the prefix to None.
In other words, _get_prefix
always returns None or a string. And then you can just do the lines below
rdt/transformers/base.py
Outdated
missing = set(columns_to_sdtypes.keys()) - set(data.columns) | ||
if missing: | ||
missing_to_print = ', '.join(missing) | ||
raise KeyError(f'Columns ({missing_to_print}) are not present in the data.') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this should be a KeyError
. I think that's really only for if the accessed object doesn't have the key. I think maybe a ValueError
makes sense since columns_to_sdtypes
has the wrong values
tests/unit/transformers/test_base.py
Outdated
transformer._get_prefix.assert_called_once_with() | ||
|
||
def test__validate_columns_to_sdtypes(self): | ||
"""Test the ``_validate_columns_to_sdtypes`` method.""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick: for these descriptions I think it's more useful to say what's being tested. eg
Test that an error with the missing column names is raised.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
columns_to_sdtypes = {} | ||
for column in field: | ||
columns_to_sdtypes[column] = self.field_sdtypes[column] | ||
return columns_to_sdtypes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Break line above would be appreciated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for addressing
* def + store_columns * hypertransformer modif 1 * unit test * docstring * drop _store_columns change * add integration_test + _get_output_to_property * ordered_columns + prefix * 'int' in integration test * change prefix + columns_to_sdtype * docstring * columns_to_sdtypes * validate column_to_sdtype ValueError + docstring * fix _get_prefix * move PR, hypertransformer with multi column * break line
* def + store_columns * hypertransformer modif 1 * unit test * docstring * drop _store_columns change * add integration_test + _get_output_to_property * ordered_columns + prefix * 'int' in integration test * change prefix + columns_to_sdtype * docstring * columns_to_sdtypes * validate column_to_sdtype ValueError + docstring * fix _get_prefix * move PR, hypertransformer with multi column * break line
* def + store_columns * hypertransformer modif 1 * unit test * docstring * drop _store_columns change * add integration_test + _get_output_to_property * ordered_columns + prefix * 'int' in integration test * change prefix + columns_to_sdtype * docstring * columns_to_sdtypes * validate column_to_sdtype ValueError + docstring * fix _get_prefix * move PR, hypertransformer with multi column * break line
* def + store_columns * hypertransformer modif 1 * unit test * docstring * drop _store_columns change * add integration_test + _get_output_to_property * ordered_columns + prefix * 'int' in integration test * change prefix + columns_to_sdtype * docstring * columns_to_sdtypes * validate column_to_sdtype ValueError + docstring * fix _get_prefix * move PR, hypertransformer with multi column * break line
* def + store_columns * hypertransformer modif 1 * unit test * docstring * drop _store_columns change * add integration_test + _get_output_to_property * ordered_columns + prefix * 'int' in integration test * change prefix + columns_to_sdtype * docstring * columns_to_sdtypes * validate column_to_sdtype ValueError + docstring * fix _get_prefix * move PR, hypertransformer with multi column * break line
Resolve #683