Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create ways for HyperTransformer to know which transformers to apply to each data type #232 #239

Merged
merged 9 commits into from
Sep 24, 2021

Conversation

amontanez24
Copy link
Contributor

resolves #232

This PR adds the following attributes to the HyperTransformer

  1. _DTYPES_TO_DATA_TYPES - a dict mapping the pandas dtypes to an RDT data type
  2. DEFAULT_TRANSFORMERS - a dict mapping the RDT data types to a default transformer

It also adds the following static method

  1. get_transformers_by_type - a function that loops through all existing transformers and creates a dict mapping data types to a list of valid transformers for that type.

@amontanez24 amontanez24 changed the base branch from master to update-baseclass September 21, 2021 21:37
@csala csala changed the title RDT - Create ways for HyperTransformer to know which transformers to apply to each data type #232 Create ways for HyperTransformer to know which transformers to apply to each data type #232 Sep 22, 2021
Base automatically changed from update-baseclass to v0.6.0-dev September 22, 2021 21:41
Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a couple of comments to suggest a different approach to this implementation.

}

@staticmethod
def get_transformers_by_type():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would make more sense to put this as a function inside the transformers module, either directly in the __init__ module or in a utils module and then importing it from the __init__.

I imagine the use case where someone simply wants to explore what transformers exist, without using the HyperTransformer

>>> import rdt
>>> rdt.transformers.get_transformers_by_type()
...

type as an input.
"""
data_type_transformers = {}
transformer_classes = inspect.getmembers(sys.modules[__name__], inspect.isclass)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than using inspect, I would navigate the subclasses of BaseTransformer by adding a get_subclasses classmethod to it.

Something similar is done in SDGym: https://github.com/sdv-dev/SDGym/blob/79321b57a1b2dd416426fc9088e1ff771dd82e9c/sdgym/synthesizers/base.py#L17

Also, as a note, when implementing it here I would merge the get_subclasses and get_baselines logic, so only the classes that do not inherit from ABC directly are included in the output. By doing it this way, we can skip any intermediate abstract Transformers that may end up defining.

@@ -73,6 +75,42 @@ class HyperTransformer:
'b': 'boolean',
'M': 'datetime',
}
DEFAULT_TRANSFORMERS = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could also be moved over to the transformers module, and create an additional get_default_transformer(data_type) function that has the functionality of looking up the given data_type on this dictionary and if not found calls the get_transformers_by_type function to choose the first one that it finds that supports it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the idea of calling get_transformers_by_type every time we don't have a defined transformer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'll do is add a function that creates the default dict using the DEFAULT_TRANSFORMERS and get_transformers_by_type dicts, and we can just use that dict in the HyperTransformer

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I'll do is add a function that creates the default dict using the DEFAULT_TRANSFORMERS and get_transformers_by_type dicts, and we can just use that dict in the HyperTransformer

I had not seen this. I agree, calling it every time sounds overkill, but somehow I do not like the idea of also calling the function when the HyperTransformer is imported, or every time an instance is created.

What do you think about using functools.cache?
We could actually have get_default_transformers() which caches its result the first time it is called, but then also have get_default_transformer(transformer), which also caches its results, and that under the hood calls the other one and gets the transformer from the dict.

By doing this, the result is the behavior is the same (we end up having the dictionary stored somewhere instead of building it every time), but its initialization is lazy (the dict gets build on the fly the first time it is used)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Let me push a commit and you can take a look

@@ -73,6 +75,42 @@ class HyperTransformer:
'b': 'boolean',
'M': 'datetime',
}
DEFAULT_TRANSFORMERS = {
'numerical': NumericalTransformer,
'integer': NumericalTransformer(dtype=int),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should create specific transformers for each configuration, so this dictionary can contain classes instead of instances? I'm opening a new issue to discuss this there.

@codecov-commenter
Copy link

codecov-commenter commented Sep 23, 2021

Codecov Report

Merging #239 (9b2c4f6) into v0.6.0-dev (03ecfec) will decrease coverage by 1.46%.
The diff coverage is 50.00%.

Impacted file tree graph

@@              Coverage Diff               @@
##           v0.6.0-dev     #239      +/-   ##
==============================================
- Coverage       93.07%   91.61%   -1.47%     
==============================================
  Files               9        9              
  Lines             650      739      +89     
==============================================
+ Hits              605      677      +72     
- Misses             45       62      +17     
Impacted Files Coverage Δ
rdt/transformers/__init__.py 60.86% <30.76%> (-39.14%) ⬇️
rdt/hyper_transformer.py 100.00% <100.00%> (ø)
rdt/transformers/base.py 84.53% <100.00%> (+1.58%) ⬆️
rdt/transformers/boolean.py 100.00% <0.00%> (ø)
rdt/transformers/null.py 98.07% <0.00%> (+1.92%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 03ecfec...9b2c4f6. Read the comment docs.

@amontanez24 amontanez24 marked this pull request as ready for review September 23, 2021 19:41
@csala csala requested review from katxiao and a team and removed request for a team September 24, 2021 12:06
@@ -25,6 +25,14 @@
transformer.__name__: transformer
for transformer in BaseTransformer.__subclasses__()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could call BaseTransformers.get_subclasses() here, so TRANSFORMERS is a list of all transformers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also could we add 'GaussianCopulaTransformer to the __all__ list above?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's already there

Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! I just added a couple of minor comments.

for transformer in transformer_classes:
try:
input_type = transformer.get_input_type()
transformers_for_type = data_type_transformers.get(input_type, [])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: We could use a defaultdict(list):

data_type_transformers = defaultdict(list)
...
data_type_transformers[input_type].append(transformer)

Or setdefault:

data_type_transformers.setdefault(input_type, []).append(transformer)

for subclass in cls.__subclasses__():
if abc.ABC not in subclass.__bases__:
subclasses.append(subclass)
subclasses += subclass.get_subclasses()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a blank line above this one?

Copy link
Contributor

@katxiao katxiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, meant to request changes.

Copy link
Contributor

@csala csala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great now! I just added a comment about functools dependency. We may need to do nothing about it, though

@@ -1,5 +1,8 @@
"""Transformers module."""

from collections import defaultdict
from functools import lru_cache
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if functools is part of the standard library. If it is not, we should add it to setup.py, even if we already install it because one of our dependencies use it, so if they remove it in the future we do not crash.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is part of the standard library

@amontanez24 amontanez24 merged commit 3be22ab into v0.6.0-dev Sep 24, 2021
@amontanez24 amontanez24 deleted the rdt-232-data-type-transformers branch September 24, 2021 17:06
csala pushed a commit that referenced this pull request Oct 13, 2021
…to each data type #232 (#239)

* adding get_transformers_by_type function

* adding other attributes and fixing typo

* pr comments

* adding default transformers method

* pr comments

* adding caching and some cleanup
amontanez24 added a commit that referenced this pull request Oct 26, 2021
…to each data type #232 (#239)

* adding get_transformers_by_type function

* adding other attributes and fixing typo

* pr comments

* adding default transformers method

* pr comments

* adding caching and some cleanup
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create ways for HyperTransformer to know which transformers to apply to each data type
4 participants