-
Notifications
You must be signed in to change notification settings - Fork 24
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Create ways for HyperTransformer to know which transformers to apply to each data type #232 #239
Conversation
1b3c158
to
57a1932
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a couple of comments to suggest a different approach to this implementation.
rdt/hyper_transformer.py
Outdated
} | ||
|
||
@staticmethod | ||
def get_transformers_by_type(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would make more sense to put this as a function inside the transformers
module, either directly in the __init__
module or in a utils
module and then importing it from the __init__
.
I imagine the use case where someone simply wants to explore what transformers exist, without using the HyperTransformer
>>> import rdt
>>> rdt.transformers.get_transformers_by_type()
...
rdt/hyper_transformer.py
Outdated
type as an input. | ||
""" | ||
data_type_transformers = {} | ||
transformer_classes = inspect.getmembers(sys.modules[__name__], inspect.isclass) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than using inspect
, I would navigate the subclasses of BaseTransformer
by adding a get_subclasses
classmethod to it.
Something similar is done in SDGym: https://github.com/sdv-dev/SDGym/blob/79321b57a1b2dd416426fc9088e1ff771dd82e9c/sdgym/synthesizers/base.py#L17
Also, as a note, when implementing it here I would merge the get_subclasses
and get_baselines
logic, so only the classes that do not inherit from ABC
directly are included in the output. By doing it this way, we can skip any intermediate abstract Transformers that may end up defining.
rdt/hyper_transformer.py
Outdated
@@ -73,6 +75,42 @@ class HyperTransformer: | |||
'b': 'boolean', | |||
'M': 'datetime', | |||
} | |||
DEFAULT_TRANSFORMERS = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could also be moved over to the transformers
module, and create an additional get_default_transformer(data_type)
function that has the functionality of looking up the given data_type
on this dictionary and if not found calls the get_transformers_by_type
function to choose the first one that it finds that supports it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like the idea of calling get_transformers_by_type
every time we don't have a defined transformer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'll do is add a function that creates the default dict using the DEFAULT_TRANSFORMERS
and get_transformers_by_type
dicts, and we can just use that dict in the HyperTransformer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'll do is add a function that creates the default dict using the
DEFAULT_TRANSFORMERS
andget_transformers_by_type
dicts, and we can just use that dict in theHyperTransformer
I had not seen this. I agree, calling it every time sounds overkill, but somehow I do not like the idea of also calling the function when the HyperTransformer is imported, or every time an instance is created.
What do you think about using functools.cache?
We could actually have get_default_transformers()
which caches its result the first time it is called, but then also have get_default_transformer(transformer)
, which also caches its results, and that under the hood calls the other one and gets the transformer from the dict.
By doing this, the result is the behavior is the same (we end up having the dictionary stored somewhere instead of building it every time), but its initialization is lazy (the dict gets build on the fly the first time it is used)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Let me push a commit and you can take a look
rdt/hyper_transformer.py
Outdated
@@ -73,6 +75,42 @@ class HyperTransformer: | |||
'b': 'boolean', | |||
'M': 'datetime', | |||
} | |||
DEFAULT_TRANSFORMERS = { | |||
'numerical': NumericalTransformer, | |||
'integer': NumericalTransformer(dtype=int), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should create specific transformers for each configuration, so this dictionary can contain classes instead of instances? I'm opening a new issue to discuss this there.
Codecov Report
@@ Coverage Diff @@
## v0.6.0-dev #239 +/- ##
==============================================
- Coverage 93.07% 91.61% -1.47%
==============================================
Files 9 9
Lines 650 739 +89
==============================================
+ Hits 605 677 +72
- Misses 45 62 +17
Continue to review full report at Codecov.
|
rdt/transformers/__init__.py
Outdated
@@ -25,6 +25,14 @@ | |||
transformer.__name__: transformer | |||
for transformer in BaseTransformer.__subclasses__() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could call BaseTransformers.get_subclasses()
here, so TRANSFORMERS
is a list of all transformers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also could we add 'GaussianCopulaTransformer
to the __all__
list above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's already there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! I just added a couple of minor comments.
rdt/transformers/__init__.py
Outdated
for transformer in transformer_classes: | ||
try: | ||
input_type = transformer.get_input_type() | ||
transformers_for_type = data_type_transformers.get(input_type, []) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: We could use a defaultdict(list)
:
data_type_transformers = defaultdict(list)
...
data_type_transformers[input_type].append(transformer)
Or setdefault
:
data_type_transformers.setdefault(input_type, []).append(transformer)
for subclass in cls.__subclasses__(): | ||
if abc.ABC not in subclass.__bases__: | ||
subclasses.append(subclass) | ||
subclasses += subclass.get_subclasses() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add a blank line above this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, meant to request changes.
dbb1162
to
9b2c4f6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking great now! I just added a comment about functools dependency. We may need to do nothing about it, though
@@ -1,5 +1,8 @@ | |||
"""Transformers module.""" | |||
|
|||
from collections import defaultdict | |||
from functools import lru_cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if functools
is part of the standard library. If it is not, we should add it to setup.py
, even if we already install it because one of our dependencies use it, so if they remove it in the future we do not crash.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is part of the standard library
resolves #232
This PR adds the following attributes to the
HyperTransformer
_DTYPES_TO_DATA_TYPES
- a dict mapping the pandas dtypes to an RDT data typeDEFAULT_TRANSFORMERS
- a dict mapping the RDT data types to a default transformerIt also adds the following static method
get_transformers_by_type
- a function that loops through all existing transformers and creates a dict mapping data types to a list of valid transformers for that type.