Create ways for HyperTransformer to know which transformers to apply to each data type #232 #239

amontanez24 · 2021-09-21T21:23:07Z

resolves #232

This PR adds the following attributes to the HyperTransformer:

_DTYPES_TO_DATA_TYPES - a dict mapping the pandas dtypes to an RDT data type
DEFAULT_TRANSFORMERS - a dict mapping the RDT data types to a default transformer

It also adds the following static method:

get_transformers_by_type - a function that loops through all existing transformers and creates a dict mapping each data type to a list of valid transformers for that type.
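
As a rough illustration of how the two dicts described above could chain together, here is a minimal sketch. The transformer classes are empty stand-ins, only the 'b' and 'M' keys appear in the diff, and the remaining keys as well as the helper name default_transformer_for_kind are assumptions made for the example, not the code in this PR.

```python
# Empty stand-ins for the real classes in rdt.transformers.
class NumericalTransformer:
    pass


class CategoricalTransformer:
    pass


class BooleanTransformer:
    pass


class DatetimeTransformer:
    pass


# Keys are dtype "kind" characters. Only 'b' and 'M' are visible in the
# diff; the other entries are assumed for illustration.
DTYPES_TO_DATA_TYPES = {
    'i': 'integer',
    'f': 'numerical',
    'O': 'categorical',
    'b': 'boolean',
    'M': 'datetime',
}

# Classes only, to keep the sketch simple.
DEFAULT_TRANSFORMERS = {
    'integer': NumericalTransformer,
    'numerical': NumericalTransformer,
    'categorical': CategoricalTransformer,
    'boolean': BooleanTransformer,
    'datetime': DatetimeTransformer,
}


def default_transformer_for_kind(kind):
    """Hypothetical helper: dtype kind -> RDT data type -> default transformer."""
    data_type = DTYPES_TO_DATA_TYPES[kind]
    return DEFAULT_TRANSFORMERS[data_type]
```

With a pandas column, the kind character would typically come from `column.dtype.kind`.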

csala

I added a couple of comments to suggest a different approach to this implementation.

csala · 2021-09-23T11:41:38Z

rdt/hyper_transformer.py

+    }
+
+    @staticmethod
+    def get_transformers_by_type():


I think it would make more sense to put this as a function inside the transformers module, either directly in the __init__ module or in a utils module and then importing it from the __init__.

I imagine the use case where someone simply wants to explore what transformers exist, without using the HyperTransformer

>>> import rdt
>>> rdt.transformers.get_transformers_by_type()
...

csala · 2021-09-23T11:47:04Z

rdt/hyper_transformer.py

+                    type as an input.
+        """
+        data_type_transformers = {}
+        transformer_classes = inspect.getmembers(sys.modules[__name__], inspect.isclass)


Rather than using inspect, I would navigate the subclasses of BaseTransformer by adding a get_subclasses classmethod to it.

Something similar is done in SDGym: https://github.com/sdv-dev/SDGym/blob/79321b57a1b2dd416426fc9088e1ff771dd82e9c/sdgym/synthesizers/base.py#L17

Also, as a note, when implementing it here I would merge the get_subclasses and get_baselines logic, so that only the classes that do not inherit from ABC directly are included in the output. That way, we can skip any intermediate abstract Transformers that we may end up defining.
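
A minimal sketch of the suggestion above, assuming a BaseTransformer that derives from abc.ABC. IntermediateTransformer, ConcreteA and ConcreteB are hypothetical classes added only to show the filtering.

```python
import abc


class BaseTransformer(abc.ABC):

    @classmethod
    def get_subclasses(cls):
        """Recursively collect subclasses that do not list ABC as a direct base."""
        subclasses = []
        for subclass in cls.__subclasses__():
            # Classes that inherit from ABC directly are treated as abstract
            # intermediates and left out of the result.
            if abc.ABC not in subclass.__bases__:
                subclasses.append(subclass)

            # Always recurse, so concrete children of abstract classes are found.
            subclasses += subclass.get_subclasses()

        return subclasses


class IntermediateTransformer(BaseTransformer, abc.ABC):
    """Hypothetical abstract helper; skipped by get_subclasses."""


class ConcreteA(BaseTransformer):
    """Hypothetical concrete transformer."""


class ConcreteB(IntermediateTransformer):
    """Hypothetical concrete transformer below an abstract intermediate."""


found = BaseTransformer.get_subclasses()
```

Here `found` contains ConcreteA and ConcreteB but not IntermediateTransformer.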

csala · 2021-09-23T11:49:59Z

rdt/hyper_transformer.py

@@ -73,6 +75,42 @@ class HyperTransformer:
        'b': 'boolean',
        'M': 'datetime',
    }
+    DEFAULT_TRANSFORMERS = {


I think this could also be moved to the transformers module, together with an additional get_default_transformer(data_type) function that looks up the given data_type in this dictionary and, if it is not found, calls get_transformers_by_type to pick the first transformer it finds that supports it.

I don't like the idea of calling get_transformers_by_type every time we don't have a defined transformer.

What I'll do is add a function that creates the default dict using the DEFAULT_TRANSFORMERS and get_transformers_by_type dicts, and we can just use that dict in the HyperTransformer

> What I'll do is add a function that creates the default dict using the DEFAULT_TRANSFORMERS and get_transformers_by_type dicts, and we can just use that dict in the HyperTransformer

I had not seen this. I agree, calling it every time sounds overkill, but somehow I do not like the idea of also calling the function when the HyperTransformer is imported, or every time an instance is created.

What do you think about using functools.cache?
We could actually have get_default_transformers() which caches its result the first time it is called, but then also have get_default_transformer(transformer), which also caches its results, and that under the hood calls the other one and gets the transformer from the dict.

By doing this, the behavior is the same (we end up having the dictionary stored somewhere instead of building it every time), but the initialization is lazy (the dict gets built on the fly the first time it is used).

Makes sense. Let me push a commit and you can take a look
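
The lazy, cached lookup discussed in this thread could look roughly like the sketch below. Strings stand in for transformer classes, the contents of both dicts are made up, and BUILD_CALLS exists only to show that the dict is built once. It uses functools.lru_cache, which is available on older Python versions than functools.cache (added in Python 3.9).

```python
from functools import lru_cache

BUILD_CALLS = []  # records how many times the defaults dict is actually built

# Made-up contents; strings stand in for transformer classes.
DEFAULT_TRANSFORMERS = {'boolean': 'BooleanTransformer'}


def get_transformers_by_type():
    """Stand-in for the real function that inspects all transformers."""
    return {
        'boolean': ['BooleanTransformer'],
        'categorical': ['OneHotEncodingTransformer', 'CategoricalTransformer'],
    }


@lru_cache()
def get_default_transformers():
    """Build the data type -> default transformer dict on first use only."""
    BUILD_CALLS.append(1)
    defaults = dict(DEFAULT_TRANSFORMERS)
    for data_type, candidates in get_transformers_by_type().items():
        # Fall back to the first valid transformer when no default is defined.
        defaults.setdefault(data_type, candidates[0])

    return defaults


@lru_cache()
def get_default_transformer(data_type):
    """Look up one data type in the cached defaults dict."""
    return get_default_transformers()[data_type]


first = get_default_transformer('categorical')
second = get_default_transformer('boolean')
```

Nothing is computed at import time; the dict is built on the first call and reused afterwards.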

csala · 2021-09-23T11:50:55Z

rdt/hyper_transformer.py

@@ -73,6 +75,42 @@ class HyperTransformer:
        'b': 'boolean',
        'M': 'datetime',
    }
+    DEFAULT_TRANSFORMERS = {
+        'numerical': NumericalTransformer,
+        'integer': NumericalTransformer(dtype=int),


Maybe we should create specific transformers for each configuration, so this dictionary can contain classes instead of instances? I'm opening a new issue to discuss this there.
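
One way to read this suggestion, as a sketch only: a thin subclass fixes the configuration, so the dictionary can hold classes throughout. IntegerTransformer is a hypothetical name, and NumericalTransformer here is a simplified stand-in that only stores dtype.

```python
class NumericalTransformer:
    """Simplified stand-in for rdt's NumericalTransformer."""

    def __init__(self, dtype=None):
        self.dtype = dtype


class IntegerTransformer(NumericalTransformer):
    """Hypothetical subclass that pins dtype=int."""

    def __init__(self):
        super().__init__(dtype=int)


# Every value is now a class, so callers can instantiate uniformly.
DEFAULT_TRANSFORMERS = {
    'numerical': NumericalTransformer,
    'integer': IntegerTransformer,
}

transformer = DEFAULT_TRANSFORMERS['integer']()
```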

codecov-commenter · 2021-09-23T18:38:01Z

Codecov Report

Merging #239 (9b2c4f6) into v0.6.0-dev (03ecfec) will decrease coverage by 1.46%.
The diff coverage is 50.00%.

@@              Coverage Diff               @@
##           v0.6.0-dev     #239      +/-   ##
==============================================
- Coverage       93.07%   91.61%   -1.47%     
==============================================
  Files               9        9              
  Lines             650      739      +89     
==============================================
+ Hits              605      677      +72     
- Misses             45       62      +17

| Impacted Files | Coverage Δ | |
|---|---|---|
| rdt/transformers/__init__.py | `60.86% <30.76%> (-39.14%)` | ⬇️ |
| rdt/hyper_transformer.py | `100.00% <100.00%> (ø)` | |
| rdt/transformers/base.py | `84.53% <100.00%> (+1.58%)` | ⬆️ |
| rdt/transformers/boolean.py | `100.00% <0.00%> (ø)` | |
| rdt/transformers/null.py | `98.07% <0.00%> (+1.92%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 03ecfec...9b2c4f6.

katxiao · 2021-09-24T15:34:53Z

rdt/transformers/__init__.py

@@ -25,6 +25,14 @@
    transformer.__name__: transformer
    for transformer in BaseTransformer.__subclasses__()


We could call BaseTransformer.get_subclasses() here, so that TRANSFORMERS includes all transformers.

Also, could we add 'GaussianCopulaTransformer' to the __all__ list above?

It's already there

csala

LGTM! I just added a couple of minor comments.

csala · 2021-09-24T15:37:28Z

rdt/transformers/__init__.py

+    for transformer in transformer_classes:
+        try:
+            input_type = transformer.get_input_type()
+            transformers_for_type = data_type_transformers.get(input_type, [])


nit: We could use a defaultdict(list):

data_type_transformers = defaultdict(list)
...
data_type_transformers[input_type].append(transformer)

Or setdefault:

data_type_transformers.setdefault(input_type, []).append(transformer)
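
For comparison, both variants are sketched below on stand-in classes. The classes and their get_input_type return values are made up, and the two function names are hypothetical, chosen only to tell the variants apart.

```python
from collections import defaultdict


class BooleanTransformer:
    """Stand-in class; the return value is assumed for the example."""

    @classmethod
    def get_input_type(cls):
        return 'boolean'


class CategoricalTransformer:
    @classmethod
    def get_input_type(cls):
        return 'categorical'


class OneHotEncodingTransformer:
    @classmethod
    def get_input_type(cls):
        return 'categorical'


def group_with_defaultdict(transformer_classes):
    """Group transformer classes by input type using defaultdict(list)."""
    data_type_transformers = defaultdict(list)
    for transformer in transformer_classes:
        data_type_transformers[transformer.get_input_type()].append(transformer)

    # Convert back so later lookups of missing keys raise KeyError.
    return dict(data_type_transformers)


def group_with_setdefault(transformer_classes):
    """Same grouping using dict.setdefault on a plain dict."""
    data_type_transformers = {}
    for transformer in transformer_classes:
        input_type = transformer.get_input_type()
        data_type_transformers.setdefault(input_type, []).append(transformer)

    return data_type_transformers


classes = [BooleanTransformer, CategoricalTransformer, OneHotEncodingTransformer]
by_type = group_with_defaultdict(classes)
```

Both produce the same mapping; the defaultdict version avoids repeating the empty-list default at each call site.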

csala · 2021-09-24T15:38:09Z

rdt/transformers/base.py

+        for subclass in cls.__subclasses__():
+            if abc.ABC not in subclass.__bases__:
+                subclasses.append(subclass)
+            subclasses += subclass.get_subclasses()


Can we add a blank line above this one?

katxiao

Oops, meant to request changes.

csala

Looking great now! I just added a comment about the functools dependency. We may not need to do anything about it, though.

csala · 2021-09-24T16:37:03Z

rdt/transformers/__init__.py

@@ -1,5 +1,8 @@
 """Transformers module."""

+from collections import defaultdict
+from functools import lru_cache


I'm not sure if functools is part of the standard library. If it is not, we should add it to setup.py, even if we already install it because one of our dependencies uses it, so that we do not crash if they remove it in the future.

It is part of the standard library

…to each data type #232 (#239)

* adding get_transformers_by_type function
* adding other attributes and fixing typo
* pr comments
* adding default transformers method
* pr comments
* adding caching and some cleanup

amontanez24 changed the base branch from master to update-baseclass September 21, 2021 21:37

csala changed the title ~~RDT - Create ways for HyperTransformer to know which transformers to apply to each data type #232~~ Create ways for HyperTransformer to know which transformers to apply to each data type #232 Sep 22, 2021

Base automatically changed from update-baseclass to v0.6.0-dev September 22, 2021 21:41

amontanez24 force-pushed the rdt-232-data-type-transformers branch from 1b3c158 to 57a1932 Compare September 22, 2021 21:59

csala suggested changes Sep 23, 2021

amontanez24 marked this pull request as ready for review September 23, 2021 19:41

csala mentioned this pull request Sep 24, 2021

Add gaussian copula to init #246

Closed

csala requested review from katxiao and a team and removed request for a team September 24, 2021 12:06

katxiao approved these changes Sep 24, 2021

csala approved these changes Sep 24, 2021

katxiao suggested changes Sep 24, 2021

amontanez24 requested a review from katxiao September 24, 2021 15:57

katxiao approved these changes Sep 24, 2021

amontanez24 added 9 commits September 24, 2021 11:20

adding get_transformers_by_type function

a2b9ce4

adding other attributes and fixing typo

cc9c38c

lint issues

4fb6e15

pr comments

fbda538

adding default transformers method

d4b7da6

pr comments

857ade7

sorting error

86f238a

adding caching and some cleanup

69ee13d

lint error

9b2c4f6

amontanez24 force-pushed the rdt-232-data-type-transformers branch from dbb1162 to 9b2c4f6 Compare September 24, 2021 16:22

csala approved these changes Sep 24, 2021

amontanez24 merged commit 3be22ab into v0.6.0-dev Sep 24, 2021

amontanez24 deleted the rdt-232-data-type-transformers branch September 24, 2021 17:06
