
Create validate_transformer_quality function #253

Closed
csala opened this issue Sep 24, 2021 · 1 comment · Fixed by #299
csala (Contributor) commented Sep 24, 2021

A function should be implemented to automatically validate the data quality of any new Transformer by running the quality tests mentioned in #252 and reporting the results.

Function Name and Module

The function should be implemented inside tests/contributing.py and should be called validate_transformer_quality.

Inputs

The function should accept a single input:

  • transformer (class or str): Transformer class or full Python name of the class (e.g. DatetimeTransformer or "rdt.transformers.time.DatetimeTransformer")
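A minimal sketch of how the string form of this input could be resolved back to a class (the helper name is hypothetical and not part of this spec):

```python
import importlib


def _resolve_transformer(transformer):
    """Return the Transformer class, importing it from its full Python name if needed."""
    if isinstance(transformer, str):
        # e.g. "rdt.transformers.time.DatetimeTransformer"
        module_name, class_name = transformer.rsplit('.', 1)
        module = importlib.import_module(module_name)
        return getattr(module, class_name)

    return transformer
```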

Outputs

The function should return a pandas.DataFrame containing information about the results obtained by the Transformer. The DataFrame contains one row per dataset, with the following columns:

  • Dataset Name: The name of the dataset
  • Score: The score obtained on the dataset (TBD)
  • Acceptable: Whether the value is above the rejection threshold (TBD)
  • Compared to Average: Ratio comparing this Transformer's quality to the average of all the other transformers, where 1 means the quality is the same, >1 means the quality is worse, and <1 means the quality is better.

Output DataFrame Example

| Dataset Name | Score  | Acceptable | Compared to Average |
|--------------|--------|------------|---------------------|
| Dataset A    | 0.6534 | Yes        | 1.345               |
| Dataset B    | 0.4423 | Yes        | 1.234               |
| Dataset C    | 0.3498 | No         | 0.548               |
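A minimal sketch of how such a report could be assembled. The threshold value, the helper name, and the direction of the "Compared to Average" ratio (average divided by this score, so that >1 means worse, as described above) are all assumptions:

```python
import pandas as pd

# Hypothetical rejection threshold; the actual value is still TBD in this issue.
THRESHOLD = 0.4


def _build_report(scores, average_scores):
    """Build the report DataFrame from per-dataset scores.

    ``scores`` maps dataset name -> score for the Transformer under test;
    ``average_scores`` maps dataset name -> average score of the other transformers.
    """
    rows = []
    for dataset_name, score in scores.items():
        rows.append({
            'Dataset Name': dataset_name,
            'Score': score,
            'Acceptable': 'Yes' if score >= THRESHOLD else 'No',
            'Compared to Average': average_scores[dataset_name] / score,
        })

    return pd.DataFrame(rows)
```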

Behavior

This function runs all the quality tests using the Transformer on all the real-world datasets that contain the Transformer data type. It produces a report based on how well the correlations are preserved and how well a synthetic data generator (a copulas.GaussianMultivariate?) performs when trained on the data produced by this Transformer, also comparing the results to the quality of the other transformers of the same Data Type.
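A rough sketch of what one such quality test could look like; the scoring formula and the assumption that the Transformer's output is already a numeric DataFrame are illustrative, not part of this spec:

```python
from copulas.multivariate import GaussianMultivariate


def _correlation_score(transformed_data):
    """Score how well a GaussianMultivariate trained on the transformed data preserves correlations.

    ``transformed_data`` is assumed to be a numeric pandas.DataFrame obtained by applying
    the Transformer under test to one of the real-world datasets.
    """
    model = GaussianMultivariate()
    model.fit(transformed_data)
    sampled = model.sample(len(transformed_data))

    # Mean absolute difference between the real and synthetic correlation matrices,
    # turned into a score where 1 means correlations were perfectly preserved.
    diff = (transformed_data.corr() - sampled.corr()).abs()
    return 1 - diff.values.mean()
```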

Prints to console

The function prints the following information to the console:

  • The Transformer that is being validated
  • Whether the Quality tests were successful or not

Usage Example

(Screenshot of the expected usage and output in the original issue.)
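A hypothetical call illustrating the intended usage, assuming the module is importable as tests.contributing:

```python
from tests.contributing import validate_transformer_quality

# Either the class itself or its full Python name can be passed.
results = validate_transformer_quality('rdt.transformers.time.DatetimeTransformer')
print(results)
```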

csala added the feature request label on Sep 24, 2021
amontanez24 (Contributor) commented Oct 15, 2021

When scoring the quality of a transformer on a dataset, we have been using the coefficient of determination for predicting each of the numeric columns in that dataset. I am unsure of how to compile this into a table like the one described above, because there are multiple scores for each dataset. Averaging them doesn't really make sense, since many of them might be close to 0 or even negative. Taking the max also doesn't make sense since one transformer might be better at predicting column A while another transformer of the same data type might be better at predicting column B and those scores might not be that close.

It is also worth noting that we tried predicting all the numeric columns together to see if that would yield one score per dataset, but it ended up just yielding bad scores for everything.
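For reference, a minimal sketch of the per-column coefficient-of-determination scoring described above; the regressor choice and the train/test split are assumptions, and the dict result shows why there is no single obvious score per dataset:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def _r2_per_column(transformed, numeric_columns):
    """Compute one R^2 score per numeric column, predicting it from the Transformer's output.

    ``transformed`` holds the Transformer's output features; ``numeric_columns`` is a
    DataFrame with the dataset's numeric columns. Returns a dict of column -> score.
    """
    scores = {}
    for column in numeric_columns:
        X_train, X_test, y_train, y_test = train_test_split(
            transformed, numeric_columns[column], random_state=0
        )
        model = LinearRegression().fit(X_train, y_train)
        scores[column] = r2_score(y_test, model.predict(X_test))

    return scores
```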
