Update `FloatFormatter` with parameters for the computer representation #521

npatki · 2022-06-20T20:27:05Z

Problem Description

As a user, I want to make sure the min/max values in the reverse transform can be represented by the machine.

Expected behavior

Add the following parameter to FloatFormatter:

computer_representation: Default ('Float'). Accepts: 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64', 'Float'

Functionality Changes:

During fit: Store the original dtype of the column.
During transform: Convert everything to Float for machine learning purposes.
During reverse_transform: Cast back to the original dtype. As an extra measure, clip the values to the min and max machine limits for the given computer_representation, and round back to whole numbers if needed. (This is a no-op for Float.) Note: If learn_min_max_bounds=True, use the learned values instead.

Note: The dtype may be different than the computer representation. For example, pandas might have read in a column as Int64 by default but the user might be telling us it's supposed to be UInt8. Always defer to the parameter, not the dtype.
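The clip-and-round step described for reverse_transform could look roughly like the sketch below. It is only an illustration under stated assumptions: `clip_to_representation` is a hypothetical helper, not part of the library, and it takes the machine limits from `numpy.iinfo` rather than from any learned bounds.

```python
import numpy as np
import pandas as pd


def clip_to_representation(values, computer_representation='Float'):
    """Hypothetical helper: round and clip values to the limits of the representation."""
    if computer_representation == 'Float':
        # No-op for Float, as described in the issue.
        return values

    # 'UInt8' -> dtype('uint8'); np.iinfo gives the machine min and max.
    info = np.iinfo(np.dtype(computer_representation.lower()))

    # Round back to whole numbers first, then clip to the machine limits.
    return values.round().clip(info.min, info.max)


reversed_values = pd.Series([-5.2, 12.6, 300.0])
clipped = clip_to_representation(reversed_values, 'UInt8')
```

Here the limits follow the parameter, not the pandas dtype of the incoming data, which matches the note above about always deferring to the parameter.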

Errors
During fit or transform: Throw an error if the data is out of bounds according to the computer representation. Note: It does not matter what the actual pandas dtype is, only what the computer representation parameter is.

# limits correspond to uint8
transformer = FloatFormatter(computer_representation='UInt8')
transformer.fit(data, column='test')
Error: The minimum value in column 'test' is -5. All values represented by 'UInt8' must be in the range [0, 255].

transformer.transform(data)
Error: The minimum value in column 'test' is -5. All values represented by 'UInt8' must be in the range [0, 255].
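For illustration, the out-of-bounds check behind these errors might be sketched as follows. `validate_bounds` is a hypothetical function name, the use of `ValueError` is an assumption, and the wording of the maximum-value message is inferred by analogy with the minimum-value message shown above.

```python
import numpy as np
import pandas as pd


def validate_bounds(data, column, computer_representation):
    """Hypothetical check: raise if the column exceeds the representation's limits."""
    if computer_representation == 'Float':
        return

    info = np.iinfo(np.dtype(computer_representation.lower()))
    allowed = f"[{info.min}, {info.max}]"
    minimum = data[column].min()
    maximum = data[column].max()

    if minimum < info.min:
        raise ValueError(
            f"The minimum value in column '{column}' is {minimum}. All values "
            f"represented by '{computer_representation}' must be in the range {allowed}."
        )
    if maximum > info.max:
        # Assumed wording, mirroring the minimum-value message.
        raise ValueError(
            f"The maximum value in column '{column}' is {maximum}. All values "
            f"represented by '{computer_representation}' must be in the range {allowed}."
        )


data = pd.DataFrame({'test': [-5, 10, 200]})
try:
    validate_bounds(data, 'test', 'UInt8')
    message = None
except ValueError as error:
    message = str(error)
```

The check only looks at the parameter value, so a column that pandas read as int64 is still validated against the UInt8 range.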

Additional context

Info about pandas dtypes here and here
A reference for standard min/max values: here

See #518 as an example for where this fails today.


fealho · 2022-08-12T21:22:08Z

@npatki @amontanez24 Should we also validate whether the data type is correct? E.g. if the data contains floats but computer_representation is given as an integer, should we raise an error during fit/transform?

npatki · 2022-08-12T21:45:47Z

@fealho Right, the main things to check for would be cases where the user-specified value cannot possibly be correct. We would do this in the fit or transform.

There are only 2 cases I can think of:

1. E.g. if the user said something is a UInt8, all values should be between 0 and 255. The same goes for all other Int types, each with its own range.
2. If the user has specified something is an Int, there can be no fractions. E.g. a value of 2.000 is OK, but a value of 2.5 is not.

I think we are already doing (1) and I am in favor of adding (2).
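A minimal sketch of the second check (no fractional values when an integer representation is specified) could look like this. `has_fractional_values` is a hypothetical helper name, and skipping missing values is an assumption not discussed in the thread.

```python
import pandas as pd


def has_fractional_values(series):
    """Hypothetical check: True if any non-missing value has a fractional part."""
    values = series.dropna()
    # 2.000 % 1 == 0 (acceptable), while 2.5 % 1 == 0.5 (not acceptable).
    return bool((values % 1 != 0).any())


whole_numbers = pd.Series([2.000, 3.0, None])
with_fraction = pd.Series([2.0, 2.5])
```

A transformer could call such a check during fit or transform and raise an error when it returns True for an integer representation.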

npatki added the feature request label Jun 20, 2022

npatki changed the title ~~Update FloatFormatter with min/max parameters~~ Update FloatFormatter with parameters for the computer representation Jun 20, 2022

fealho mentioned this issue Aug 12, 2022

Add computer_representation parameter #536

Merged

fealho closed this as completed in #536 Aug 16, 2022
