Update FloatFormatter with parameters for the computer representation #521

Closed
npatki opened this issue Jun 20, 2022 · 2 comments · Fixed by #536
Labels
feature request Request for a new feature

Comments

Contributor

npatki commented Jun 20, 2022

Problem Description

As a user, I want to make sure the min/max values in the reverse transform can be represented by the machine.

Expected behavior

Add the following parameter to FloatFormatter:

  • computer_representation: Default 'Float'. Accepts: 'Int8', 'Int16', 'Int32', 'Int64', 'UInt8', 'UInt16', 'UInt32', 'UInt64', 'Float'

Functionality Changes:

  • During fit, store the original dtype of the column
  • During the transform, convert everything to Float for machine learning purposes.
  • During the reverse_transform: Cast back to the original dtype. As an extra measure, clip the values to the min and max machine limits for the given computer_representation, and round to whole numbers if needed. (This is a no-op for 'Float'.) Note: If learn_min_max_bounds=True, use the learned values instead.

Note: The dtype may be different than the computer representation. For example, pandas might have read in a column as Int64 by default but the user might be telling us it's supposed to be UInt8. Always defer to the parameter, not the dtype.
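
The reverse_transform behavior described above could be sketched roughly as follows. This is only an illustration, not the RDT implementation: the helper `clip_and_round`, the `INTEGER_BOUNDS` lookup and the `learned_bounds` argument are hypothetical names, and the limits are taken from `numpy.iinfo`.

```python
import numpy as np
import pandas as pd

# Hypothetical lookup (not part of RDT): representation name -> (min, max).
INTEGER_BOUNDS = {
    name: (np.iinfo(name.lower()).min, np.iinfo(name.lower()).max)
    for name in ('Int8', 'Int16', 'Int32', 'Int64',
                 'UInt8', 'UInt16', 'UInt32', 'UInt64')
}


def clip_and_round(values, computer_representation='Float', learned_bounds=None):
    """Sketch of the post-processing proposed for reverse_transform."""
    if learned_bounds is not None:
        # Stands in for learn_min_max_bounds=True: learned values take priority.
        low, high = learned_bounds
        values = values.clip(low, high)
    elif computer_representation != 'Float':
        # Always defer to the parameter, not to the pandas dtype.
        low, high = INTEGER_BOUNDS[computer_representation]
        values = values.clip(low, high)

    if computer_representation != 'Float':
        values = values.round()

    return values


synthetic = pd.Series([-5.2, 100.6, 300.0])
print(clip_and_round(synthetic, 'UInt8').tolist())
```

The sketch deliberately stops before the final cast to the original dtype. For the 64-bit representations, float64 cannot represent the upper limits exactly, so a real implementation would need extra care before casting.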

Errors
During fit or transform: Throw an error if the data is out of bounds according to the computer representation. Note: It does not matter what the actual pandas dtype is, only what the computer representation parameter is.

# limits correspond to uint8
transformer = FloatFormatter(computer_representation='UInt8')
transformer.fit(data, column='test')
Error: The minimum value in column 'test' is -5. All values represented by 'UInt8' must be in the range [0, 255].

transformer.transform(data)
Error: The minimum value in column 'test' is -5. All values represented by 'UInt8' must be in the range [0, 255].
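
One possible way to produce the error above is sketched here. The function `validate_bounds` is a hypothetical helper, not an existing RDT function. The wording of the minimum-value message follows the example in this issue, while the maximum-value message is an assumed counterpart.

```python
import numpy as np
import pandas as pd


def validate_bounds(data, column, computer_representation):
    """Hypothetical check: raise if the column exceeds the representation's limits."""
    if computer_representation == 'Float':
        return  # no integer limits to enforce

    # Limits come from the parameter, regardless of the column's pandas dtype.
    info = np.iinfo(computer_representation.lower())
    allowed = (
        f"All values represented by '{computer_representation}' "
        f"must be in the range [{info.min}, {info.max}]."
    )

    col_min = data[column].min()
    col_max = data[column].max()
    if col_min < info.min:
        raise ValueError(
            f"The minimum value in column '{column}' is {col_min}. {allowed}"
        )
    if col_max > info.max:
        # Assumed wording, mirroring the minimum-value message.
        raise ValueError(
            f"The maximum value in column '{column}' is {col_max}. {allowed}"
        )


data = pd.DataFrame({'test': [-5, 10, 20]})
try:
    validate_bounds(data, 'test', 'UInt8')
except ValueError as error:
    print(f'Error: {error}')
```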

Additional context

  • Info about pandas dtypes here and here
  • A reference for standard min/max values: here

See #518 as an example for where this fails today.

@npatki npatki added the feature request Request for a new feature label Jun 20, 2022
@npatki npatki changed the title Update FloatFormatter with min/max parameters Update FloatFormatter with parameters for the computer representation Jun 20, 2022
Member

fealho commented Aug 12, 2022

@npatki @amontanez24 Should we also validate whether the data type is correct? E.g. if the data contains floats but computer_representation is given as an integer, should we raise an error during fit/transform?

Contributor Author

npatki commented Aug 12, 2022

@fealho Right, the main thing to check is whether the user-specified value cannot possibly be correct. We would do this during fit or transform.

There are only 2 cases I can think of:

  1. E.g., if the user said something is a UInt8, all values should be between 0 and 255. The same applies to all other integer types, each with its own range.
  2. If the user has specified that something is an integer type, there can be no fractional values. E.g., a value of 2.000 is OK, but a value of 2.5 is not.

I think we are already doing (1) and I am in favor of adding (2).
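
Check (2) might look something like the sketch below. The helper names and the error wording are assumptions made for illustration, not code from RDT. Missing values are ignored here, which is also an assumption.

```python
import pandas as pd


def has_fractional_values(values):
    """Return True if any non-missing value has a fractional part."""
    values = pd.Series(values).dropna().astype(float)
    return bool((values % 1 != 0).any())


def validate_whole_numbers(data, column, computer_representation):
    """Hypothetical check for integer representations (error wording is assumed)."""
    if computer_representation == 'Float':
        return
    if has_fractional_values(data[column]):
        raise ValueError(
            f"The column '{column}' contains float values that cannot be "
            f"represented by '{computer_representation}'."
        )


print(has_fractional_values([2.000, 3.0]))  # 2.000 is a whole number
print(has_fractional_values([2.0, 2.5]))    # 2.5 has a fractional part
```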
