Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling Input to auto pytorch #89

Merged

Conversation

franchuterivera
Copy link
Contributor

  • Enable any input type to autopytorch (if input in (pd.DataFrame, List, sparse, numpy))

  • Better error to user and plenty of more testing (included moving dataset to pytest)

@@ -19,7 +19,9 @@ def fit(self, X: Dict[str, Any], y: Any = None) -> BaseEncoder:

self.check_requirements(X, y)

self.preprocessor['categorical'] = OE(categories=X['dataset_properties']['categories'])
self.preprocessor['categorical'] = OE(handle_unknown='use_encoded_value',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we dont need an ordinal encoder in the search space as the data has already gone through this phase. In case for a dataset, OneHotEncoding is not preferred, NoEncoder will have the same effect as doing Ordinal encoding.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh no wait, we do not touch the numerical columns. The objective here is to make the data numerical.

So if you have a numerical columns with values like 0.0000000001 and so on, maybe changing that to 1, 2, 3, 4 and so one with an ordinal encoder improves the performance.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but in its current form, its just working on the categorical columns. Also, that could be handled with the scaler component. Me and sebastian also discussed today about not having an encoding component. As when we merge the embedding PR which I'll finish soon, PyTorch has this nn.Embedding layer which performs OneHotEncoding and learns the embedding all by itself.

A holdout set of labels
"""
if not self.is_classification or self.type_of_target == 'multilabel-indicator':
# Only fit an encoder for classification tasks
Copy link
Contributor

@ravinkohli ravinkohli Feb 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also raise an error if in regression the data type of the target is not integer? Currently it seems like there is no check for that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think #85 can handle this. Now I do not know what you mean exactly, as we have different regression targets. Usually we expected regression to have float type, which is something that the code handles right now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I meant if somehow someone gives y values as a string. Which does not contain numbers but words, by mistake. I think in this validator module we should take care of this as well

Copy link
Contributor

@ravinkohli ravinkohli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks really good. I have left a few minor comments to be addressed.

@ravinkohli ravinkohli mentioned this pull request Feb 10, 2021
@ravinkohli ravinkohli merged commit b48b952 into automl:refactor_development Feb 15, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants