Handling Input to auto pytorch #89
Conversation
@@ -19,7 +19,9 @@ def fit(self, X: Dict[str, Any], y: Any = None) -> BaseEncoder:

        self.check_requirements(X, y)

-       self.preprocessor['categorical'] = OE(categories=X['dataset_properties']['categories'])
+       self.preprocessor['categorical'] = OE(handle_unknown='use_encoded_value',
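A minimal sketch of the change under discussion, assuming `OE` is `sklearn.preprocessing.OrdinalEncoder` (the `handle_unknown='use_encoded_value'` option exists since scikit-learn 0.24); the exact constructor arguments in the PR may differ:

```python
# Sketch: map categories never seen during fit to a sentinel value
# instead of raising an error at transform time.
from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit([['red'], ['green'], ['blue']])

# 'purple' was never seen during fit, so it is encoded as -1.
out = enc.transform([['green'], ['purple']])
print(out)  # [[1.], [-1.]]
```

With the default `handle_unknown='error'`, the same `transform` call would raise a `ValueError`, so this option matters whenever the test split can contain categories absent from the training split.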
Maybe we don't need an ordinal encoder in the search space, as the data has already gone through this phase. In cases where OneHotEncoding is not preferred for a dataset, NoEncoder will have the same effect as doing ordinal encoding.
Oh no wait, we do not touch the numerical columns. The objective here is to make the data numerical.
So if you have a numerical column with values like 0.0000000001 and so on, maybe changing that to 1, 2, 3, 4 and so on with an ordinal encoder improves the performance.
But in its current form, it's just working on the categorical columns. Also, that could be handled with the scaler component. Sebastian and I also discussed today not having an encoding component at all: once we merge the embedding PR, which I'll finish soon, PyTorch's nn.Embedding layer performs the one-hot encoding and learns the embedding all by itself.
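The point about nn.Embedding can be illustrated without PyTorch: an embedding layer is a learnable lookup table, and looking up an integer category index is mathematically the same as one-hot encoding followed by a matrix multiply. A plain-Python sketch (the table values here are arbitrary stand-ins for learned weights):

```python
# Illustration: embedding lookup == one-hot vector @ embedding matrix,
# which is why a separate one-hot encoding component becomes redundant.

def one_hot(index, num_categories):
    # e.g. one_hot(1, 3) -> [0.0, 1.0, 0.0]
    return [1.0 if i == index else 0.0 for i in range(num_categories)]

def matvec(vec, matrix):
    # vec: length-n row vector; matrix: n x d table -> length-d result
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

# A 3-category, 2-dimensional embedding table (weights would be learned).
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

category = 1
# Direct row lookup gives the same result as the one-hot matrix product.
assert matvec(one_hot(category, 3), table) == table[category]
```

In PyTorch the lookup form (`nn.Embedding`) is what gets used in practice, since it avoids materializing the one-hot vectors entirely.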
        A holdout set of labels
        """
        if not self.is_classification or self.type_of_target == 'multilabel-indicator':
            # Only fit an encoder for classification tasks
Should we also raise an error if in regression the data type of the target is not integer? Currently it seems like there is no check for that.
I think #85 can handle this. I'm not sure what you mean exactly, as we have different regression targets. Usually we expect regression targets to have float type, which is something the code handles right now.
Oh, I meant the case where someone mistakenly gives y values as strings that contain words rather than numbers. I think this validator module should take care of that as well.
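The suggested check could look roughly like this; `check_regression_targets` is a hypothetical helper, not the actual validator API, and it deliberately still accepts numeric strings like "3.5":

```python
# Sketch of a validator check: reject regression targets that are
# non-numeric strings (words given by mistake), as discussed above.
from numbers import Number

def check_regression_targets(y):
    for i, value in enumerate(y):
        if isinstance(value, str):
            try:
                float(value)  # numeric strings such as "3.5" are tolerated
            except ValueError:
                raise ValueError(
                    f"Regression target at index {i} is a non-numeric "
                    f"string: {value!r}"
                )
        elif not isinstance(value, Number):
            raise ValueError(
                f"Regression target at index {i} is not numeric: {value!r}"
            )

check_regression_targets([1.0, 2.5, "3.5"])  # passes silently
```

A call like `check_regression_targets([1.0, "cat"])` would then raise a `ValueError` instead of failing later inside model fitting, which is the kind of better error message this PR is aiming for.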
This PR looks really good. I have left a few minor comments to be addressed.
Enable any input type to autopytorch (pd.DataFrame, List, sparse, numpy). Better errors to the user and plenty more testing (including moving the dataset tests to pytest).