-
Notifications
You must be signed in to change notification settings - Fork 86
One hot encoder #76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
One hot encoder #76
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @gaxler, thank you for contributing your time to work on #58!
I think this PR is close to be merged into the development branch but I would like to request a couple of additional changes from you if you don't mind.
I'd like to make sure that API of the OneHotEncoder
is in alignment with API of the rest of algorithms, as well as with Scikit learn's API.
Take a look at the api
module that summarizes interface and functions that are used to manipulate data and perform machine learning in Smartcore. Our API was largely modeled after Scikit's API.
Another good place to look is Scikit's OneHotEncoder
.
What we want is to have methods fit
and transform
that can be used to fit and transform entire dataframe.
This API compartability with Scikit makes easier switching to Smartcore and will let us integrate with Scikit Learn later down the road.
Feel free to add any additional functions that you feel are useful for Smartcore users
src/preprocessing/target_encoders.rs
Outdated
|
||
impl<'a, LabelType: Hash + Eq + Clone> OneHotEncoder<LabelType> { | ||
/// Fit an encoder to a lable list | ||
pub fn fit(labels: &[LabelType]) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please rename this method? Method fit
is reserved for an operation that estimates parameters of a transformer from n-dimensional array. Here is a good example of how method fit
should look like https://github.com/smartcorelib/smartcore/blob/development/src/decomposition/pca.rs#L119
src/preprocessing/target_encoders.rs
Outdated
|
||
/// Transform a slice of label types into one-hot vectors | ||
/// None is returned if unknown label is encountered | ||
pub fn transform<U: RealNumber>(&self, labels: &[LabelType]) -> Vec<Option<Vec<U>>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please rename this method? Method transform
is reserved for an operation that 2-dimensional array and transforms every single row using one-hot encoding.
Here is a good example of how method transform
should look like https://github.com/smartcorelib/smartcore/blob/development/src/decomposition/pca.rs#L126
@VolodymyrOrlov Thank You for taking the time to review this, appreciate the feed-back! Let me make sure that I fully understand your comments: We do want to have a consistent api with I'm a bit confused about what the dataframe should be in this case. smartcore models act on vectors and matricies of The simplest solution I can think of is to make |
@gaxler This is correct, we'd like to support both, series and dataframes. I agree with you. OneHotEncoder should one-hot-encode not only floats but also other types, like integers and strings. The problem is that current design is not flexible enough to support all these types. There is an easy and hard paths, depending on how much time you have to work on this feature. The easy path is to convert only nominal and ordinal floats. In this case you don't have to refactor Smartcore interfaces. You can either infer categorical columns from data (e.g. by taking a random sample and looking at values in the sample), or take a list of column IDs that should be encoded from user, as you've suggested. In fact, I think we should do both. The hard path is to introduce two new types: dataframe and series. These types will be modeled after Pandas DataFrame and Series and will encapsulate data as a matrix (or as a vector) and additional metadata, like column names and types. Later we can switch all Smartcore algorithms to use these new types instead of matrices and vectors. Since most algorithms cannot work with categorical data directly we'll have to implement internal converters that will extract data from dataframe and series and transform these values into floats or throw an error, if it is not possible. Let me know what you think about these options. Also, feel free to connect with me in Discord, if you have an account there. My ID is volodymyr.orlov#7062 |
I think I'll start with the easy path, just to make it usable. Afterwards, would love to work on the new data types. |
…egorical variables. Since we only support RealNumbers for now, the idea is to treat round numbers as ordinal (or nominal if user chooses to ignore order) categories.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gaxler LGTM! Let me know if you would like me to merge this PR into development or prefer to keep it open for a while.
(changed the order of coping, first do the categorical, than copy ther rest)
Thanks @VolodymyrOrlov ! |
… to SeriesEncoders
src/preprocessing/mod.rs
Outdated
@@ -0,0 +1,5 @@ | |||
/// Transform a data matrix by replaceing all categorical variables with their one-hot vector equivalents | |||
pub mod categorical_encoder; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The module name is a bit too long. Can you rename it to either categorical
or one_hot_encoder
?
|
||
/// OneHotEncoder Parameters | ||
#[derive(Debug, Clone)] | ||
pub struct OneHotEncoderParams { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you mind implementing a builder and Default
for this struct? Something similar to https://github.com/smartcorelib/smartcore/blob/development/src/decomposition/pca.rs#L98?
Right now you have one parameter but with many parameters instantiation of the struct may be a bit easier to do if you have default + builder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will be a bit of a problem. The parameter doesn't have any reasonable defaults. It just indicates what columns of a matrix represent categories
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. In this case please ignore this suggestion.
This PR implements OneHotEncoder for single class per label (#58 (comment))
Encoder supports any Hash+Clone types as labels and prduces RealNumber vectors.
Will wait for feedback before implementing multi-class encoders.