Skip to content

One hot encoder #76

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 35 commits into from
Feb 12, 2021
Merged

Conversation

gaxler
Copy link
Collaborator

@gaxler gaxler commented Jan 26, 2021

This PR implements OneHotEncoder for single class per label (#58 (comment))

Encoder supports any Hash+Clone types as labels and prduces RealNumber vectors.

Will wait for feedback before implementing multi-class encoders.

Copy link
Collaborator

@VolodymyrOrlov VolodymyrOrlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @gaxler, thank you for contributing your time to work on #58!

I think this PR is close to be merged into the development branch but I would like to request a couple of additional changes from you if you don't mind.

I'd like to make sure that API of the OneHotEncoder is in alignment with API of the rest of algorithms, as well as with Scikit learn's API.

Take a look at the api module that summarizes interface and functions that are used to manipulate data and perform machine learning in Smartcore. Our API was largely modeled after Scikit's API.

Another good place to look is Scikit's OneHotEncoder.

What we want is to have methods fit and transform that can be used to fit and transform entire dataframe.

This API compartability with Scikit makes easier switching to Smartcore and will let us integrate with Scikit Learn later down the road.

Feel free to add any additional functions that you feel are useful for Smartcore users


impl<'a, LabelType: Hash + Eq + Clone> OneHotEncoder<LabelType> {
/// Fit an encoder to a lable list
pub fn fit(labels: &[LabelType]) -> Self {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please rename this method? Method fit is reserved for an operation that estimates parameters of a transformer from n-dimensional array. Here is a good example of how method fit should look like https://github.com/smartcorelib/smartcore/blob/development/src/decomposition/pca.rs#L119


/// Transform a slice of label types into one-hot vectors
/// None is returned if unknown label is encountered
pub fn transform<U: RealNumber>(&self, labels: &[LabelType]) -> Vec<Option<Vec<U>>> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please rename this method? Method transform is reserved for an operation that 2-dimensional array and transforms every single row using one-hot encoding.
Here is a good example of how method transform should look like https://github.com/smartcorelib/smartcore/blob/development/src/decomposition/pca.rs#L126

@gaxler
Copy link
Collaborator Author

gaxler commented Jan 27, 2021

@VolodymyrOrlov Thank You for taking the time to review this, appreciate the feed-back!

Let me make sure that I fully understand your comments:

We do want to have a consistent api with fit and transform for one-hot encoding, but those must act on entire dataframes as input. My implementation acts on a series (if we stick to the dataframe analogy), so I'll rename those. and make the fit and transform methods to act on a dataframe.

I'm a bit confused about what the dataframe should be in this case. smartcore models act on vectors and matricies of RealNumber traits. I attempted to replicate Scikit's implementation where by and large you can encode any type of category.

The simplest solution I can think of is to make OneHotEncoder support only RealNumber categories and have the user define which ones are categorical.
What do you think?

@VolodymyrOrlov
Copy link
Collaborator

@VolodymyrOrlov Thank You for taking the time to review this, appreciate the feed-back!

Let me make sure that I fully understand your comments:

We do want to have a consistent api with fit and transform for one-hot encoding, but those must act on entire dataframes as input. My implementation acts on a series (if we stick to the dataframe analogy), so I'll rename those. and make the fit and transform methods to act on a dataframe.

I'm a bit confused about what the dataframe should be in this case. smartcore models act on vectors and matricies of RealNumber traits. I attempted to replicate Scikit's implementation where by and large you can encode any type of category.

The simplest solution I can think of is to make OneHotEncoder support only RealNumber categories and have the user define which ones are categorical.
What do you think?

@gaxler This is correct, we'd like to support both, series and dataframes.

I agree with you. OneHotEncoder should one-hot-encode not only floats but also other types, like integers and strings. The problem is that current design is not flexible enough to support all these types.

There is an easy and hard paths, depending on how much time you have to work on this feature.

The easy path is to convert only nominal and ordinal floats. In this case you don't have to refactor Smartcore interfaces. You can either infer categorical columns from data (e.g. by taking a random sample and looking at values in the sample), or take a list of column IDs that should be encoded from user, as you've suggested. In fact, I think we should do both.

The hard path is to introduce two new types: dataframe and series. These types will be modeled after Pandas DataFrame and Series and will encapsulate data as a matrix (or as a vector) and additional metadata, like column names and types. Later we can switch all Smartcore algorithms to use these new types instead of matrices and vectors. Since most algorithms cannot work with categorical data directly we'll have to implement internal converters that will extract data from dataframe and series and transform these values into floats or throw an error, if it is not possible.

Let me know what you think about these options. Also, feel free to connect with me in Discord, if you have an account there. My ID is volodymyr.orlov#7062

@gaxler
Copy link
Collaborator Author

gaxler commented Jan 27, 2021

I think I'll start with the easy path, just to make it usable. Afterwards, would love to work on the new data types.

VolodymyrOrlov
VolodymyrOrlov previously approved these changes Feb 1, 2021
Copy link
Collaborator

@VolodymyrOrlov VolodymyrOrlov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gaxler LGTM! Let me know if you would like me to merge this PR into development or prefer to keep it open for a while.

(changed the order of coping, first do the categorical, than copy ther rest)
@gaxler
Copy link
Collaborator Author

gaxler commented Feb 1, 2021

Thanks @VolodymyrOrlov !
It seems that we are really close to also doing #59
Maybe we can do it in a single PR?

VolodymyrOrlov
VolodymyrOrlov previously approved these changes Feb 5, 2021
@@ -0,0 +1,5 @@
/// Transform a data matrix by replaceing all categorical variables with their one-hot vector equivalents
pub mod categorical_encoder;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The module name is a bit too long. Can you rename it to either categorical or one_hot_encoder?


/// OneHotEncoder Parameters
#[derive(Debug, Clone)]
pub struct OneHotEncoderParams {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind implementing a builder and Default for this struct? Something similar to https://github.com/smartcorelib/smartcore/blob/development/src/decomposition/pca.rs#L98?
Right now you have one parameter but with many parameters instantiation of the struct may be a bit easier to do if you have default + builder.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a bit of a problem. The parameter doesn't have any reasonable defaults. It just indicates what columns of a matrix represent categories

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. In this case please ignore this suggestion.

@VolodymyrOrlov VolodymyrOrlov merged commit 745d0b5 into smartcorelib:development Feb 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants