One hot encoder #76

gaxler · 2021-01-26T18:00:52Z

This PR implements OneHotEncoder for single class per label (#58 (comment))

The encoder supports any Hash+Clone type as a label and produces RealNumber vectors.

Will wait for feedback before implementing multi-class encoders.
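For illustration, a minimal sketch of what a single-class-per-label encoder over hashable labels could look like. This is not the code from this PR: `SeriesOneHot` and its method names are hypothetical, and plain `f64` stands in for Smartcore's `RealNumber` so the snippet stays self-contained.

```rust
use std::collections::HashMap;
use std::hash::Hash;

/// Hypothetical series-level encoder: maps each distinct label to an index.
pub struct SeriesOneHot<L: Hash + Eq + Clone> {
    index_of: HashMap<L, usize>,
}

impl<L: Hash + Eq + Clone> SeriesOneHot<L> {
    /// Assign indices to labels in order of first appearance.
    pub fn from_labels(labels: &[L]) -> Self {
        let mut index_of: HashMap<L, usize> = HashMap::new();
        for label in labels {
            let next = index_of.len();
            // Only inserts when the label has not been seen before.
            index_of.entry(label.clone()).or_insert(next);
        }
        SeriesOneHot { index_of }
    }

    /// Number of distinct labels seen.
    pub fn num_categories(&self) -> usize {
        self.index_of.len()
    }

    /// One-hot vector for a single label; `None` for an unknown label.
    pub fn encode_one(&self, label: &L) -> Option<Vec<f64>> {
        self.index_of.get(label).map(|&i| {
            let mut v = vec![0.0; self.index_of.len()];
            v[i] = 1.0;
            v
        })
    }

    /// Encode a whole slice of labels, one `Option` per input label.
    pub fn encode(&self, labels: &[L]) -> Vec<Option<Vec<f64>>> {
        labels.iter().map(|l| self.encode_one(l)).collect()
    }
}
```

Returning `Option` per label mirrors the behavior described in the diff below, where an unknown label yields `None` instead of a panic.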

VolodymyrOrlov

Hi @gaxler, thank you for contributing your time to work on #58!

I think this PR is close to being merged into the development branch, but I would like to request a couple of additional changes from you, if you don't mind.

I'd like to make sure that the API of the OneHotEncoder is aligned with the API of the rest of the algorithms, as well as with Scikit-learn's API.

Take a look at the api module, which summarizes the interface and functions used to manipulate data and perform machine learning in Smartcore. Our API was largely modeled after Scikit's API.

Another good place to look is Scikit's OneHotEncoder.

What we want is to have fit and transform methods that can be used to fit and transform an entire dataframe.

This API compatibility with Scikit makes switching to Smartcore easier and will let us integrate with Scikit Learn later down the road.

Feel free to add any additional functions that you feel are useful for Smartcore users.

VolodymyrOrlov · 2021-01-27T01:13:37Z

src/preprocessing/target_encoders.rs

+
+impl<'a, LabelType: Hash + Eq + Clone> OneHotEncoder<LabelType> {
+    /// Fit an encoder to a lable list
+    pub fn fit(labels: &[LabelType]) -> Self {


Can you please rename this method? The name fit is reserved for an operation that estimates the parameters of a transformer from an n-dimensional array. Here is a good example of what a fit method should look like: https://github.com/smartcorelib/smartcore/blob/development/src/decomposition/pca.rs#L119

VolodymyrOrlov · 2021-01-27T01:18:14Z

src/preprocessing/target_encoders.rs

+
+    /// Transform a slice of label types into one-hot vectors
+    /// None is returned if unknown label is encountered
+    pub fn transform<U: RealNumber>(&self, labels: &[LabelType]) -> Vec<Option<Vec<U>>> {


Can you please rename this method? The name transform is reserved for an operation that takes a 2-dimensional array and transforms every single row using one-hot encoding.
Here is a good example of what a transform method should look like: https://github.com/smartcorelib/smartcore/blob/development/src/decomposition/pca.rs#L126

gaxler · 2021-01-27T17:48:18Z

@VolodymyrOrlov Thank you for taking the time to review this, I appreciate the feedback!

Let me make sure that I fully understand your comments:

We do want a consistent API with fit and transform for one-hot encoding, but those must act on entire dataframes as input. My implementation acts on a series (if we stick to the dataframe analogy), so I'll rename those and make the fit and transform methods act on a dataframe.

I'm a bit confused about what the dataframe should be in this case. Smartcore models act on vectors and matrices of types that implement the RealNumber trait. I attempted to replicate Scikit's implementation, where by and large you can encode any type of category.

The simplest solution I can think of is to make OneHotEncoder support only RealNumber categories and have the user define which ones are categorical.
What do you think?

VolodymyrOrlov · 2021-01-27T21:12:47Z

@gaxler This is correct, we'd like to support both series and dataframes.

I agree with you. OneHotEncoder should one-hot-encode not only floats but also other types, like integers and strings. The problem is that the current design is not flexible enough to support all these types.

There is an easy path and a hard path, depending on how much time you have to work on this feature.

The easy path is to convert only nominal and ordinal floats. In this case you don't have to refactor Smartcore interfaces. You can either infer categorical columns from the data (e.g. by taking a random sample and looking at the values in the sample), or take a list of column IDs that should be encoded from the user, as you've suggested. In fact, I think we should do both.
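A rough sketch of the easy path with user-supplied column IDs might look like the following. Everything here is an assumption made for illustration: `MatrixOneHot` and `to_category` are hypothetical names, the matrix is a plain row-major `Vec<Vec<f64>>` rather than a Smartcore matrix type, and only non-negative whole floats are accepted as categories.

```rust
use std::collections::HashMap;

/// Treat a float as a category only if it is a non-negative whole number.
/// (Assumed rule for this sketch.)
fn to_category(x: f64) -> Option<u64> {
    if x.is_finite() && x >= 0.0 && x.fract() == 0.0 {
        Some(x as u64)
    } else {
        None
    }
}

/// Hypothetical matrix-level encoder over user-chosen columns.
pub struct MatrixOneHot {
    /// For each categorical column: (column index, category -> position).
    columns: Vec<(usize, HashMap<u64, usize>)>,
}

impl MatrixOneHot {
    /// Learn the categories of every column listed in `cat_cols`.
    /// Returns `None` if a listed column is missing or holds a value
    /// that is not a valid category.
    pub fn fit(data: &[Vec<f64>], cat_cols: &[usize]) -> Option<Self> {
        let mut columns = Vec::new();
        for &c in cat_cols {
            let mut cats: Vec<u64> = Vec::new();
            for row in data {
                cats.push(to_category(*row.get(c)?)?);
            }
            cats.sort_unstable();
            cats.dedup();
            // Positions follow the sorted order of the categories.
            let map: HashMap<u64, usize> =
                cats.into_iter().enumerate().map(|(i, k)| (k, i)).collect();
            columns.push((c, map));
        }
        Some(MatrixOneHot { columns })
    }

    /// Replace each categorical column with its one-hot block, row by row.
    /// Returns `None` if a row contains a category not seen during `fit`.
    pub fn transform(&self, data: &[Vec<f64>]) -> Option<Vec<Vec<f64>>> {
        data.iter()
            .map(|row| {
                let mut out: Vec<f64> = Vec::new();
                for (j, &x) in row.iter().enumerate() {
                    match self.columns.iter().find(|(c, _)| *c == j) {
                        Some((_, map)) => {
                            let pos = *map.get(&to_category(x)?)?;
                            let start = out.len();
                            out.resize(start + map.len(), 0.0);
                            out[start + pos] = 1.0;
                        }
                        // Non-categorical columns are copied unchanged.
                        None => out.push(x),
                    }
                }
                Some(out)
            })
            .collect()
    }
}
```

Inferring categorical columns from a sample, the other option mentioned above, could reuse `to_category` by checking whether every sampled value in a column maps to `Some`.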

The hard path is to introduce two new types: dataframe and series. These types will be modeled after Pandas DataFrame and Series and will encapsulate data as a matrix (or as a vector) together with additional metadata, like column names and types. Later we can switch all Smartcore algorithms to use these new types instead of matrices and vectors. Since most algorithms cannot work with categorical data directly, we'll have to implement internal converters that extract data from a dataframe or series and either transform the values into floats or throw an error if that is not possible.
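To make the hard path a little more concrete, here is one possible shape for such types. This is only a sketch under assumptions: the `Series` and `DataFrame` definitions below are hypothetical, they do not exist in Smartcore, and errors are plain `String`s for brevity.

```rust
/// Hypothetical column storage: each series keeps its own element type.
#[derive(Debug, Clone, PartialEq)]
pub enum Series {
    Float(Vec<f64>),
    Int(Vec<i64>),
    Text(Vec<String>),
}

/// Hypothetical dataframe: an ordered list of named, typed columns.
/// (Equal column lengths are assumed, not checked, in this sketch.)
#[derive(Debug, Clone, Default)]
pub struct DataFrame {
    columns: Vec<(String, Series)>,
}

impl Series {
    /// Convert to floats for algorithms that only accept numeric input.
    /// Text has no numeric meaning here, so it is reported as an error.
    pub fn to_floats(&self) -> Result<Vec<f64>, String> {
        match self {
            Series::Float(v) => Ok(v.clone()),
            Series::Int(v) => Ok(v.iter().map(|&x| x as f64).collect()),
            Series::Text(_) => Err("text series must be encoded first".to_string()),
        }
    }
}

impl DataFrame {
    /// Append a named column and return the frame for chaining.
    pub fn with_column(mut self, name: &str, series: Series) -> Self {
        self.columns.push((name.to_string(), series));
        self
    }

    /// Extract all columns as floats, or name the first offending column.
    pub fn to_float_columns(&self) -> Result<Vec<Vec<f64>>, String> {
        self.columns
            .iter()
            .map(|(name, s)| s.to_floats().map_err(|e| format!("{}: {}", name, e)))
            .collect()
    }
}
```

The internal converter is `to_float_columns`: numeric columns pass through, while a text column produces an error until it has been encoded.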

Let me know what you think about these options. Also, feel free to connect with me on Discord, if you have an account there. My ID is volodymyr.orlov#7062.

gaxler · 2021-01-27T23:00:15Z

I think I'll start with the easy path, just to make it usable. Afterwards, would love to work on the new data types.

VolodymyrOrlov

@gaxler LGTM! Let me know if you would like me to merge this PR into development or prefer to keep it open for a while.

gaxler · 2021-02-01T19:30:54Z

Thanks @VolodymyrOrlov!
It seems that we are really close to also doing #59.
Maybe we can do it in a single PR?

VolodymyrOrlov · 2021-02-05T21:52:39Z

src/preprocessing/mod.rs

@@ -0,0 +1,5 @@
+/// Transform a data matrix by replaceing all categorical variables with their one-hot vector equivalents
+pub mod categorical_encoder;


The module name is a bit too long. Can you rename it to either categorical or one_hot_encoder?

VolodymyrOrlov · 2021-02-05T21:57:30Z

src/preprocessing/categorical_encoder.rs

+
+/// OneHotEncoder Parameters
+#[derive(Debug, Clone)]
+pub struct OneHotEncoderParams {


Would you mind implementing a builder and Default for this struct? Something similar to https://github.com/smartcorelib/smartcore/blob/development/src/decomposition/pca.rs#L98?
Right now you have only one parameter, but once there are many, instantiating the struct will be easier with a Default implementation and a builder.
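As a sketch of the pattern being requested (not the PR's actual struct), a parameter type with `Default` and chained builder methods could look like this. `EncoderParams` and both of its fields are hypothetical and chosen only to show the shape.

```rust
/// Hypothetical parameter struct; the fields are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct EncoderParams {
    /// Indices of the columns to treat as categorical.
    pub col_idx_categorical: Vec<usize>,
    /// Whether to infer categorical columns from the data.
    pub infer_categorical: bool,
}

impl Default for EncoderParams {
    fn default() -> Self {
        EncoderParams {
            col_idx_categorical: Vec::new(),
            infer_categorical: true,
        }
    }
}

impl EncoderParams {
    /// Builder-style setter: consumes `self` and returns the updated value.
    pub fn with_categorical_columns(mut self, cols: &[usize]) -> Self {
        self.col_idx_categorical = cols.to_vec();
        self
    }

    /// Builder-style setter for the inference flag.
    pub fn with_infer_categorical(mut self, infer: bool) -> Self {
        self.infer_categorical = infer;
        self
    }
}
```

A caller would then write something like `EncoderParams::default().with_categorical_columns(&[1, 3])` and leave every other field at its default.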

gaxler · 2021-02-07

This will be a bit of a problem. The parameter doesn't have any reasonable defaults. It just indicates which columns of a matrix represent categories.

VolodymyrOrlov · 2021-02-10

I see. In this case please ignore this suggestion.

gaxler added 6 commits January 25, 2021 23:33

build one-hot encoder

9916318

fmt fix

dbca6d4

cliipy fixes

139bbae

fmt fix

0df797c

fixed docs

7daf536

codecov-fix

9833a2f

VolodymyrOrlov requested changes Jan 27, 2021

morenol reviewed Jan 27, 2021

gaxler added 3 commits January 27, 2021 12:03

Genertic make_one_hot. Current implementation returns BaseVector of R…

244a724

…ealNumber

remoe LabelDefinition, looks like unnecesery abstraction for now

19088b6

Renaming fit/transform for API compatibility. Also rename label to ca…

6109fc5

…tegory.

gaxler added 13 commits January 27, 2021 19:31

Rename series encoder and move to separate module file

408b97d

Scaffold for turniing floats to hashable and fittinng to columns

5c400f4

fit SeriesOneHotEncoders to predefined columns

f91b1f9

Documentation updates

3480e72

Adapt column numbers to the new columns introduced by categorical var…

3dc8a42

…iables.

Categorizable trait defines logic of turning floats into hashable cat…

dd39433

…egorical variables. Since we only support RealNumbers for now, the idea is to treat round numbers as ordinal (or nominal if user chooses to ignore order) categories.

Fit OneHotEncoder

cd56110

Transform matrix

fd6b2e8

tests + force Categorizable be RealNumber

c987d39

module name change

2f03c1d

Clippy fixes

ca0816d

style fixes

863be5e

fmt

f4b5936

VolodymyrOrlov previously approved these changes Feb 1, 2021

gaxler added 2 commits February 1, 2021 11:20

If transform fails - fail before copying the whole matrix

a882741

(changed the order of coping, first do the categorical, than copy ther rest)

Doc+Naming Improvement

03b9f76

fmt

228b54b

gaxler added 8 commits February 2, 2021 17:40

Separate mapper object

19ff6df

Define common series encoder behavior

d31145b

doc update

237b116

Switch to use SeriesEncoder trait

ef06f45

simplify SeriesEncoder trait

700d320

Move all functionality to CategoryMapper (one-hot and ordinal).

3cc20fd

No more SeriesEncoders.

374dfec

Use CategoryMapper to transform an iterator. No more passing iterator…

828df4e

… to SeriesEncoders

gaxler dismissed VolodymyrOrlov’s stale review via 828df4e February 5, 2021 02:32

VolodymyrOrlov mentioned this pull request Feb 5, 2021

No implementation of Display for Dataset #77

Open

VolodymyrOrlov previously approved these changes Feb 5, 2021

gaxler added 2 commits February 9, 2021 22:01

rename categorical

af6ec2d

remove old

6b5bed6

gaxler dismissed VolodymyrOrlov’s stale review via 6b5bed6 February 10, 2021 06:02

VolodymyrOrlov approved these changes Feb 12, 2021

VolodymyrOrlov merged commit 745d0b5 into smartcorelib:development Feb 12, 2021
