One hot encoder #76

Merged
merged 35 commits into from
Feb 12, 2021

Commits (35)
9916318
build one-hot encoder
gaxler Jan 26, 2021
dbca6d4
fmt fix
gaxler Jan 26, 2021
139bbae
clippy fixes
gaxler Jan 26, 2021
0df797c
fmt fix
gaxler Jan 26, 2021
7daf536
fixed docs
gaxler Jan 26, 2021
9833a2f
codecov-fix
gaxler Jan 26, 2021
244a724
Generic make_one_hot. Current implementation returns BaseVector of R…
gaxler Jan 27, 2021
19088b6
remove LabelDefinition, looks like unnecessary abstraction for now
gaxler Jan 27, 2021
6109fc5
Renaming fit/transform for API compatibility. Also rename label to ca…
gaxler Jan 27, 2021
408b97d
Rename series encoder and move to separate module file
gaxler Jan 28, 2021
5c400f4
Scaffold for turning floats to hashable and fitting to columns
gaxler Jan 28, 2021
f91b1f9
fit SeriesOneHotEncoders to predefined columns
gaxler Jan 28, 2021
3480e72
Documentation updates
gaxler Jan 31, 2021
3dc8a42
Adapt column numbers to the new columns introduced by categorical var…
gaxler Jan 31, 2021
dd39433
Categorizable trait defines logic of turning floats into hashable cat…
gaxler Jan 31, 2021
cd56110
Fit OneHotEncoder
gaxler Jan 31, 2021
fd6b2e8
Transform matrix
gaxler Jan 31, 2021
c987d39
tests + force Categorizable be RealNumber
gaxler Jan 31, 2021
2f03c1d
module name change
gaxler Jan 31, 2021
ca0816d
Clippy fixes
gaxler Jan 31, 2021
863be5e
style fixes
gaxler Jan 31, 2021
f4b5936
fmt
gaxler Jan 31, 2021
a882741
If transform fails - fail before copying the whole matrix
gaxler Feb 1, 2021
03b9f76
Doc+Naming Improvement
gaxler Feb 1, 2021
228b54b
fmt
gaxler Feb 1, 2021
19ff6df
Separate mapper object
gaxler Feb 3, 2021
d31145b
Define common series encoder behavior
gaxler Feb 3, 2021
237b116
doc update
gaxler Feb 3, 2021
ef06f45
Switch to use SeriesEncoder trait
gaxler Feb 3, 2021
700d320
simplify SeriesEncoder trait
gaxler Feb 3, 2021
3cc20fd
Move all functionality to CategoryMapper (one-hot and ordinal).
gaxler Feb 3, 2021
374dfec
No more SeriesEncoders.
gaxler Feb 3, 2021
828df4e
Use CategoryMapper to transform an iterator. No more passing iterator…
gaxler Feb 3, 2021
af6ec2d
rename categorical
gaxler Feb 10, 2021
6b5bed6
remove old
gaxler Feb 10, 2021
2 changes: 2 additions & 0 deletions src/lib.rs
@@ -91,6 +91,8 @@ pub mod naive_bayes;
/// Supervised neighbors-based learning methods
pub mod neighbors;
pub(crate) mod optimization;
/// Preprocessing utilities
pub mod preprocessing;
/// Support Vector Machines
pub mod svm;
/// Supervised tree-based learning methods
329 changes: 329 additions & 0 deletions src/preprocessing/categorical.rs
@@ -0,0 +1,329 @@
//! # One-hot Encoding For [RealNumber](../../math/num/trait.RealNumber.html) Matrices
//! Transform a data [Matrix](../../linalg/trait.BaseMatrix.html) by replacing all categorical variables with their one-hot equivalents
//!
//! Internally OneHotEncoder treats every categorical column as a series and transforms it using [CategoryMapper](../series_encoder/struct.CategoryMapper.html)
//!
//! ### Usage Example
//! ```
//! use smartcore::linalg::naive::dense_matrix::DenseMatrix;
//! use smartcore::preprocessing::categorical::{OneHotEncoder, OneHotEncoderParams};
//! let data = DenseMatrix::from_2d_array(&[
//! &[1.5, 1.0, 1.5, 3.0],
//! &[1.5, 2.0, 1.5, 4.0],
//! &[1.5, 1.0, 1.5, 5.0],
//! &[1.5, 2.0, 1.5, 6.0],
//! ]);
//! let encoder_params = OneHotEncoderParams::from_cat_idx(&[1, 3]);
//! // Infer number of categories from data and return a reusable encoder
//! let encoder = OneHotEncoder::fit(&data, encoder_params).unwrap();
//! // Transform categorical columns to their one-hot encoding (the fitted encoder can be reused on similar data)
//! let oh_data = encoder.transform(&data).unwrap();
//! // Produces the following:
//! // &[1.5, 1.0, 0.0, 1.5, 1.0, 0.0, 0.0, 0.0]
//! // &[1.5, 0.0, 1.0, 1.5, 0.0, 1.0, 0.0, 0.0]
//! // &[1.5, 1.0, 0.0, 1.5, 0.0, 0.0, 1.0, 0.0]
//! // &[1.5, 0.0, 1.0, 1.5, 0.0, 0.0, 0.0, 1.0]
//! ```
use std::iter;

use crate::error::Failed;
use crate::linalg::Matrix;

use crate::preprocessing::data_traits::{CategoricalFloat, Categorizable};
use crate::preprocessing::series_encoder::CategoryMapper;

/// OneHotEncoder Parameters
#[derive(Debug, Clone)]
pub struct OneHotEncoderParams {
/// Column numbers that contain categorical variables
pub col_idx_categorical: Option<Vec<usize>>,
/// (Currently not implemented) Try to infer which of the matrix columns are categorical variables
infer_categorical: bool,
}

impl OneHotEncoderParams {
/// Generate parameters from categorical variable column numbers
pub fn from_cat_idx(categorical_params: &[usize]) -> Self {
Self {
col_idx_categorical: Some(categorical_params.to_vec()),
infer_categorical: false,
}
}
}

/// Calculate the offset to parameter indices due to the introduction of one-hot encoding
fn find_new_idxs(num_params: usize, cat_sizes: &[usize], cat_idxs: &[usize]) -> Vec<usize> {
// This function uses iterators and returns a vector.
// In case we get a huge number of parameters this might be a problem.
// todo: change this so that it returns an iterator

let cat_idx = cat_idxs.iter().copied().chain((num_params..).take(1));

// The offset is constant between two categorical columns; here we calculate the number of steps
// for which it remains constant
let repeats = cat_idx.scan(0, |a, v| {
let im = v + 1 - *a;
*a = v;
Some(im)
});

// Calculate the offset to parameter idx due to newly introduced one-hot vectors
let offset_ = cat_sizes.iter().scan(0, |a, &v| {
*a = *a + v - 1;
Some(*a)
});
let offset = (0..1).chain(offset_);

let new_param_idxs: Vec<usize> = (0..num_params)
.zip(
repeats
.zip(offset)
.map(|(r, o)| iter::repeat(o).take(r))
.flatten(),
)
.map(|(idx, ofst)| idx + ofst)
.collect();
new_param_idxs
}
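
// Worked example (matching the module doc above): with 4 original columns, categorical
// columns at indices [1, 3] and category counts [2, 4],
// find_new_idxs(4, &[2, 4], &[1, 3]) yields [0, 1, 3, 4]:
// column 0 stays at 0, column 1 starts its 2 one-hot columns at 1,
// column 2 shifts to 3, and column 3 starts its 4 one-hot columns at 4.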

fn validate_col_is_categorical<T: Categorizable>(data: &[T]) -> bool {
for v in data {
if !v.is_valid() {
return false;
}
}
true
}

/// Encode categorical variables of a data matrix to one-hot
#[derive(Debug, Clone)]
pub struct OneHotEncoder {
category_mappers: Vec<CategoryMapper<CategoricalFloat>>,
col_idx_categorical: Vec<usize>,
}

impl OneHotEncoder {
/// Create an encoder instance with categories inferred from the data matrix
pub fn fit<T, M>(data: &M, params: OneHotEncoderParams) -> Result<OneHotEncoder, Failed>
where
T: Categorizable,
M: Matrix<T>,
{
match (params.col_idx_categorical, params.infer_categorical) {
(None, false) => Err(Failed::fit(
"Must pass categorical series ids or infer flag",
)),

(Some(_idxs), true) => Err(Failed::fit(
"Ambigous parameters, got both infer and categroy ids",
)),

(Some(mut idxs), false) => {
// make sure categories have same order as data columns
idxs.sort_unstable();

let (nrows, _) = data.shape();

// col buffer to avoid allocations
let mut col_buf: Vec<T> = iter::repeat(T::zero()).take(nrows).collect();

let mut res: Vec<CategoryMapper<CategoricalFloat>> = Vec::with_capacity(idxs.len());

for &idx in &idxs {
data.copy_col_as_vec(idx, &mut col_buf);
if !validate_col_is_categorical(&col_buf) {
let msg = format!(
"Column {} of data matrix containts non categorizable (integer) values",
idx
);
return Err(Failed::fit(&msg[..]));
}
let hashable_col = col_buf.iter().map(|v| v.to_category());
res.push(CategoryMapper::fit_to_iter(hashable_col));
}

Ok(Self {
category_mappers: res,
col_idx_categorical: idxs,
})
}

(None, true) => {
todo!("Auto-Inference for Categorical Variables not yet implemented")
}
}
}

/// Transform categorical variables to their one-hot encoded form and return a new matrix
pub fn transform<T, M>(&self, x: &M) -> Result<M, Failed>
where
T: Categorizable,
M: Matrix<T>,
{
let (nrows, p) = x.shape();
let additional_params: Vec<usize> = self
.category_mappers
.iter()
.map(|enc| enc.num_categories())
.collect();

// Each category of size v adds v-1 params
let expandws_p: usize = p + additional_params.iter().fold(0, |cs, &v| cs + v - 1);

let new_col_idx = find_new_idxs(p, &additional_params[..], &self.col_idx_categorical[..]);
let mut res = M::zeros(nrows, expandws_p);

for (pidx, &old_cidx) in self.col_idx_categorical.iter().enumerate() {
let cidx = new_col_idx[old_cidx];
let col_iter = (0..nrows).map(|r| x.get(r, old_cidx).to_category());
let sencoder = &self.category_mappers[pidx];
let oh_series = col_iter.map(|c| sencoder.get_one_hot::<T, Vec<T>>(&c));

for (row, oh_vec) in oh_series.enumerate() {
match oh_vec {
None => {
// Since we support generic T types, a bad value in a series causes it to be invalid
let msg = format!("At least one value in column {} doesn't conform to category definition", old_cidx);
return Err(Failed::transform(&msg[..]));
}
Some(v) => {
// copy one-hot vectors to their place in the data matrix
for (col_ofst, &val) in v.iter().enumerate() {
res.set(row, cidx + col_ofst, val);
}
}
}
}
}

// copy old data in x to their new location while skipping categorical vars (already treated)
let mut skip_idx_iter = self.col_idx_categorical.iter();
let mut cur_skip = skip_idx_iter.next();

for (old_p, &new_p) in new_col_idx.iter().enumerate() {
// if we hit an already-treated variable, skip it
if let Some(&v) = cur_skip {
if v == old_p {
cur_skip = skip_idx_iter.next();
continue;
}
}

for r in 0..nrows {
let val = x.get(r, old_p);
res.set(r, new_p, val);
}
}

Ok(res)
}
}

#[cfg(test)]
mod tests {
use super::*;
use crate::linalg::naive::dense_matrix::DenseMatrix;
use crate::preprocessing::series_encoder::CategoryMapper;

#[test]
fn adjust_idxs() {
assert_eq!(find_new_idxs(0, &[], &[]), Vec::<usize>::new());
// [0,1,2] -> [0, 1, 1, 1, 2]
assert_eq!(find_new_idxs(3, &[3], &[1]), vec![0, 1, 4]);
}

fn build_cat_first_and_last() -> (DenseMatrix<f64>, DenseMatrix<f64>) {
let orig = DenseMatrix::from_2d_array(&[
&[1.0, 1.5, 3.0],
&[2.0, 1.5, 4.0],
&[1.0, 1.5, 5.0],
&[2.0, 1.5, 6.0],
]);

let oh_enc = DenseMatrix::from_2d_array(&[
&[1.0, 0.0, 1.5, 1.0, 0.0, 0.0, 0.0],
&[0.0, 1.0, 1.5, 0.0, 1.0, 0.0, 0.0],
&[1.0, 0.0, 1.5, 0.0, 0.0, 1.0, 0.0],
&[0.0, 1.0, 1.5, 0.0, 0.0, 0.0, 1.0],
]);

(orig, oh_enc)
}

fn build_fake_matrix() -> (DenseMatrix<f64>, DenseMatrix<f64>) {
// Categorical first and last
let orig = DenseMatrix::from_2d_array(&[
&[1.5, 1.0, 1.5, 3.0],
&[1.5, 2.0, 1.5, 4.0],
&[1.5, 1.0, 1.5, 5.0],
&[1.5, 2.0, 1.5, 6.0],
]);

let oh_enc = DenseMatrix::from_2d_array(&[
&[1.5, 1.0, 0.0, 1.5, 1.0, 0.0, 0.0, 0.0],
&[1.5, 0.0, 1.0, 1.5, 0.0, 1.0, 0.0, 0.0],
&[1.5, 1.0, 0.0, 1.5, 0.0, 0.0, 1.0, 0.0],
&[1.5, 0.0, 1.0, 1.5, 0.0, 0.0, 0.0, 1.0],
]);

(orig, oh_enc)
}

#[test]
fn hash_encode_f64_series() {
let series = vec![3.0, 1.0, 2.0, 1.0];
let hashable_series: Vec<CategoricalFloat> =
series.iter().map(|v| v.to_category()).collect();
let enc = CategoryMapper::from_positional_category_vec(hashable_series);
let inv = enc.invert_one_hot(vec![0.0, 0.0, 1.0]);
let orig_val: f64 = inv.unwrap().into();
assert_eq!(orig_val, 2.0);
}
#[test]
fn test_fit() {
let (x, _) = build_fake_matrix();
let params = OneHotEncoderParams::from_cat_idx(&[1, 3]);
let oh_enc = OneHotEncoder::fit(&x, params).unwrap();
assert_eq!(oh_enc.category_mappers.len(), 2);

let num_cat: Vec<usize> = oh_enc
.category_mappers
.iter()
.map(|a| a.num_categories())
.collect();
assert_eq!(num_cat, vec![2, 4]);
}

#[test]
fn matrix_transform_test() {
let (x, expected_x) = build_fake_matrix();
let params = OneHotEncoderParams::from_cat_idx(&[1, 3]);
let oh_enc = OneHotEncoder::fit(&x, params).unwrap();
let nm = oh_enc.transform(&x).unwrap();
assert_eq!(nm, expected_x);

let (x, expected_x) = build_cat_first_and_last();
let params = OneHotEncoderParams::from_cat_idx(&[0, 2]);
let oh_enc = OneHotEncoder::fit(&x, params).unwrap();
let nm = oh_enc.transform(&x).unwrap();
assert_eq!(nm, expected_x);
}

#[test]
fn fail_on_bad_category() {
let m = DenseMatrix::from_2d_array(&[
&[1.0, 1.5, 3.0],
&[2.0, 1.5, 4.0],
&[1.0, 1.5, 5.0],
&[2.0, 1.5, 6.0],
]);

let params = OneHotEncoderParams::from_cat_idx(&[1]);
match OneHotEncoder::fit(&m, params) {
Err(_) => {
assert!(true);
}
_ => assert!(false),
}
}
}
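
As the module docs note, the fitted encoder is reusable on other matrices with the same column layout. A minimal sketch of that reuse, using the same training data as the doc example above (the 9.0 values and the single-row new_data matrix are made up for illustration):

use smartcore::linalg::naive::dense_matrix::DenseMatrix;
use smartcore::preprocessing::categorical::{OneHotEncoder, OneHotEncoderParams};

let train = DenseMatrix::from_2d_array(&[
    &[1.5, 1.0, 1.5, 3.0],
    &[1.5, 2.0, 1.5, 4.0],
    &[1.5, 1.0, 1.5, 5.0],
    &[1.5, 2.0, 1.5, 6.0],
]);
let encoder = OneHotEncoder::fit(&train, OneHotEncoderParams::from_cat_idx(&[1, 3])).unwrap();

// New rows that reuse categories seen during fit: {1.0, 2.0} in column 1, {3.0..6.0} in column 3.
let new_data = DenseMatrix::from_2d_array(&[&[9.0, 2.0, 9.0, 3.0]]);
let encoded = encoder.transform(&new_data).unwrap();
// encoded is 1 x 8: [9.0, 0.0, 1.0, 9.0, 1.0, 0.0, 0.0, 0.0]
// A category unseen during fit (e.g. 7.0 in column 3) would make transform return an Err.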
43 changes: 43 additions & 0 deletions src/preprocessing/data_traits.rs
@@ -0,0 +1,43 @@
//! Traits to indicate that float variables can be viewed as categorical
//! This module assumes that a categorical float is (within a small error margin) a whole number that fits into a `u16`

use crate::math::num::RealNumber;

pub type CategoricalFloat = u16;

// pub struct CategoricalFloat(u16);
const ERROR_MARGIN: f64 = 0.001;

pub trait Categorizable: RealNumber {
type A;

fn to_category(self) -> CategoricalFloat;

fn is_valid(self) -> bool;
}

impl Categorizable for f32 {
type A = CategoricalFloat;

fn to_category(self) -> CategoricalFloat {
self as CategoricalFloat
}

fn is_valid(self) -> bool {
let a = self.to_category();
(a as f32 - self).abs() < (ERROR_MARGIN as f32)
}
}

impl Categorizable for f64 {
type A = CategoricalFloat;

fn to_category(self) -> CategoricalFloat {
self as CategoricalFloat
}

fn is_valid(self) -> bool {
let a = self.to_category();
(a as f64 - self).abs() < ERROR_MARGIN
}
}
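
A small sketch of how the Categorizable impls above behave (illustrative only; the data_traits module is crate-private, so these calls are only available inside the crate):

use crate::preprocessing::data_traits::Categorizable;

assert_eq!((2.0_f64).to_category(), 2u16); // whole-number floats map to u16 categories
assert!((2.0_f64).is_valid());             // within ERROR_MARGIN of an integer
assert!(!(1.5_f64).is_valid());            // 0.5 away from its truncated category 1, so rejected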
5 changes: 5 additions & 0 deletions src/preprocessing/mod.rs
@@ -0,0 +1,5 @@
/// Transform a data matrix by replacing all categorical variables with their one-hot vector equivalents
pub mod categorical;
mod data_traits;
/// Encode a series (column, array) of categorical variables as one-hot vectors
pub mod series_encoder;