Releases: rust-ml/linfa
Release 0.6.0
Linfa's 0.6.0 release removes the mandatory dependency on external BLAS libraries (such as intel-mkl) by using a pure-Rust linear algebra library. It also adds the Multinomial Naive Bayes and Follow The Regularized Leader algorithms. Additionally, the AsTargets trait has been separated into AsSingleTargets and AsMultiTargets.
No more BLAS
With older versions of Linfa, algorithm crates that used advanced linear algebra routines needed to be linked against an external BLAS library such as Intel MKL. This was done by adding feature flags like linfa/intel-mkl-static to the build, and it increased compile times significantly. Version 0.6.0 replaces the BLAS library with a pure-Rust implementation of all the required routines, which Linfa uses by default. This means all Linfa crates now build properly and quickly without any extra feature flags. It is still possible for the affected algorithm crates to link against an external BLAS library. Doing so requires enabling the crate's blas feature, along with the feature flag for the external BLAS library (see the sketch after the list below). The affected crates are as follows:
- linfa-ica
- linfa-reduction
- linfa-clustering
- linfa-preprocessing
- linfa-pls
- linfa-linear
- linfa-elasticnet
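For example, a minimal Cargo.toml sketch that opts linfa-linear back into BLAS, backed by a statically linked Intel MKL; the version numbers are placeholders:
# enable the crate's blas feature plus a BLAS backend feature on linfa itself
[dependencies]
linfa = { version = "0.6.0", features = ["intel-mkl-static"] }
linfa-linear = { version = "0.6.0", features = ["blas"] }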
New algorithms
Multinomial Naive Bayes is a family of Naive Bayes classifiers that assume independence between variables. Its advantage is a linear fitting time, with maximum-likelihood training in closed form. The algorithm has been added to linfa-bayes and an example can be found at linfa-bayes/examples/winequality_multinomial.rs.
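A minimal sketch of fitting the new model, assuming the API mirrors the existing GaussianNb workflow and using the winequality dataset from linfa-datasets:
use linfa::prelude::*;
use linfa_bayes::MultinomialNb;

// binarize the wine quality scores and split into train/validation sets
let (train, valid) = linfa_datasets::winequality()
    .map_targets(|x| if *x > 6 { "good" } else { "bad" })
    .split_with_ratio(0.9);

// fit the Multinomial Naive Bayes model and evaluate it on the validation set
let model = MultinomialNb::params().fit(&train)?;
let pred = model.predict(&valid);
println!("accuracy: {}", pred.confusion_matrix(&valid)?.accuracy());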
Follow The Regularized Leader (FTRL) is a linear model for CTR prediction in online learning settings. It is a special type of linear model with sigmoid function which uses L1 and L2 regularization. The algorithm is contained in the newly-added linfa-ftrl
crate, and an example can be found at linfa-ftrl/examples/winequality.rs.
Distinguish between single and multi-target
Version 0.6.0 introduces a major change to the AsTargets trait, which is now split into AsSingleTargets and AsMultiTargets. Additionally, the Dataset* types are parametrized by target dimensionality, instead of always using a 2D array. Furthermore, algorithms that work on single-target data no longer accept multi-target datasets as input. This change may cause build errors in existing code that calls the affected algorithms. The fix is as simple as adding Ix1 to the end of the type parameters for the dataset being passed in, which forces the dataset to be single-target.
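A minimal sketch of the fix, assuming a small hand-made dataset with one target value per sample:
use linfa::Dataset;
use ndarray::{array, Ix1};

let records = array![[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]];
let targets = array![0usize, 1, 0];
// the explicit Ix1 parameter marks the dataset as single-target,
// so it is accepted by single-target algorithms
let dataset: Dataset<f64, usize, Ix1> = Dataset::new(records, targets);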
Improvements
- Remove SeedableRng trait bound from KMeans and GaussianMixture.
- Replace uses of Isaac RNG with Xoshiro RNG.
- cross_validate changed to cross_validate_single, which is for single-target data; cross_validate_multi changed to cross_validate, which is for both single and multi-target datasets.
- The probability type Pr has been constrained to 0. <= prob <= 1. Also, the simple Pr(x) constructor has been replaced by Pr::new(x), Pr::new_unchecked(x), and Pr::try_from(x), which ensure that the invariant for Pr is met (see the sketch below).
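A minimal sketch of the new constructors, assuming Pr is in scope via linfa's prelude:
use linfa::prelude::*;

// checked constructor for a value known to lie in [0, 1]
let p = Pr::new(0.75);
// fallible constructor for values that may violate the invariant
let q = Pr::try_from(0.5)?;
// skip the check entirely when the caller guarantees validity
let r = Pr::new_unchecked(1.0);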
Release 0.5.1
Linfa's 0.5.1 release fixes errors and bugs in the previous release, as well as removing unnecessary trait bounds on the Dataset type. Note that the commits for this release are located in the 0-5-1 branch of the GitHub repo.
Improvements
- remove Float trait bound from many Dataset impls, making non-float datasets usable
- fix build errors in 0.5.0 caused by breaking minor releases from dependencies
- fix bug in k-means where the termination condition of the algorithm was calculated incorrectly
- fix build failure when building linfa alone, caused by incorrect feature selection for ndarray
Release 0.5.0
Linfa's 0.5.0 release adds initial support for the OPTICS algorithm, multinomial logistic regression, and the family of nearest neighbor algorithms. Furthermore, we have improved documentation and introduced hyperparameter checking to all algorithms.
New algorithms
OPTICS is an algorithm for finding density-based clusters. It can produce reachability plots, from which a hierarchical structure of clusters can be derived, and it allows data to be analysed without prior assumptions about its distribution. The algorithm has been added to linfa-clustering and an example can be found at linfa-clustering/examples/optics.rs.
Extending logistic regression to the multinomial distribution generalizes it to multiclass problems. This release adds support for multinomial logistic regression to linfa-logistic; you can experiment with the example at linfa-logistic/examples/winequality_multi.rs.
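A minimal sketch of the multiclass model, assuming MultiLogisticRegression follows the same builder pattern as the binary LogisticRegression:
use linfa::prelude::*;
use linfa_logistic::MultiLogisticRegression;

// fit a multinomial logistic regression on the multi-class wine quality dataset
let (train, valid) = linfa_datasets::winequality().split_with_ratio(0.9);
let model = MultiLogisticRegression::default()
    .max_iterations(150)
    .fit(&train)?;
let pred = model.predict(&valid);
println!("accuracy: {}", pred.confusion_matrix(&valid)?.accuracy());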
Nearest neighbor search finds the set of points in the neighborhood of a given sample. It appears in numerous applications as a distance-metric provider (e.g. in clustering). This release adds a family of nearest neighbor algorithms, namely Ball tree, K-d tree and naive linear search. You can find an example in the next section.
Improvements
- use least-square solver from ndarray-linalg in linfa-linear
- make clustering algorithms generic over distance metrics
- bump ndarray to 0.15
- introduce ParamGuard trait for explicit and implicit parameter checking (read more in the CONTRIBUTE.md, and see the sketch below)
- improve documentation in various places
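A minimal sketch of the two checking styles on KMeans hyperparameters; the dataset variable is assumed to be an existing unsupervised dataset:
use linfa::prelude::*;
use linfa_clustering::KMeans;

// explicit checking: validate the hyperparameters up front and get a checked parameter set
let checked_params = KMeans::params(3).tolerance(1e-5).check()?;
// implicit checking: calling fit() on the unchecked parameters validates them internally
let model = KMeans::params(3).tolerance(1e-5).fit(&dataset)?;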
Nearest Neighbors
You can now choose from a growing list of NN implementations. The family provides efficient distance metrics to algorithms such as KMeans and DBSCAN. The example shows how to use a K-d tree nearest neighbor index to find all the points in a set of observations that are within a certain range of a candidate point.
You can query nearest points explicitly:
// create a KDTree index consisting of all the points in the observations, using Euclidean distance
let kdtree = CommonNearestNeighbour::KdTree.from_batch(observations, L2Dist)?;
let candidate = observations.row(2);
let points = kdtree.within_range(candidate.view(), range)?;
Or use one of the distance metrics implicitly, here demonstrated for KMeans:
use linfa_nn::distance::LInfDist;
let model = KMeans::params_with(3, rng, LInfDist)
.max_n_iterations(200)
.tolerance(1e-5)
.fit(&dataset)?;
Release 0.4.0
Linfa's 0.4.0 release introduces four new algorithms, improves documentation of the ICA and K-means implementations, adds more benchmarks to K-Means and updates to ndarray's 0.14 version.
New algorithms
The Partial Least Squares Regression model family is added in this release (thanks to @relf). It projects the observed as well as the predicted variables into a latent space and maximizes the correlation between them. For problems with a large number of targets or collinear predictors it performs better than standard regression. For more information, look into the documentation of linfa-pls.
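A minimal sketch, assuming the two-component regression variant and the multi-target linnerud dataset shipped with linfa-datasets:
use linfa::prelude::*;
use linfa_pls::PlsRegression;

// fit a two-component PLS regression on the multi-target linnerud dataset
let dataset = linfa_datasets::linnerud();
let pls = PlsRegression::params(2).fit(&dataset)?;
// predict the targets from the latent-space projection
let predictions = pls.predict(&dataset);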
A wrapper for Barnes-Hut t-SNE is also added in this release. The t-SNE algorithm is often used for data visualization: it projects data from a high-dimensional space to a similar representation in two or three dimensions. It does so by minimizing the Kullback-Leibler divergence between the high-dimensional source distribution and the low-dimensional target distribution. The Barnes-Hut approximation improves the runtime drastically while retaining good embedding quality. Kudos to github/frjnn for providing an implementation!
A new preprocessing crate makes working with textual data and data normalization easy (thanks to @Sauro98). It implements a count vectorizer and TF-IDF normalization for text preprocessing. Normalizations for signals include linear scaling, norm scaling and whitening with PCA/ZCA/Cholesky. An example with a Naive Bayes model achieves an 84% F1 score for predicting the categories alt.atheism, talk.religion.misc, comp.graphics and sci.space on a news dataset.
Platt scaling calibrates a real-valued classification model to probabilities over two classes. This is used for SV classification when probabilities are required. Furthermore, a multi-class model, which combines multiple binary models (e.g. calibrated SVM models) into a single multi-class model, is also added. These composing models live in the linfa/src/composing/ subfolder.
Improvements
Numerous improvements have been made to the KMeans implementation, thanks to @YuhanLiin. The implementation is optimized for offline training, an incremental training model has been added, and KMeans++/KMeans|| initialization gives good initial cluster means for medium and large datasets.
We also moved to ndarray version 0.14 and introduced F::cast for simpler floating point casting. The trait signature of linfa::Fit has changed such that it always returns a Result, and error handling has been added to the linfa-logistic and linfa-reduction subcrates.
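A minimal sketch of F::cast inside a function that is generic over Linfa's Float trait:
use linfa::Float;

// scale a generic float by a constant literal using F::cast
fn half<F: Float>(x: F) -> F {
    x * F::cast(0.5)
}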
You often have to compare several model parametrizations with k-folding. For this, a new function cross_validate is added which takes the number of folds, the model parameters and a closure for the evaluation metric. It automatically performs k-folding and averages the metric over the folds. To compare different L1 ratios of an elasticnet model, you can use it in the following way:
// L1 ratios to compare
let ratios = vec![0.1, 0.2, 0.5, 0.7, 1.0];
// create a model for each parameter
let models = ratios
.iter()
.map(|ratio| ElasticNet::params().penalty(0.3).l1_ratio(*ratio))
.collect::<Vec<_>>();
// get the mean r2 validation score across 5 folds for each model
let r2_values =
dataset.cross_validate(5, &models, |prediction, truth| prediction.r2(&truth))?;
// show the mean r2 score for each parameter choice
for (ratio, r2) in ratios.iter().zip(r2_values.iter()) {
println!("L1 ratio: {}, r2 score: {}", ratio, r2);
}
Other changes
- fix for border points in the DBSCAN implementation
- improved documentation of the ICA subcrate
- prevent overflowing code example in website
Release 0.3.1
In this release of Linfa the documentation is extended, new examples are added and the functionality of datasets is improved. No new algorithms were added.
The meta-issue #82 gives a good overview of the necessary documentation improvements; testing, documentation and examples were considerably extended in this release.
Further new functionality was added to datasets, and multi-target datasets are introduced. Bootstrapping is now possible for features and samples, and you can cross-validate your model with k-folding. We polished various bits in the kernel machines and simplified the interface there.
The trait structure of the regression metrics is simplified and the silhouette score is introduced for easier testing of K-Means and other algorithms.
Changes
- improve documentation in all algorithms, various commits
- add a website to the infrastructure (c8acc78)
- add k-folding with and without copying (b0af805)
- add feature naming and pearson's cross correlation (7198962)
- improve ergonomics when handling kernels (1a7982b)
- improve TikZ generator in linfa-trees (9d71f60)
- introduce multi-target datasets (b231118)
- simplify regression metrics and add cluster metrics (d0363a1)
Example
You can now perform cross-validation with k-folding. @Sauro98 actually implemented two versions, one which copies the dataset into k folds and one which avoids excessive memory operations by copying only the validation dataset around. For example, to test a model with 8-fold cross-validation:
// perform cross-validation with the F1 score
let f1_runs = dataset
.iter_fold(8, |v| params.fit(&v).unwrap())
.map(|(model, valid)| {
let cm = model
.predict(&valid)
.mapv(|x| x > Pr::even())
.confusion_matrix(&valid).unwrap();
cm.f1_score()
})
.collect::<Array1<_>>();
// calculate mean and standard deviation
println!("F1 score: {}±{}",
f1_runs.mean().unwrap(),
f1_runs.std_axis(Axis(0), 0.0),
);
Release 0.3.0
New algorithms
- Approximated DBSCAN has been added to linfa-clustering by [@Sauro98]
- Gaussian Naive Bayes has been added to linfa-bayes by [@VasanthakumarV]
- Elastic Net linear regression has been added to linfa-elasticnet by [@paulkoerbitz] and [@bytesnake]
Changes
- Added benchmark to gaussian mixture models (a3eede5)
- Fixed bugs in linear decision trees, added generator for TikZ trees (bfa5aeb)
- Implemented serde for all crates behind feature flag (4f0b63b)
- Implemented new backend features (7296c9e)
- Introduced linfa-datasets for easier testing (3cec12b)
- Renamed Dataset to DatasetBase and introduced Dataset and DatasetView (21dd579)
- Improved kernel tests and documentation (8e81a6d)
Example
The following section shows a small example of how datasets interact with the training and testing of a linear decision tree.
You can load a dataset, shuffle it and then split it into training and validation sets:
// initialize pseudo random number generator with seed 42
let mut rng = Isaac64Rng::seed_from_u64(42);
// load the Iris dataset, shuffle and split with ratio 0.8
let (train, test) = linfa_datasets::iris()
.shuffle(&mut rng)
.split_with_ratio(0.8);
With the training dataset a linear decision tree model can be trained. Entropy is used as a metric for the optimal split here:
let entropy_model = DecisionTree::params()
.split_quality(SplitQuality::Entropy)
.max_depth(Some(100))
.min_weight_split(10.0)
.min_weight_leaf(10.0)
.fit(&train);
The validation dataset is now used to estimate the error. For this, labels are predicted for the validation set and a confusion matrix then gives a clue about the types of errors:
let cm = entropy_model
.predict(test.records().view())
.confusion_matrix(&test);
println!("{:?}", cm);
println!(
"Test accuracy with Entropy criterion: {:.2}%",
100.0 * cm.accuracy()
);
Finally you can analyze which features were used in the decision and export the whole tree to a TeX file. It will contain a TikZ tree with information on the splitting decision and impurity improvement:
let feats = entropy_model.features();
println!("Features trained in this tree {:?}", feats);
let mut tikz = File::create("decision_tree_example.tex").unwrap();
tikz.write(entropy_model.export_to_tikz().to_string().as_bytes())
.unwrap();
The whole example can be found in linfa-trees/examples/decision_tree.rs.
Release 0.2.1
Changes
- remove feature flags, blocked by rust-lang/cargo#7915
- make ready for crates.io
Release 0.2.0
New algorithms
- Ordinary Linear Regression has been added to linfa-linear by [@Nimpruda] and [@paulkoerbitz]
- Generalized Linear Models has been added to linfa-linear by [@VasanthakumarV]
- Linear decision trees were added to linfa-trees by [@mossbanay]
- Fast independent component analysis (ICA) has been added to linfa-ica by [@VasanthakumarV]
- Principal Component Analysis and Diffusion Maps have been added to linfa-reduction by [@bytesnake]
- Support Vector Machines has been added to linfa-svm by [@bytesnake]
- Logistic regression has been added to linfa-logistic by [@paulkoerbitz]
- Hierarchical agglomerative clustering has been added to linfa-hierarchical by [@bytesnake]
- Gaussian Mixture Models has been added to linfa-clustering by [@relf]
Changes
- Common metrics for classification and regression have been added
- A new dataset interface simplifies the work with targets and labels
- New traits for Transformer, Fit and IncrementalFit standardize the interface
- Switched to Github Actions for better integration