The ASID library comprises AutoML tools for small and imbalanced tabular datasets.
For small datasets we propose a GenerativeModel estimator that searches for an optimal generative algorithm, one that produces synthetic samples similar to the original data without overfitting. Main features of this tool:
- It includes 9 popular generative approaches for small tabular datasets, such as kernel density estimation, Gaussian mixture models, copulas and deep learning models;
- It is easy to use and does not require time-consuming tuning;
- It includes a Hyperopt tuning procedure, which can be controlled by a runtime parameter;
- Several overfitting indicators are available.
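The idea of searching over generative algorithms can be illustrated with a minimal sketch. It is not the ASID implementation: it assumes only two candidates from scikit-learn (a Gaussian KDE with a hand-picked bandwidth and a three-component Gaussian mixture) and uses the mean held-out log-likelihood as a simple stand-in for an overfitting-aware selection criterion.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KernelDensity

X = load_iris().data
X_train, X_val = train_test_split(X, test_size=0.3, random_state=0)

# Hypothetical candidate set; the bandwidth and component count are assumptions.
candidates = {
    "kde": KernelDensity(kernel="gaussian", bandwidth=0.3),
    "gmm": GaussianMixture(n_components=3, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train)
    # score_samples returns per-point log-density for both estimators;
    # evaluating on held-out points penalizes models that memorize the sample.
    scores[name] = float(np.mean(model.score_samples(X_val)))

best_name = max(scores, key=scores.get)
best_model = candidates[best_name]

# KernelDensity.sample returns an array, GaussianMixture.sample returns (X, labels).
drawn = best_model.sample(100)
synthetic = drawn[0] if isinstance(drawn, tuple) else drawn
print(best_name, synthetic.shape)
```

A model that scores well on the training sample but poorly on held-out points is likely overfitting, which is why the comparison uses the validation split.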
For imbalanced datasets, the ASID library includes a tailored ensemble classifier, AutoBalanceBoost (ABB). It combines a consistent ensemble classifier with an embedded random oversampling technique. Key features of ABB:
- It exploits both popular ensemble approaches: bagging and boosting;
- It comprises an embedded sequential parameter tuning scheme, which makes it possible to reach high accuracy without time-consuming manual tuning;
- It is easy to use;
- Empirical analysis shows that ABB demonstrates a robust performance and on average outperforms its competitors.
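How bagging, boosting and random oversampling can fit together is sketched below. This is an illustration under stated assumptions, not the AutoBalanceBoost algorithm: each bag draws a class-balanced bootstrap (every class resampled with replacement up to the majority class size), fits a scikit-learn AdaBoostClassifier on it, and the bags are combined by averaging predicted probabilities. The helper `balanced_bootstrap`, the number of bags and the number of boosting rounds are all hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split


def balanced_bootstrap(y, rng):
    """Return indices that resample every class up to the majority class size."""
    classes, counts = np.unique(y, return_counts=True)
    return np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=counts.max(), replace=True)
        for c in classes
    ])


X, y = make_classification(n_classes=4, n_features=6, n_redundant=2, n_repeated=0,
                           n_informative=4, n_clusters_per_class=2, flip_y=0.05,
                           n_samples=700, random_state=45,
                           weights=(0.7, 0.2, 0.05, 0.05))
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

rng = np.random.default_rng(0)
members = []
for seed in range(5):  # bagging: several independently resampled training sets
    idx = balanced_bootstrap(y_train, rng)  # oversampling embedded in each bag
    booster = AdaBoostClassifier(n_estimators=30, random_state=seed)  # boosting
    members.append(booster.fit(X_train[idx], y_train[idx]))

# Average class probabilities over the bags and pick the most probable class.
proba = np.mean([m.predict_proba(X_test) for m in members], axis=0)
pred = members[0].classes_[proba.argmax(axis=1)]
score = f1_score(y_test, pred, average="macro")
print(round(score, 3))
```

Because every bag sees a different balanced resample, minority classes are represented in each boosted model while the averaging step reduces the variance that oversampling with replacement introduces.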
For imbalanced datasets we also propose an ImbalancedLearningClassifier estimator that searches for an optimal classifier for a given imbalanced task. Main features of this tool:
- It includes AutoBalanceBoost and combinations of SOTA ensemble algorithms and balancing procedures from the imbalanced-learn library;
- It is easy to use and does not require time-consuming tuning;
- It includes a Hyperopt tuning procedure for balancing procedures, which can be controlled by a runtime parameter;
- Several classification accuracy metrics are available.
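The search over combinations of balancing procedures and ensemble classifiers can be pictured with the following sketch. It is not how ImbalancedLearningClassifier is implemented: it assumes a tiny hypothetical search space (no balancing or a hand-written random oversampler standing in for imbalanced-learn samplers, paired with two scikit-learn ensembles) and ranks the pairs by cross-validated macro F1. Balancing is applied to the training folds only, so the validation folds keep the original class distribution.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold


def no_balancing(X, y, rng):
    return X, y


def random_oversample(X, y, rng):
    """Resample every class with replacement up to the majority class size."""
    classes, counts = np.unique(y, return_counts=True)
    idx = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=counts.max(), replace=True)
        for c in classes
    ])
    return X[idx], y[idx]


X, y = make_classification(n_classes=4, n_features=6, n_redundant=2, n_repeated=0,
                           n_informative=4, n_clusters_per_class=2, flip_y=0.05,
                           n_samples=700, random_state=45,
                           weights=(0.7, 0.2, 0.05, 0.05))

# Hypothetical search space: balancing procedure x ensemble classifier.
balancers = {"none": no_balancing, "random_oversampling": random_oversample}
classifiers = {
    "random_forest": lambda: RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": lambda: GradientBoostingClassifier(n_estimators=50, random_state=0),
}

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
results = {}
for b_name, balance in balancers.items():
    for c_name, make_clf in classifiers.items():
        fold_scores = []
        for train_idx, val_idx in cv.split(X, y):
            rng = np.random.default_rng(0)
            # Balance the training fold only; validate on untouched data.
            X_bal, y_bal = balance(X[train_idx], y[train_idx], rng)
            clf = make_clf().fit(X_bal, y_bal)
            fold_scores.append(
                f1_score(y[val_idx], clf.predict(X[val_idx]), average="macro"))
        results[(b_name, c_name)] = float(np.mean(fold_scores))

best = max(results, key=results.get)
print(best, round(results[best], 3))
```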
Requirements: Python 3.8.
- Install requirements from requirements.txt
pip install -r requirements.txt
- Install ASID library as a package
pip install https://github.com/aimclub/asid/archive/refs/heads/master.zip
Fitting a GenerativeModel instance on a small sample and generating a synthetic dataset:
from asid.automl_small.gm import GenerativeModel
from sklearn.datasets import load_iris
X = load_iris().data
genmod = GenerativeModel()
genmod.fit(X)
genmod.sample(1000)
Fitting an AutoBalanceBoost classifier on an imbalanced dataset:
from asid.automl_imbalanced.abb import AutoBalanceBoost
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
X, Y = make_classification(n_classes=4, n_features=6, n_redundant=2, n_repeated=0, n_informative=4,
                           n_clusters_per_class=2, flip_y=0.05, n_samples=700, random_state=45,
                           weights=(0.7, 0.2, 0.05, 0.05))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
clf = AutoBalanceBoost()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
score = f1_score(y_test, pred, average="macro")
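The example above scores predictions with macro-averaged F1 rather than plain accuracy. The short sketch below shows why that matters on imbalanced data. It uses a made-up label vector with the same 70/20/5/5 class proportions and a trivial predictor that always outputs the majority class; nothing here is part of ASID.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score

# 100 labels with the class proportions 0.7 / 0.2 / 0.05 / 0.05.
y_true = np.repeat([0, 1, 2, 3], [70, 20, 5, 5])
# A trivial baseline that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)            # 0.70
balanced = balanced_accuracy_score(y_true, y_pred)   # 0.25, mean per-class recall
# zero_division=0 scores classes that are never predicted as 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)  # about 0.206

print(accuracy, balanced, round(macro_f1, 3))
```

Accuracy rewards the baseline for the majority class alone, while macro F1 and balanced accuracy weight every class equally and expose that three of the four classes are never predicted.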
Choosing an optimal classification pipeline with ImbalancedLearningClassifier for an imbalanced dataset (searches through AutoBalanceBoost and combinations of SOTA ensemble algorithms and balancing procedures from the imbalanced-learn library):
from asid.automl_imbalanced.ilc import ImbalancedLearningClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
X, Y = make_classification(n_classes=4, n_features=6, n_redundant=2, n_repeated=0, n_informative=4,
                           n_clusters_per_class=2, flip_y=0.05, n_samples=700, random_state=45,
                           weights=(0.7, 0.2, 0.05, 0.05))
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
clf = ImbalancedLearningClassifier()
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
score = f1_score(y_test, pred, average="macro")
Results of empirical experiments with ASID algorithms are available here.
Documentation for ASID can be found here.
Usage examples can be found in examples.
GOST:
Plesovskaya, Ekaterina, and Sergey Ivanov. "An Empirical Analysis of KDE-based Generative Models on Small Datasets." Procedia Computer Science 193 (2021): 442-452.
BibTeX:
@article{plesovskaya2021empirical,
title={An empirical analysis of KDE-based generative models on small datasets},
author={Plesovskaya, Ekaterina and Ivanov, Sergey},
journal={Procedia Computer Science},
volume={193},
pages={442--452},
year={2021},
publisher={Elsevier}
}
The study is supported by the Research Center Strong Artificial Intelligence in Industry of ITMO University as part of the plan of the center's program: Development and testing of an experimental prototype of a library of strong AI algorithms in terms of basic algorithms based on generative synthesis of complex digital objects for quality assessment and automatic adaptation of machine learning models to the complexity of the task and sample size.