AutoOC (in Beta)

AutoOC: Automated Machine Learning (AutoML) library focused on One-Class Learning algorithms (Deep AutoEncoders, Variational AutoEncoders, Isolation Forest, Local Outlier Factor and One-Class SVM)

Getting Started

This section explains where to get the package and how to install it.

Where to get it

The source code is currently hosted on GitHub at: https://github.com/luisferreira97/AutoOC

Binary installers for the latest released version are available at the Python Package Index (PyPI). The PyPI name of the package is autooc.

pip install autooc

Usage

1. Import the package

After installing the package, the first step is to import it. All methods are available through the main class, AutoOC.

from autooc.autooc import AutoOC

2. Instantiate an AutoOC object

The second step is to instantiate the AutoOC class with information about your dataset and context (e.g., the normal and anomaly classes, whether to run single-objective or multi-objective optimization, the performance_metric, and the algorithm). The algorithm parameter selects which algorithms are used during the optimization. The options are:

  • "autoencoders": Deep AutoEncoders (from TensorFlow)
  • "vae": Variational AutoEncoders (from TensorFlow)
  • "iforest": Isolation Forest (from Scikit-Learn)
  • "lof": Local Outlier Factor (from Scikit-Learn)
  • "svm": One-Class SVM (from Scikit-Learn)
  • "nas": the optimization is done using AutoEncoders and VAEs
  • "all": the optimization is done using all five algorithms

The performance_metric parameter selects which efficiency metric is minimized during the optimization. The options are:

  • "training_time": Minimizes training time
  • "predict_time": Minimizes the time it takes to predict one record
  • "num_params": Minimizes the number of parameters (count_params() in Keras); only available when algorithm is set to "autoencoders", "vae", or "nas".
  • "bic": Minimizes the value of the Bayesian Information Criterion

aoc = AutoOC(anomaly_class=0,
    normal_class=1,
    multiobjective=True,
    performance_metric="training_time",
    algorithm="autoencoders"
)

3. Load dataset

The third step is to load the dataset. You need training data (only 'normal' instances), validation data (depending on the type of validation, either (1) only 'normal' instances or (2) both 'normal' and 'anomaly' instances with their labels), and test data (both types of instances, with labels). You can use the load_example_data() function to load the popular ECG dataset.

X_train, X_val, X_test, y_test = aoc.load_example_data()
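
If you bring your own data instead of calling load_example_data(), the splits described above could be prepared along these lines. This is only a sketch in plain Python: split_one_class and its parameters are hypothetical names that are not part of AutoOC, and it assumes the validation set contains 'normal' instances only (option 1).

```python
import random


def split_one_class(X, y, normal_class=1, train_frac=0.6, val_frac=0.2, seed=0):
    """Split records into X_train, X_val (normal only) and X_test, y_test (mixed).

    Hypothetical helper for illustration; not part of the AutoOC API.
    """
    rng = random.Random(seed)
    normal = [i for i, label in enumerate(y) if label == normal_class]
    anomalies = [i for i, label in enumerate(y) if label != normal_class]
    rng.shuffle(normal)

    n_train = int(len(normal) * train_frac)
    n_val = int(len(normal) * val_frac)
    train_idx = normal[:n_train]
    val_idx = normal[n_train:n_train + n_val]
    # Remaining normal records plus every anomaly go to the labeled test set.
    test_idx = normal[n_train + n_val:] + anomalies
    rng.shuffle(test_idx)

    X_train = [X[i] for i in train_idx]
    X_val = [X[i] for i in val_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_val, X_test, y_test


# Toy data: 10 normal records (label 1) and 3 anomalies (label 0).
X = [[float(i)] for i in range(13)]
y = [1] * 10 + [0] * 3
X_train, X_val, X_test, y_test = split_one_class(X, y)
print(len(X_train), len(X_val), len(X_test))
```

Keeping every anomaly out of the training and validation sets matches the one-class setting, where the model only sees 'normal' records before testing.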

4. Train

The fourth step is to train the model. The fit() function runs the optimization using the given parameters.

run = aoc.fit(
    X=X_train,
    X_val=X_val,
    pop=3,
    gen=3,
    epochs=100,
    mlflow_tracking_uri="../results",
    mlflow_experiment_name="test_experiment",
    mlflow_run_name="test_run",
    results_path="../results"
)

5. Predict

The fifth step is to predict the labels of the test data using the predict() function. The mode parameter selects which individuals (models) are used to predict:

  • "all": uses all individuals (models) from the last generation
  • "best": uses the individual from the last generation that achieved the best predictive metric
  • "simplest": uses the individual from the last generation that achieved the best efficiency metric
  • "pareto": uses the Pareto-optimal individuals from the last generation (only when multiobjective=True). These are the non-dominated models, i.e., those that offer the best trade-offs between the predictive metric and the efficiency metric.

Additionally, you can use the threshold parameter (only used for AutoEncoders) to set the threshold for the prediction. You can use the following values:

  • "default": uses a different threshold value for each individual (model). For each model, the threshold is its associated default value (currently this behaves like "mean").
  • "mean": For each model, the threshold is the mean reconstruction error obtained on the validation data plus one standard deviation.
  • "percentile": For each model, the threshold is the 95th percentile of the reconstruction error obtained on the validation data (you can use the percentile parameter to change the percentile).
  • "max": For each model, the threshold is the maximum reconstruction error obtained on the validation data.
  • You can also pass an integer or float value. In this case, the same threshold is used for all the models.

predictions = aoc.predict(X_test,
    mode="all",
    threshold="default")
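
To make the threshold options concrete, here is a rough sketch of how each one could be derived from the validation reconstruction errors of a single model. It is not AutoOC's implementation: compute_threshold and percentile are hypothetical helpers, the use of the population standard deviation is an assumption, and "default" is mapped to "mean" only because the text above says it currently behaves that way.

```python
import statistics


def percentile(values, q):
    """q-th percentile (0-100) with linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (pos - lower)


def compute_threshold(val_errors, threshold="default", pct=95):
    """Map a threshold option to a number, given validation reconstruction errors."""
    if isinstance(threshold, (int, float)):
        return float(threshold)  # fixed value shared by all models
    if threshold in ("default", "mean"):
        # Assumption: population standard deviation.
        return statistics.mean(val_errors) + statistics.pstdev(val_errors)
    if threshold == "percentile":
        return percentile(val_errors, pct)
    if threshold == "max":
        return max(val_errors)
    raise ValueError(f"unsupported threshold: {threshold!r}")


val_errors = [0.10, 0.12, 0.11, 0.30, 0.09]  # per-record errors on validation data
test_errors = [0.08, 0.50, 0.13]
limit = compute_threshold(val_errors, "max")
# Records whose error exceeds the threshold are flagged as anomalies.
flags = [err > limit for err in test_errors]
print(limit, flags)
```

A stricter option such as "max" flags fewer records than "mean", so the choice trades missed anomalies against false alarms.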

6. Evaluate

You can use the predictions to compute performance metrics manually, but the evaluate() function is more convenient. It accepts the same mode parameter as predict() and a metric from the sklearn.metrics package (currently available: "roc_auc", "accuracy", "precision", "recall", and "f1").

score = aoc.evaluate(X_test,
    y_test,
    mode="all",
    metric="roc_auc",
    threshold="default")
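
For reference, the manual route could look like the sketch below. It assumes the predictions of one model have already been reduced to a flat list of labels using the same encoding as anomaly_class and normal_class; the exact structure returned by predict() is not described here. binary_metrics is a hypothetical helper, and treating the anomaly class as the positive class is an assumption. The functions in sklearn.metrics compute the same quantities.

```python
def binary_metrics(y_true, y_pred, positive=0):
    """Accuracy, precision, recall and F1, treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels: 0 = anomaly, 1 = normal (same encoding as the AutoOC example above).
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]
scores = binary_metrics(y_true, y_pred)
print(scores)
```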

Usage (Full Example)

from autooc.autooc import AutoOC

aoc = AutoOC(anomaly_class=0,
    normal_class=1,
    multiobjective=True,
    performance_metric="training_time",
    algorithm="autoencoders"
)

X_train, X_val, X_test, y_test = aoc.load_example_data()

run = aoc.fit(
    X=X_train,
    X_val=X_val,
    pop=3,
    gen=3,
    epochs=100,
    mlflow_tracking_uri="../results",
    mlflow_experiment_name="test_experiment",
    mlflow_run_name="test_run",
    results_path="../results"
)

predictions = aoc.predict(X_test,
    mode="all",
    threshold="default")

score = aoc.evaluate(X_test,
    y_test,
    mode="all",
    metric="roc_auc",
    threshold="default")
print(score)

Topic Definition

Grammatical Evolution (GE)

Grammatical Evolution (GE) is a biologically inspired evolutionary algorithm for generating computer programs. The algorithm was proposed by O’Neill and Ryan in 2001 and has been widely used in both optimization and ML tasks. In GE, a set of programs is represented as strings of characters, known as chromosomes. The chromosomes are encoded using a formal grammar, which defines the syntax and structure of the programs. The grammar is used to parse the chromosomes and generate the corresponding programs, which are then evaluated using a fitness function. The fitness function measures the quality of the programs and is used to guide the evolution process toward better solutions.

One of the main advantages of GE is its ability to generate programs in any language, as long as a suitable grammar is defined. This makes GE a versatile tool for developing custom software solutions for a wide range of applications. In addition to its flexibility and versatility, GE can handle complex optimization problems with a large number of objectives and constraints. It can also handle continuous and discrete optimization problems, as well as problems with mixed variables. GE has been shown to be effective in finding high-quality solutions in a relatively short time, compared to other optimization methods.

A GE execution starts by creating an initial population of solutions (usually randomly), where each solution (usually named individual) corresponds to an array of integers (or genome) that is used to generate the program (or phenotype). In the evolutionary process of GE, each generation consists of two main phases: evolution and evaluation. During the evolution phase, new solutions are generated using operations such as crossovers and mutations. Crossover involves selecting pairs of individuals as parents and swapping their genetic material to produce new individuals, known as children. Mutation, which is applied to the children individuals after crossover, consists of randomly altering their genome to maintain genetic diversity. In the evaluation phase, the population of individuals is evaluated using the fitness function.
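
The evolution phase described above can be illustrated with a small sketch. The operator choices here (one-point crossover, per-codon integer mutation), the mutation rate, and the codon range are assumptions made for illustration; they are not necessarily the operators or settings AutoOC uses.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two integer genomes at a random cut point."""
    cut = rng.randint(1, min(len(parent_a), len(parent_b)) - 1)
    child_a = parent_a[:cut] + parent_b[cut:]
    child_b = parent_b[:cut] + parent_a[cut:]
    return child_a, child_b


def int_flip_mutation(genome, rng, rate=0.1, codon_max=255):
    """Replace each codon with a random integer with probability `rate`."""
    return [rng.randint(0, codon_max) if rng.random() < rate else codon
            for codon in genome]


rng = random.Random(42)
# Initial population: random integer genomes.
population = [[rng.randint(0, 255) for _ in range(8)] for _ in range(4)]
child_a, child_b = one_point_crossover(population[0], population[1], rng)
child_a = int_flip_mutation(child_a, rng)
print(child_a, child_b)
```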

GE uses a mapping process to generate programs from a genome, guided by a formal grammar that is typically written in Backus-Naur Form (BNF) notation. This notation consists of terminals, which are items that can appear in the language, and non-terminals, which are variables that expand into one or more terminals or non-terminals.
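
A small sketch of the genome-to-program mapping may help. It uses the standard GE rule in which each codon, taken modulo the number of available productions, picks how the leftmost non-terminal is expanded. The toy grammar, the map_genome name, and the wrapping limit are illustrative assumptions and do not reflect the grammars shipped with AutoOC.

```python
# Toy BNF grammar as a dict: non-terminal -> list of productions.
GRAMMAR = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
    "<op>": [["+"], ["*"]],
    "<var>": [["x"], ["y"]],
}


def map_genome(genome, start="<expr>", max_wraps=2):
    """Expand the leftmost non-terminal using codon % number_of_productions."""
    symbols = [start]
    output = []
    used = 0
    limit = len(genome) * (max_wraps + 1)
    while symbols:
        symbol = symbols.pop(0)
        if symbol not in GRAMMAR:
            output.append(symbol)  # terminal: emit as-is
            continue
        if used >= limit:
            return None  # invalid individual: ran out of codons
        productions = GRAMMAR[symbol]
        codon = genome[used % len(genome)]  # wrap around the genome if needed
        used += 1
        symbols = list(productions[codon % len(productions)]) + symbols
    return " ".join(output)


print(map_genome([0, 1, 0, 1, 1, 1]))  # prints "x * y"
```

Genomes that never finish expanding, such as [0] with this grammar, return None and would be treated as invalid individuals.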

Nondominated Sorting Genetic Algorithm II (NSGA-II)

Nondominated Sorting Genetic Algorithm II (NSGA-II) is a multi-objective optimization algorithm proposed in 2002. It is based on the concept of Pareto dominance: a solution dominates another if it is not worse in any objective and strictly better in at least one. The goal of NSGA-II is to find a set of non-dominated solutions, known as the Pareto front, which represents the trade-offs between the different objectives.
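
The dominance test and the resulting Pareto front can be sketched in a few lines. The objective pairs below are made-up values (a prediction error and a training time, both minimized) used only to illustrate the idea; they are not AutoOC results.

```python
def dominates(a, b):
    """True if `a` dominates `b`; every objective is minimized."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Return the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (prediction error, training time in seconds) pairs, one per model.
models = [(0.10, 30.0), (0.12, 12.0), (0.20, 5.0), (0.15, 20.0), (0.25, 6.0)]
print(pareto_front(models))
# prints [(0.1, 30.0), (0.12, 12.0), (0.2, 5.0)]
```

The two excluded models are each beaten on both objectives by another model, so keeping them would offer no useful trade-off.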

NSGA-II can also handle constraints. In its original formulation this is done through constrained domination: a feasible solution always dominates an infeasible one, and between two infeasible solutions the one with the smaller overall constraint violation is preferred. NSGA-II also includes a crowding distance measure, which is used to preserve diversity among the solutions and avoid premature convergence. The algorithm has been widely used for multi-objective optimization problems in various fields, including engineering, economics, and biology.
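
The crowding distance can be sketched as follows, using its usual definition: for each objective, a solution is credited with the normalized gap between its two neighbours in the sorted front, and boundary solutions receive an infinite distance. The front below contains made-up objective values for illustration only.

```python
def crowding_distance(front):
    """Crowding distance for each point in a front (list of objective tuples)."""
    n = len(front)
    distance = [0.0] * n
    if n == 0:
        return distance
    for m in range(len(front[0])):
        order = sorted(range(n), key=lambda i: front[i][m])
        low, high = front[order[0]][m], front[order[-1]][m]
        # Boundary solutions are always kept, so they get infinite distance.
        distance[order[0]] = distance[order[-1]] = float("inf")
        if high == low:
            continue
        for k in range(1, n - 1):
            gap = front[order[k + 1]][m] - front[order[k - 1]][m]
            distance[order[k]] += gap / (high - low)
    return distance


front = [(0.10, 30.0), (0.12, 12.0), (0.20, 5.0)]
print(crowding_distance(front))
```

When two solutions share the same non-domination rank, the one with the larger crowding distance is preferred, which pushes the search toward less crowded regions of the front.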

One-Class Classification (OCC)

Also known as unary classification, One-Class Classification (OCC) can be viewed as a subclass of unsupervised learning in which the Machine Learning model learns only from training examples of a single class [8, 9]. This type of learning is valuable in diverse real-world scenarios where labeled data is non-existent, or infeasible or difficult to obtain (e.g., requiring a costly and slow manual class assignment), such as fraud detection, cybersecurity, predictive maintenance, or industrial quality assessment.

Citation

To cite this work please use the following article:

@article{FERREIRA2023110496,
  author = {Luís Ferreira and Paulo Cortez},
  title = {AutoOC: Automated multi-objective design of deep autoencoders and one-class classifiers using grammatical evolution},
  journal = {Applied Soft Computing},
  volume = {144},
  pages = {110496},
  year = {2023},
  issn = {1568-4946},
  doi = {https://doi.org/10.1016/j.asoc.2023.110496},
  url = {https://www.sciencedirect.com/science/article/pii/S1568494623005148}
}

Roadmap

See the open issues for a list of proposed features (and known issues).

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Contact

Luís Ferreira - LinkedIn - luis_ferreira223@hotmail.com

Project Link: https://github.com/luisferreira97/AutoOC
