Study of dynamics in the scores of outlier detection algorithms

CN-TU/py-outlier-detection-dynamics
What do anomaly scores actually mean? Key characteristics of algorithms' dynamics beyond accuracy

FIV, Sep 2024, v.2

Repository to replicate experiments and results in:

What do anomaly scores actually mean? Key characteristics of algorithms' dynamics beyond accuracy by F. Iglesias, H. O. Marques, A. Zimek, T. Zseby

Comparison of score dynamics and accuracy (S-curves, accuracy, discriminant power, stability, robustness, confidence, coherence, variance) generated by different outlier detection algorithms subjected to different types of perturbations.

0. Setting up the environment

Experiments have been tested with Python 3.9.6.

Create a new virtual environment and install dependencies with the following commands:

    python -m venv venv

    source venv/bin/activate

    pip install -r requirements.txt 

1. Generate data

Synthetic data is generated with:

    python generate_data.py

This creates the folder [data/synthetic_data] with datasets used for the experiments.

It additionally creates the [plots/synthetic_data] folder with selected plots included in the paper.

Note that the folder [data/real_data] contains 4 real datasets downloaded from: https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/. Specifically, they are:

  • Cardiotocography (nodups, norm, 22%)
  • Shuttle (nodups, norm, v10)
  • Waveform (nodups, norm, v10)
  • Wilt (nodups, norm, 05%)

All datasets (both synthetic and real) are in .CSV format with the first row as header; the last column 'y' is the binary label: '1' for outliers, '0' for inliers.
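A minimal sketch of reading a dataset in this format; the inline CSV text stands in for one of the files in [data/synthetic_data], and the column names are illustrative:

```python
import csv
import io

# Stand-in for one of the .CSV files: header row, label column 'y' last.
csv_text = "f0,f1,y\n0.1,1.0,0\n0.2,1.1,0\n5.0,-4.0,1\n"

rows = list(csv.DictReader(io.StringIO(csv_text)))
X = [[float(r[c]) for c in r if c != "y"] for r in rows]  # feature vectors
y = [int(r["y"]) for r in rows]                           # 1=outlier, 0=inlier

print(len(X), sum(y))  # 3 points, 1 labelled outlier
```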

2. Extract outlier detection scores and accuracy performances

To extract outlier scores and accuracy performances, run:

    python outdet.py data/synthetic_data/ minmax

This script will take the datasets in [data/synthetic_data] and generate the [scores/minmax] folder, which contains files with the point-wise outlierness scores output by each algorithm under test. It also creates the file performances/perf_minmax.csv, a summary table with the overall performances (accuracy metrics). The minmax argument selects the type of normalization applied to the outlierness scores.
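A sketch of the two normalization modes, under the assumption that minmax is plain min-max scaling and proba follows the usual Gaussian-scaling idea (standardize the scores, then map them through the normal CDF, clipping at zero); the actual implementation in outdet.py may differ in detail:

```python
import math

def minmax_norm(scores):
    # Linearly rescale scores to the [0, 1] range.
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def proba_norm(scores):
    # Gaussian scaling: standardize, map through the normal CDF via erf,
    # and clip at 0 so only above-average scores indicate outlierness.
    mu = sum(scores) / len(scores)
    sd = (sum((s - mu) ** 2 for s in scores) / len(scores)) ** 0.5
    return [max(0.0, math.erf((s - mu) / (sd * math.sqrt(2)))) for s in scores]

raw = [0.2, 0.4, 0.5, 3.0]
mm = minmax_norm(raw)   # values spread over [0, 1]
pb = proba_norm(raw)    # inliers pushed toward 0, the outlier toward 1
```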

For proba-normalization:

    python outdet.py data/synthetic_data/ proba

It will generate scores ([scores/proba]) and summaries (performances/perf_proba.csv) for probability (Gaussian) normalization.

Repeat the process for real datasets with:

    python outdet.py data/real_data/ minmax

    python outdet.py data/real_data/ proba

Warning! When running scripts, information saved in performance files is appended (not rewritten).

3. Extract S-curves and dynamic measurements

To extract S-curves and dynamic measurements:

    python compare_scores_group.py data/synthetic_data scores/minmax minmax

    python compare_scores_group.py data/synthetic_data scores/proba proba

    python compare_scores_group.py data/real_data scores/minmax minmax

    python compare_scores_group.py data/real_data scores/proba proba

This will generate plots with S-curves in the [plots/minmax/S-curves] and [plots/proba/S-curves] folders, as well as the files performances/dynamic_minmax.csv and performances/dynamic_proba.csv. Note that the compare_scores_group.py script pairs each dataset with its file-with-scores by matching file names.
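The pairing rule can be pictured as matching on file stems; the file names below are purely illustrative (the real score files may carry extra suffixes), so this is only a sketch of the idea:

```python
from pathlib import Path

# Hypothetical dataset files and their derived score files.
datasets = ["data/synthetic_data/base.csv", "data/synthetic_data/shifted.csv"]
score_files = ["scores/minmax/base.csv", "scores/minmax/shifted.csv"]

# Pair each dataset with the score file sharing its stem.
pairs = {
    Path(d).stem: (d, s)
    for d in datasets
    for s in score_files
    if Path(d).stem == Path(s).stem
}
print(pairs["base"])
```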

Warning! When running scripts, information saved in performance files is appended (not rewritten).

4. Extract Perini's metrics (Stability & Confidence)

To extract Perini's metrics (Stability & Confidence):

    python perini_tests.py data/synthetic_data minmax

    python perini_tests.py data/synthetic_data proba

    python perini_tests.py data/real_data minmax

    python perini_tests.py data/real_data proba

This will create the files: performances/peri_stab_minmax.csv and performances/peri_stab_proba.csv for the Stability measurement, and performances/peri_conf_minmax.csv and performances/peri_conf_proba.csv for the Confidence measurement.

Note that Perini's Confidence is defined element-wise. To obtain a Confidence value per solution we use the 1% quantile.
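The aggregation step can be sketched as follows; the toy confidence values are made up, and the quantile method (here Python's default exclusive interpolation) is an assumption about the exact computation:

```python
import statistics

# Toy element-wise Confidence values for a 100-point solution.
conf = [0.99, 0.98, 0.97, 0.95, 0.60] * 20

# Per-solution summary: the 1% quantile, i.e. close to the
# worst-case per-point confidence.
q01 = statistics.quantiles(conf, n=100)[0]
print(round(q01, 3))  # dominated by the low-confidence points (0.6)
```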

Warning!! This step can take considerable time on a desktop computer (several days).

Warning!! When running scripts, information saved in performance files is appended (not rewritten).

- Sources and references

The original scripts were obtained from the repositories accompanying:

[1] Perini, L., Vercruyssen, V., Davis, J.: Quantifying the confidence of anomaly detectors in their example-wise predictions. In: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. Springer Verlag (2020).

[2] Perini, L., Galvin, C., Vercruyssen, V.: A Ranking Stability Measure for Quantifying the Robustness of Anomaly Detection Methods. In: 2nd Workshop on Evaluation and Experimental Design in Data Mining and Machine Learning @ ECML/PKDD (2020).

5. Merging all dynamic indices

To merge all dynamic and accuracy indices into a single file (for accuracy we only keep ROC and AAP), run:

    python merge_indices.py performances/dynamic_minmax.csv performances/perf_minmax.csv performances/peri_stab_minmax.csv performances/peri_conf_minmax.csv minmax

    python merge_indices.py performances/dynamic_proba.csv performances/perf_proba.csv performances/peri_stab_proba.csv performances/peri_conf_proba.csv proba

Generated outputs are performances/all_minmax.csv and performances/all_proba.csv.

6. Scatter plots and tables comparing metrics

To generate scatter plots comparing measurements and algorithms, run:

    python scatterplots.py performances/all_minmax.csv minmax

    python scatterplots.py performances/all_proba.csv proba

Additional plots will be generated in the [plots/minmax/performance] and [plots/proba/performance] folders.

To create a table in .TEX format (performances/perf_table.tex) with an overall comparison, run:

    python latex_table.py performances/all_minmax.csv performances/all_proba.csv performances/perf_table.tex

Correlation plots (plots/corr_lin.pdf and plots/corr_gaus.pdf) are generated with:

    python metric_corr.py performances/all_minmax.csv performances/all_proba.csv plots/

7. Saved results

The file performances.zip contains tables with summary results obtained from conducting all previous steps.
