
AnomalyMatch

High-performance semi-supervised anomaly detection with active learning

Demo search of Hubble Legacy Archive cutouts

Overview

This package uses a FixMatch pipeline built on EfficientNet models and provides a mechanism for active learning to detect anomalies in images. It also offers a GUI via ipywidgets for labelling and managing the detection process, including the ability to unlabel previously labelled images.

AnomalyMatch is available plug-and-play on GPUs in ESA Datalabs, providing seamless access to high-performance computing resources for large-scale anomaly detection tasks.

For detailed information about the method and its applications, see our papers:

Requirements

Dependencies are listed in the environment.yml file. To leverage the full capabilities of this package (especially training on large images or predicting over large image datasets), a GPU is strongly recommended. Use with Jupyter notebooks is recommended (see StarterNotebook.ipynb) since the UI relies on ipywidgets.

Installation

# Clone the repository
git clone https://github.com/ESA/AnomalyMatch.git
cd AnomalyMatch

# Create and activate conda environment from the environment.yml file
conda env create -f environment.yml
conda activate am

# Install the package (use -e for development mode)
pip install .

After installation, you can start using AnomalyMatch in your Jupyter notebooks. See StarterNotebook.ipynb for an example.

Session Tracking

AnomalyMatch automatically tracks comprehensive session information including training iterations, model checkpoints, labelled samples, and performance metrics. All session data is saved in organised directories under anomaly_match_results/sessions/ with the structure:

session_name_timestamp/
├── session_metadata.json    # Complete session tracking data
├── labeled_data.csv         # All labelled samples
├── config.toml              # Final configuration
└── model.pth                # Model checkpoint

You can view any saved session using:

import anomaly_match as am
am.print_session('/path/to/session/directory')

Session tracking is automatic and integrates seamlessly with existing workflows.
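Because the session files are plain JSON and CSV, a saved session can also be inspected without the package. A minimal stdlib sketch (only the file names come from the layout above; the helper name and returned keys are our own, illustrative choices):

```python
import csv
import json
from pathlib import Path

def summarise_session(session_dir: str) -> dict:
    """Summarise a saved session from its plain-text files.

    Only the file names (session_metadata.json, labeled_data.csv, model.pth)
    come from the documented layout; the returned keys are illustrative.
    """
    session = Path(session_dir)
    metadata = json.loads((session / "session_metadata.json").read_text())
    with open(session / "labeled_data.csv", newline="") as f:
        rows = list(csv.DictReader(f))
    return {
        "metadata_keys": sorted(metadata),
        "n_labelled": len(rows),
        "has_checkpoint": (session / "model.pth").exists(),
    }
```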

Recommended Folder Structure

  • project/
    • labeled_data.csv | containing annotations of labelled examples
    • metadata.csv | containing metadata, e.g. sourceIDs, for images (optional)
    • training_images/ | the cfg.data_dir, can contain .jpeg, .jpg, .png, .fits, .tif, or .tiff files
      • image1.png
      • image2.png
    • data_to_predict/ | the cfg.prediction_search_dir
      • unlabeled_file_part1.hdf5
      • unlabeled_file_part2.hdf5
      • large_dataset.zarr
      • individual_images/
        • img001.jpg
        • img002.png

Example of a minimal labeled_data.csv:

filename,label,your_custom_source_id
image1.png,normal,123456
image2.png,anomaly,424242

Here, the additional columns (like "your_custom_source_id") can store your own identifiers or data.
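A label file in this format can be loaded and sanity-checked with the standard library alone. The sketch below (the function name is ours, not part of AnomalyMatch) verifies the required filename and label columns and passes any extra columns through untouched:

```python
import csv

VALID_LABELS = {"normal", "anomaly"}

def load_label_file(path: str) -> list[dict]:
    """Read a labeled_data.csv and check its two required columns.

    Extra columns (e.g. custom source IDs) are preserved as-is.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    for row in rows:
        if "filename" not in row or "label" not in row:
            raise ValueError("labeled_data.csv needs 'filename' and 'label' columns")
        if row["label"] not in VALID_LABELS:
            raise ValueError(f"unexpected label {row['label']!r} for {row['filename']}")
    return rows
```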

Example of a metadata.csv:

filename,sourceID,ra,dec,custom_col
image1.png,source1,10.5,20.3,custom_value1
image2.png,source2,11.2,21.7,custom_value2

The metadata file can include optional columns for sourceID, ra, dec, and any custom columns you need. This metadata is automatically merged with the labelled data when saving results. Specify the metadata file with cfg.metadata_file = "path/to/metadata.csv".

The ra and dec coordinates must both be given in degrees, in the ICRS frame.
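The automatic merge joins the two CSV files on the filename column. A simplified stand-alone version of that join (an illustration, not AnomalyMatch's actual implementation) looks like this:

```python
def merge_labels_with_metadata(label_rows, metadata_rows):
    """Left-join labelled rows with metadata rows on the 'filename' column.

    A simplified stand-in for the automatic merge described above; labelled
    rows without matching metadata are kept unchanged.
    """
    meta_by_name = {row["filename"]: row for row in metadata_rows}
    merged = []
    for row in label_rows:
        combined = dict(meta_by_name.get(row["filename"], {}))
        combined.update(row)  # label columns take precedence on clashes
        merged.append(combined)
    return merged
```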

Supported File Formats

AnomalyMatch supports the following image file formats:

  • Standard formats: JPEG (.jpg, .jpeg), PNG (.png), TIFF (.tif, .tiff)
  • Astronomical formats: FITS (.fits)
  • Container formats: HDF5 (.h5, .hdf5), Zarr (.zarr)

Note: If multiple filetypes are present, all will be loaded.

Zarr File Support

AnomalyMatch supports Zarr files for efficient storage and processing of large image datasets. Zarr files are particularly useful for:

  • Large collections of images that don't fit in memory
  • Distributed and cloud-based workflows
  • Efficient chunked access to image data

Zarr File Requirements

Zarr files must contain:

  • An images dataset with shape (N, height, width, channels) where N is the number of images
  • Optional metadata file (.parquet format) containing filenames
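Before writing your own array to a Zarr store, it is worth checking it against the expected (N, height, width, channels) layout. A small numpy sketch (the helper name is our own):

```python
import numpy as np

def check_images_shape(images: np.ndarray) -> tuple:
    """Raise unless an array matches the expected (N, height, width, channels) layout."""
    if images.ndim != 4:
        raise ValueError(f"expected 4 dimensions (N, H, W, C), got {images.ndim}")
    return images.shape

# Example: 10 RGB images of 150x150 pixels pass the check
check_images_shape(np.zeros((10, 150, 150, 3), dtype=np.uint8))
```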

Creating Zarr Files

You can create compatible Zarr files using the images_to_zarr utility, which converts collections of images into the Zarr format expected by AnomalyMatch.

Example workflow:

# Install the converter first (shell): pip install images_to_zarr
import images_to_zarr as i2z

# Convert a directory of images to 150x150 pixel Zarr format
i2z.convert(
    output_dir="path/to/output.zarr",
    folders="path/to/images",
    resize=(150, 150),
    chunk_shape=(1000, 4, 150, 150),  # 1000 images per chunk
)

For best performance, we recommend using chunks of 1000 images (chunk_shape=(1000, channels, height, width)).
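As a quick back-of-envelope check on that recommendation, the memory footprint of one chunk is easy to compute (assuming 3-channel uint8 images here; the convert example above uses 4 channels, so scale accordingly):

```python
# Bytes per chunk = images_per_chunk * channels * height * width * 1 byte (uint8)
chunk_bytes = 1000 * 3 * 150 * 150 * 1
print(f"{chunk_bytes / 2**20:.1f} MiB per chunk")  # 64.4 MiB per chunk
```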

The resulting Zarr file will contain:

  • /images: The image array with proper chunking
  • Associated metadata file with original filenames

Zarr Configuration

AnomalyMatch automatically detects and processes Zarr files in your prediction directory:

cfg.prediction_search_dir = "/path/to/directory/containing/zarr/files"

AnomalyMatch will automatically discover all .zarr files in the specified directory and process them efficiently in parallel. Each Zarr file should contain image data with optional metadata in a corresponding .parquet file.

FITS File Handling

  • By default, the first extension (index 0) is used when loading FITS files
  • You can specify a particular extension using the fits_extension parameter in the configuration:
    • Set cfg.fits_extension in your code to control which FITS extensions to use
    • Integer values (e.g., 0, 1, 2) to access extensions by index
    • String values (e.g., "PRIMARY", "SCIENCE") to access extensions by name
    • List of integers or strings (e.g., [0, 1, 2] or ["PRIMARY", "SCIENCE", "ERROR"]) to combine multiple extensions into a single image. All specified extensions must have the same shape.
  • Multi-dimensional data is handled automatically:
    • For data with more than 3 dimensions, only the first 3 dimensions are used
    • FITS data are normalised to the 0-255 range when loaded (uint8)
    • Channel order is automatically corrected if necessary
  • When combining multiple extensions:
    • If extensions contain 2D data, they will be combined as channels (up to 3 for RGB)
    • If more than 3 extensions are provided for 2D data, only the first 3 will be used
    • All extensions must have identical dimensions to be combined

When working with FITS files containing multiple images or data products, specify which extension(s) to use in the configuration.
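The combination rules above can be sketched with numpy (a simplified illustration, not AnomalyMatch's actual loader): 2D extensions of identical shape are stacked as channels, at most three are kept, and the result is scaled to the 0-255 uint8 range:

```python
import numpy as np

def combine_extensions(extensions):
    """Stack 2D FITS extension arrays as channels and scale to 0-255 uint8.

    Simplified illustration of the documented rules: identical shapes are
    required, and at most the first 3 extensions are used.
    """
    shapes = {ext.shape for ext in extensions}
    if len(shapes) != 1:
        raise ValueError("all extensions must have identical dimensions")
    stacked = np.stack(extensions[:3], axis=-1).astype(np.float64)
    lo, hi = stacked.min(), stacked.max()
    scale = 255.0 / (hi - lo) if hi > lo else 0.0
    return ((stacked - lo) * scale).astype(np.uint8)
```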

Normalisation and Stretching

  • Normalisation can be selected in the UI via a drop-down, or set in code, e.g. cfg.normalisation_method = am.NormalisationMethod.ZSCALE
  • The available options (entries of the NormalisationMethod enum) are:
    • CONVERSION_ONLY: no normalisation
    • LOG: logarithmic normalisation
    • ZSCALE: linear normalisation based on the zscale minimum and maximum
    • ASINH: asinh normalisation with configurable scale and percentile clipping for both grayscale/multichannel and RGB images
  • Selecting a new normalisation in the dropdown applies it to subsequent training and prediction. For further detail see the Normalisation-Readme
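As an illustration of the ASINH option, a minimal asinh stretch with percentile clipping might look like the sketch below. The scale and percentile defaults here are arbitrary assumptions for the example, not AnomalyMatch's actual defaults:

```python
import numpy as np

def asinh_normalise(image, scale=0.1, clip_percentiles=(1.0, 99.0)):
    """Asinh stretch with percentile clipping, mapped to 0-255 uint8.

    Illustrative only; the parameter defaults are assumptions, not the
    package's actual defaults.
    """
    lo, hi = np.percentile(image, clip_percentiles)
    clipped = np.clip(image, lo, hi)
    # Normalise to [0, 1] before the asinh stretch
    norm = (clipped - lo) / (hi - lo) if hi > lo else np.zeros_like(clipped, dtype=float)
    stretched = np.arcsinh(norm / scale) / np.arcsinh(1.0 / scale)
    return (stretched * 255).astype(np.uint8)
```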

Key Config Parameters

  • save_dir: Path to store the trained model output.
  • data_dir: Location of the training data (*.jpeg, *.jpg, *.png, *.fits, *.tif, or *.tiff).
  • label_file: CSV mapping annotated images to labels.
  • metadata_file: Optional CSV file containing metadata for images (automatically merged with labelled data).
  • prediction_search_dir: Path where data to be predicted is stored.
  • logLevel: Controls verbosity of training/session logs.
  • test_ratio: Proportion of data used for evaluation (0.0 disables test evaluation, > 0 shows AUROC/AUPRC curves).
  • size: Dimensions to which images are resized (below 96x96 is not recommended).
  • N_to_load: Number of unlabeled images loaded into the training dataset at once.
  • output_dir: Folder for storing results (e.g., labeled_data.csv or final logs).

Advanced CFG Parameters

The following advanced parameters can be configured:

FixMatch Parameters

  • ema_m: Exponential moving average momentum (default: 0.99)
  • hard_label: Whether to use hard labels for unlabelled data (default: True)
  • temperature: Temperature for softmax in semi-supervised learning (default: 0.5)
  • ulb_loss_ratio: Weight of the unlabeled loss (default: 1.0)
  • p_cutoff: Confidence threshold for pseudo-labeling (default: 0.95)
  • uratio: Ratio of unlabeled to labeled data in each batch (default: 5)
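The role of p_cutoff (and of the hard labels produced when hard_label is True) can be sketched in a few lines of numpy. This is a schematic of the FixMatch pseudo-labelling step, not the package's training code; the masked cross-entropy on these pseudo-labels is then weighted by ulb_loss_ratio relative to the supervised loss:

```python
import numpy as np

def pseudo_label_mask(logits, p_cutoff=0.95):
    """Hard pseudo-labels plus a confidence mask (schematic FixMatch step).

    Predictions whose maximum softmax probability falls below p_cutoff are
    masked out of the unlabelled loss.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs.max(axis=1) >= p_cutoff
```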

Training Parameters

  • num_workers: Number of parallel workers for data loading (default: 4)
  • batch_size: Training batch size (default: 16)
  • lr: Learning rate (default: 0.0075)
  • weight_decay: L2 regularization parameter (default: 7.5e-4)
  • opt: Optimizer type (default: "SGD")
  • momentum: SGD momentum (default: 0.9)
  • bn_momentum: Batch normalization momentum (default: 1.0 - ema_m)
  • num_train_iter: Number of training iterations (default: 200)
  • eval_batch_size: Batch size for evaluation (default: 500)
  • num_eval_iter: Evaluation frequency, -1 means no evaluation (default: -1)
  • pretrained: Whether to use pretrained backbone (default: True)
  • net: Backbone network architecture (default: "efficientnet-lite0")

Additional Parameters

  • fits_extension: Extension(s) to use for FITS files, can be int, string, or list of int/string (default: None)
  • interpolation_order: 0-5, corresponding to skimage resize interpolation orders (default: 1, bi-linear)
  • normalisation_method: Normalisation method applied during file loading; can also be selected in the UI dropdown. Corresponds to an entry of the NormalisationMethod enum (default: NormalisationMethod.CONVERSION_ONLY)

Acknowledgements

Thank you to all users who have provided feedback and helped us to make AnomalyMatch better. Your contributions help continue improving this tool for the scientific community.
