PlanktonFlow - an End-to-end Deep Learning Pipeline for Automatic Plankton Classification

An end-to-end deep learning solution supporting multiple model architectures with advanced features for training, evaluation, and production-ready inference.

See the related preprint: PlanktonFlow: hands-on, deep-learning classification of plankton images for biologists.

Features

Preprocessing

Tailored for our dataset — customizable for yours.

  • Multiple Data Input Formats: Datasets exported from EcoTaxa, organized in class folders, or described by a CSV/TSV file are all supported.
  • Data Augmentation: Augments the training data to help address class imbalance.
  • Scalebar Removal: Detects and removes scale bars that are present in some images using a YOLOv8 model.
  • Automatic Data Splitting: Preprocessing is entirely configurable. It splits the data automatically for training and evaluation and generates the YAML configuration files for you.

Training

  • Multi-model Support: YOLOv11, ResNet, DenseNet, EfficientNet
  • Advanced Training:
    • Configurable hyperparameters
    • Early stopping with checkpointing
    • Multiple loss functions
    • More features
  • Model Factory Pattern: Dynamic model creation with variants
  • Tracking & Integration: Real-time tracking of metrics, weights, and model versions using either Weights & Biases or our custom module

Inference

  • Batch Processing: Efficient handling of image directories
  • Flexible Output:
    • Top-K predictions
    • CSV export capabilities
  • Production Ready: Device-aware execution (CPU/GPU)

Installation

This guide covers every step, because the project is intended for biologists who may not be familiar with setting up Python projects. Experienced users will recognize the usual process: create a virtual environment, activate it, install the dependencies, and run the Python scripts.

  1. Install Python:

This project was developed with Python 3.12.3. Download that version from the official link (Python-3.12.3), choosing the installer that matches your operating system.

On Windows, make sure to check "Add Python to PATH" during installation.

  2. Clone or Download the Repository:

Option 1: Using Git (for more experienced users)

git clone https://github.com/ziraax/PlanktonFlow.git
cd PlanktonFlow

Option 2: Using the download button

If you are not familiar with Git, you can simply click the green "<> Code" button and select "Download ZIP". Then, extract the project wherever you want on your computer.

  3. Create a virtual environment:

Virtual environments in Python are isolated directories that contain their own Python interpreter and libraries, allowing you to manage dependencies for each project separately. This prevents conflicts between packages required by different projects and ensures reproducible setups. This is especially useful for a project like this, which has many dependencies.

To create a virtual environment, open a terminal in the folder where you downloaded the project and run:

python3 -m venv .venv

Here, .venv will be the name of the folder holding the virtual environment. On Windows, if python3 is not recognized, try python instead.

  4. Activate the Environment:

Now, depending on your operating system:

  • On Windows, in the Command Prompt (CMD):
.venv\Scripts\activate
  • On Windows, in PowerShell:
.\.venv\Scripts\Activate.ps1
  • On bash (Linux/macOS):
source .venv/bin/activate

After activation, your terminal prompt will show the name of the environment, for example (.venv).
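
If the prompt does not change and you are unsure whether the environment is really active, you can ask Python itself. The following is a minimal sketch that uses only the standard library; the function name in_virtualenv is ours and is not part of PlanktonFlow.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the environment folder (e.g. .venv),
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Virtual environment active:", in_virtualenv())
```

Save it to a file and run it with the same python command you use for the project; the interpreter path should point inside the .venv folder.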

⚠️ If PowerShell reports that Activate.ps1 cannot be loaded because script execution is disabled on this system, your current execution policy is blocking scripts, which is the default for security reasons. To allow scripts for the current PowerShell session only, type:

Set-ExecutionPolicy -Scope Process -ExecutionPolicy Bypass

Then activate your virtual environment.

  5. Install Dependencies:

Type in your terminal:

pip install -r requirements.txt

This installs all the packages listed in the file into the virtual environment. This may take several minutes.

You can confirm it worked by typing:

pip list
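
As an alternative to reading through the pip list output, the short sketch below reports which modules cannot be imported. The module names in expected are assumptions on our part (PyTorch, Weights & Biases, and PyYAML seem likely given the features described here); adjust them to match requirements.txt.

```python
from importlib.util import find_spec


def missing_packages(module_names):
    """Return the names from module_names that cannot be imported."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names; adjust them to match requirements.txt.
    expected = ["torch", "wandb", "yaml"]
    missing = missing_packages(expected)
    print("Missing:", missing if missing else "none")
```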

  6. (Optional) Log into Weights & Biases:

Weights & Biases provides robust tools for tracking training runs and performing hyperparameter tuning. It is best suited for users with more advanced needs.

wandb login

Then follow the instructions.

Usage

This section walks through the different ways to use the pipeline. The system is built around YAML configuration files, which make it easy to reproduce experiments and manage different setups.

Dataset used in the study

The pipeline was originally developed for a specific use case at INRAE UMR DECOD (Rennes, France), where it supports ongoing research on monitoring plankton communities, and was later extended and modularized so that it can benefit the scientific community.

For convenience, we also provide the dataset used in our study. You can download it from Zenodo - PlanktonFlow76 as a starting point, or substitute your own data.

To preprocess your data

The preprocessing system supports three different data input formats. Choose the configuration that matches your data organization:

1. Hierarchical Data Format (Folder-based)

If your data is organized in folders, where each folder name represents a class:

python3 run_preprocessing.py --config configs/preprocessing/PreprocessWithHierarchical.yaml
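
Before running the command above, it can help to check how many images each class folder contains, since the configuration filters classes by image count. This is a small illustrative sketch, not part of the pipeline; the function name and the list of image extensions are assumptions.

```python
from pathlib import Path

# Assumed set of image extensions; the pipeline may accept others.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def count_images_per_class(data_path):
    """Map each class folder directly under data_path to its number of image files."""
    counts = {}
    for class_dir in sorted(Path(data_path).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore stray files next to the class folders
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

For example, count_images_per_class("DATA/your_hierarchical_dataset") returns a dictionary such as {"class_a": 120, "class_b": 45}, which you can compare against min_images_per_class.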

Example configuration (configs/preprocessing/PreprocessWithHierarchical.yaml):

# Simple Hierarchical Dataset Preprocessing Configuration
input_source:
  type: "hierarchical"
  data_path: "DATA/your_hierarchical_dataset"
  subdirs: []  # Empty = direct class folders under data_path

preprocessing:
  scalebar_removal:
    enabled: true
    model_path: "models/model_weights/scale_bar_remover/best.pt"
    confidence: 0.4
    img_size: 416
  
  grayscale_conversion:
    enabled: true
    mode: "RGB"
  
  image_filtering:
    min_images_per_class: 100
    max_images_per_class: 4000
    skip_corrupted_images: true
  
  data_splitting:
    train_ratio: 0.7
    val_ratio: 0.2
    test_ratio: 0.1
    stratified: true
    random_seed: 42
  
  augmentation:
    enabled: true
    target_images_per_class: 1500
    max_copies_per_image: 10
    techniques:
      horizontal_flip: 0.5
      vertical_flip: 0.2
      rotate_90: 0.3
      brightness_contrast: 0.4
      hue_saturation: 0.3
      
output:
  base_path: "DATA/your_dataset"
  processed_path: "DATA/your_dataset_processed"
  final_dataset_path: "DATA/your_dataset_final"
  create_dataset_yaml: true
  
logging:
  wandb_enabled: false   # In most cases, you don't need WandB logging at the preprocessing step
  log_class_distribution: true  
  log_sample_images: false
  log_processing_times: true
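
To make the data_splitting options concrete, here is a minimal sketch of a stratified split in plain Python. It is not the pipeline's implementation: it only illustrates that the three ratios must sum to 1, that each class is split separately so class proportions are preserved, and that a fixed random_seed makes the split reproducible. The function name and the rounding rule are our own assumptions.

```python
import math
import random
from collections import defaultdict


def stratified_split(samples, train_ratio=0.7, val_ratio=0.2, test_ratio=0.1, seed=42):
    """Split (item, label) pairs into train/val/test while keeping class proportions."""
    if not math.isclose(train_ratio + val_ratio + test_ratio, 1.0):
        raise ValueError("train_ratio + val_ratio + test_ratio must equal 1.0")

    by_label = defaultdict(list)
    for item, label in samples:
        by_label[label].append(item)

    rng = random.Random(seed)  # fixed seed makes the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Small epsilon guards against float results such as 6.999999 for 7.
        n_train = int(len(items) * train_ratio + 1e-9)
        n_val = int(len(items) * val_ratio + 1e-9)
        splits["train"] += [(i, label) for i in items[:n_train]]
        splits["val"] += [(i, label) for i in items[n_train:n_train + n_val]]
        # Whatever remains goes to the test split.
        splits["test"] += [(i, label) for i in items[n_train + n_val:]]
    return splits
```

With the default ratios, a class of 10 images contributes 7 to training, 2 to validation, and 1 to testing.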

2. CSV/TSV Mapping Format

If you have a CSV/TSV file mapping image paths to class labels:

python3 run_preprocessing.py --config configs/preprocessing/PreprocessWithCSV-TSV.yaml

Example configuration:

# CSV Mapping Dataset Preprocessing Configuration
input_source:
  type: "csv_mapping"
  images_path: "DATA/your_images_directory"
  metadata_file: "DATA/your_labels.csv"
  image_column: "filename"
  label_column: "species"
  separator: ","
  image_path_prefix: ""

preprocessing:
  # Same preprocessing parameters as above
  [...]
      
output:
  base_path: "DATA/your_dataset"
  processed_path: "DATA/your_dataset_processed"
  final_dataset_path: "DATA/your_dataset_final"
  create_dataset_yaml: true
  
logging:
  wandb_enabled: false
  log_class_distribution: true
  log_sample_images: false
  log_processing_times: true
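
The sketch below shows how the input_source fields above relate to the contents of the metadata file. It uses Python's standard csv module and is only an illustration of the mapping, not the pipeline's own loader; the function name and the sample rows are made up.

```python
import csv
import io


def read_label_mapping(handle, image_column="filename", label_column="species",
                       separator=",", image_path_prefix=""):
    """Return a list of (image_path, label) pairs read from a CSV/TSV file object."""
    reader = csv.DictReader(handle, delimiter=separator)
    pairs = []
    for row in reader:
        pairs.append((image_path_prefix + row[image_column], row[label_column]))
    return pairs


# Tiny in-memory example with made-up rows.
example = io.StringIO("filename,species\nimg_001.png,Daphnia\nimg_002.png,Cyclops\n")
print(read_label_mapping(example))
```

For a TSV file, pass separator="\t"; a KeyError here means that image_column or label_column does not match the header of your file.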

3. EcoTaxa Format

For data exported from the EcoTaxa platform:

python3 run_preprocessing.py --config configs/preprocessing/PreprocessWithEcotaxa.yaml

Example configuration:

# EcoTaxa TSV Preprocessing Configuration
input_source:
  type: "ecotaxa"
  data_path: "DATA/your_ecotaxa_folder"
  metadata_file: "ecotaxa_export_TSV_xxxxx.tsv"
  separator: "\t"

preprocessing:
  # Same preprocessing parameters as above
  [...]

output:
  base_path: "DATA/your_ecotaxa"
  processed_path: "DATA/your_ecotaxa_processed"
  final_dataset_path: "DATA/your_ecotaxa_final"
  create_dataset_yaml: true
  
logging:
  wandb_enabled: false
  log_class_distribution: true
  log_sample_images: false
  log_processing_times: true

To train a model

Training is fully configuration-driven. You can train different model architectures with various hyperparameters.

python3 run_training.py --config configs/training/TrainDefault{modelName}.yaml

Basic Training Example

Example configuration (configs/training/TrainDefaultEfficientNet.yaml):

# ============================================================
# EfficientNet B5 Training Configuration
# ============================================================

run_name: "efficientnet_b5_experiment"  # Custom run name in case you don't want it to be generated

# DATA CONFIGURATION
data:
  dataset_path: "DATA/your_dataset_final"  # Path to preprocessed dataset
  
# PROJECT CONFIGURATION
project:
  name: "YOLOv11Classification500"     # W&B project name
  
# MODEL CONFIGURATION
model:
  name: "efficientnet"      # Options: "efficientnet", "resnet", "densenet", "yolov11"
  variant: "b5"             # EfficientNet variants: "b0"-"b7"
  pretrained: true          # Use pretrained weights
  freeze_backbone: false    # Freeze backbone layers
  input_size: 224           # Input image size
  num_classes: 76           # Number of output classes
  
# TRAINING CONFIGURATION
training:
  batch_size: 32               # Training batch size
  learning_rate: 0.001         # Initial learning rate
  weight_decay: 0.01           # Weight decay (L2 regularization) 
  epochs: 50                   # Number of training epochs
  optimizer: "adamw"           # Options: "adam", "adamw", "sgd", "rmsprop"
  early_stopping_patience: 15  # Early stopping patience
  device: "cuda"               # Options: "cuda", "cpu", "auto"
  num_workers: 8               # Number of data loader workers

# LOSS CONFIGURATION
loss:
  type: "focal"                # Options: "focal", "labelsmoothing", "weighted"
  focal_alpha: 1.0             # Focal loss alpha parameter
  focal_gamma: 2.0             # Focal loss gamma parameter
  use_per_class_alpha: true    # Use per-class weights

# WEIGHTS & BIASES CONFIGURATION
wandb:
  log_results: true            # Enable W&B logging
  tags: ["efficientnet", "b5", "production"]
  notes: "Production training run"

Training will always compute metrics on the test set at the end.
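
For readers unfamiliar with early_stopping_patience, the following sketch shows the general idea of early stopping with checkpointing. We assume here that validation loss is the monitored quantity; the pipeline may monitor a different metric, and the class and the min_delta parameter are hypothetical.

```python
class EarlyStopping:
    """Stop training when the monitored validation loss stops improving."""

    def __init__(self, patience=15, min_delta=0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # smallest decrease that counts as improvement
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def step(self, val_loss):
        """Record one epoch; return (is_best, should_stop)."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
            return True, False   # new best: a good moment to save a checkpoint
        self.epochs_without_improvement += 1
        return False, self.epochs_without_improvement >= self.patience
```

In a training loop, you would call step once per epoch with the validation loss, save the weights when is_best is true, and break out of the loop when should_stop is true.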

Advanced Training Features

Multiple Model Architectures:

  • EfficientNet: variant: "b0" to "b7"
  • ResNet: variant: "18", "34", "50", "101", "152"
  • DenseNet: variant: "121", "161", "169", "201"
  • YOLOv11: name: "yolov11"

Loss Functions:

# Focal Loss (good for imbalanced datasets)
loss:
  type: "focal"
  focal_gamma: 2.0
  focal_alpha: 1.0
  use_per_class_alpha: true

# Label Smoothing
loss:
  type: "labelsmoothing"
  labelsmoothing_epsilon: 0.12

# Weighted Cross-Entropy (automatic class balancing)
loss:
  type: "weighted"

Training Without Weights & Biases:

# Disable W&B for offline training
wandb:
  log_results: false
  tags: ["production", "efficientnet", "b5", "label_smoothing"]
  notes: "Production EfficientNet B5 with label smoothing"

run_name: "offline_experiment"

# All metrics and plots will be saved locally to:
# model_weights/{model_name}/{variant}/{run_name}/

Inference (Making predictions)

Use trained models to make predictions on new images.

python3 run_inference.py --config configs/inference/your_inference_config.yaml

Example configuration (configs/inference/DefaultInference.yaml):

# Model Configuration
model:
  name: "efficientnet"      # Model architecture: "efficientnet", "resnet", "densenet", "yolov11"
  variant: "b5"             # Model variant
  weights_path: "model_weights/efficientnet/b5/my_experiment/best.pt"
  num_classes: 76           # Number of classes (must match training)

# Dataset YAML for Class Names
dataset_yaml: "DATA/final_dataset/dataset.yaml"  # <-- Path to the YAML file containing class names
  # This file is generated automatically during preprocessing.
  # It MUST match the dataset used for training this model.
  # The class names will be read from the 'names' field in this YAML.

# Inference Configuration
inference:
  image_dir: "path/to/new/images"        # Directory containing images to predict
  batch_size: 32                        # Inference batch size
  top_k: 5                              # Return top K predictions per image
  device: "cuda"                        # Device: "cuda", "cpu", "auto"
  save_csv: true                        # Save results as CSV
  output_path: "outputs/predictions.csv" # Output file path

# Optional preprocessing during inference
preprocessing:
  scalebar_removal: true                # Apply scalebar removal if needed
  
# Weights & Biases Configuration
wandb:
  log_results: false                    # Usually disabled for inference
  tags: ["inference", "production"]
  notes: "Production inference run"

How class names are handled:

  • The dataset_yaml field in your inference config should point to the YAML file generated during preprocessing (e.g., DATA/final_dataset/dataset.yaml).
  • The system will read the class names from the names field in this YAML, ensuring the class order matches what was used during training.
  • This prevents label mismatches and makes inference robust to changes in the dataset structure.

Inference Output:

  • CSV file with detailed predictions and confidence scores (specified in output_path)
  • Predictions include top-K classes with probabilities for each image
  • Optional scalebar preprocessing applied automatically if enabled
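
As an illustration of what top-K predictions with probabilities means, the sketch below turns raw model scores into a ranked list of class names. It assumes the model outputs one score (logit) per class in the same order as the names field of the dataset YAML; the function names are ours.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_predictions(logits, class_names, k=5):
    """Return the k most probable (class_name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

This also shows why the class order matters: if class_names is in a different order than during training, every prediction is mapped to the wrong label.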

If your dataset matches ours (see Zenodo-PlanktonFlow76 and the related paper, PlanktonFlow: hands-on, deep-learning classification of plankton images for biologists; as of 26.08.25, the preprint is not yet published), you may want to use our best-performing model. Its parameters are available in the supplementary archive, PlanktonFlow: Supplementary Information. Place them in model_weights/efficientnet/b5/{name}/ and update weights_path in the inference configuration file accordingly. The model was trained on 76 classes; the class names are listed in DATA/final_dataset/dataset.yaml.

Hyperparameter Optimization

Perform automated hyperparameter sweeps using Weights & Biases:

python3 run_sweep.py --sweep_config configs/sweeps/densenet_sweep.yaml

Example sweep configuration (configs/sweeps/densenet_sweep.yaml):

# Sweep configuration
program: run_sweep.py
method: bayes  # or random, grid
metric:
  name: val/metrics/accuracy_top1
  goal: maximize

# Fixed parameters
parameters:
  model_name:
    value: densenet
  epochs:
    value: 30
  device:
    value: cuda

  # Parameters to optimize
  densenet_variant:
    values: [121, 161, 169, 201]
  
  batch_size:
    values: [32, 64]
    
  learning_rate:
    distribution: log_uniform_values
    min: 1e-6
    max: 1e-3
    
  loss_type:
    values: ["focal", "labelsmoothing", "weighted"]
    
  # Focal loss parameters (when applicable)
  focal_gamma:
    distribution: uniform
    min: 1.0
    max: 3.0

Running Sweeps:

  1. The sweep automatically creates multiple training runs
  2. Each run tests different hyperparameter combinations
  3. Results are logged to Weights & Biases for comparison
  4. Best configurations are automatically identified
  5. When you are satisfied with the results, stop the sweep from the terminal (for example with Ctrl+C).
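
The learning_rate entry in the sweep configuration uses a log-uniform distribution. As we understand it, this means the logarithm of the value is sampled uniformly between the logarithms of min and max, so each order of magnitude is explored equally often. The sketch below illustrates that idea in plain Python; it is not how Weights & Biases implements it.

```python
import math
import random


def sample_log_uniform(min_value, max_value, rng=random):
    """Draw x in [min_value, max_value] with log(x) uniformly distributed."""
    return math.exp(rng.uniform(math.log(min_value), math.log(max_value)))


rng = random.Random(0)
samples = [sample_log_uniform(1e-6, 1e-3, rng) for _ in range(5)]
print(["%.2e" % s for s in samples])
```

A plain uniform distribution over the same range would almost never propose values near 1e-6, which is why a log scale is the usual choice for learning rates.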

Quick Start Examples

Complete Workflow Example:

# 1. Preprocess your data
python3 run_preprocessing.py --config configs/preprocessing/PreprocessWithHierarchical.yaml

# 2. Train a model
python3 run_training.py --config configs/training/TrainDefaultEfficientNet.yaml

# 3. Make predictions
python3 run_inference.py --config configs/inference/DefaultInference.yaml
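
If you prefer to launch the three steps from a single Python script, something like the sketch below could work. It simply calls the same scripts with the same configuration files as above and stops at the first step that fails; the helper names are ours and the script is not shipped with the repository.

```python
import subprocess
import sys

# Script and config names are taken from the commands above.
STEPS = [
    ("run_preprocessing.py", "configs/preprocessing/PreprocessWithHierarchical.yaml"),
    ("run_training.py", "configs/training/TrainDefaultEfficientNet.yaml"),
    ("run_inference.py", "configs/inference/DefaultInference.yaml"),
]


def build_commands(steps=STEPS, python=sys.executable):
    """Turn (script, config) pairs into argument lists for subprocess.run."""
    return [[python, script, "--config", config] for script, config in steps]


def run_workflow(steps=STEPS):
    """Run each step in order; check=True raises CalledProcessError on the first failure."""
    for command in build_commands(steps):
        print("Running:", " ".join(command))
        subprocess.run(command, check=True)
```

Call run_workflow() from the repository root with the virtual environment activated, so that sys.executable is the interpreter that has the dependencies installed.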

Reproducing Results

To facilitate reproducibility, we provide configuration files for all models and their corresponding hyperparameters in the configs/reproduce_paper directory. Using these training configurations together with the preprocessed dataset available at Zenodo-PlanktonFlow76 should enable you to reliably reproduce the results presented in the paper.

Results and monitoring

Monitoring

If you choose not to use Weights & Biases, our custom monitoring module will collect training data and write it to a log file in real time, allowing you to monitor your runs as they progress.

The file will be saved at: model_weights/{model_name}/{variant}/{run_name}/training_log.txt.

Example:

Training Log - Reproduce_best_model_20250814_141638
============================================================
Model: efficientnet b5
Dataset: DATA/final_dataset
Batch Size: 64
Learning Rate: 1e-5
Loss Type: labelsmoothing
Started: 2025-08-14 14:16:39
============================================================
Epoch | Train Loss | Val Loss | Top-1 Acc | Top-5 Acc | F1 Macro | Recall Macro | Precision Macro | LR
---------------------------------------------------------------------------------------------------------
    1 |     2.7936 |   1.6791 |    0.7390 |    0.9518 |   0.6957 |       0.7385 |          0.7072 | 1.00e-05
    2 |     1.5423 |   1.3701 |    0.8247 |    0.9806 |   0.7969 |       0.8204 |          0.7892 | 1.00e-05
    3 |     1.3472 |   1.2886 |    0.8498 |    0.9877 |   0.8246 |       0.8387 |          0.8202 | 9.99e-06
    4 |     1.2635 |   1.2551 |    0.8603 |    0.9892 |   0.8375 |       0.8534 |          0.8298 | 9.98e-06

This same data is also stored in training_log.csv for further analysis.
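
If you only have the text log at hand, the epoch table shown above is easy to parse. The sketch below relies solely on the layout visible in the example (a header row starting with Epoch and pipe-separated numeric rows); the function name is ours.

```python
def parse_training_log(text):
    """Parse the pipe-separated epoch table of a training log into a list of dicts."""
    rows, header = [], None
    for line in text.splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        if cells[0] == "Epoch":
            header = cells                      # column names from the header row
        elif header and len(cells) == len(header) and cells[0].isdigit():
            row = {name: float(value) for name, value in zip(header, cells)}
            row["Epoch"] = int(cells[0])
            rows.append(row)
    return rows
```

For example, min(parse_training_log(text), key=lambda r: r["Val Loss"]) returns the row of the epoch with the lowest validation loss.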

Results

A notebook named results_analysis.ipynb is available to further analyze training metrics and inference results. To adapt it to your own runs, modify the first cell as follows:

# Replace with your model path
MODEL_DIR = "model_weights/{model_name}/{variant}/{run_name}"
TRAINING_LOG_PATH = f"{MODEL_DIR}/training_log.csv"
CLASSIF_REPORT = f"{MODEL_DIR}/classification_report.csv"

Once updated, click on Run All to generate figures and insights about your classification model and training process, including:

  • Validation & training loss over epochs
  • Evolution of accuracies over epochs
  • Metrics over epochs
  • F1-score vs. support
  • Best and worst classified classes
  • Confusion matrix
  • Most frequently confused classes

This is a starter kit to evaluate your model, which can be easily extended for more advanced analyses.
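
As an example of such an extension, the sketch below counts the most frequently confused class pairs from two parallel lists of true and predicted labels. It is an illustration in plain Python and does not reproduce the notebook's own code; the function name is hypothetical.

```python
from collections import Counter


def most_confused_pairs(true_labels, predicted_labels, top_n=5):
    """Return the top_n most frequent (true, predicted) pairs among the errors."""
    errors = Counter(
        (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
    )
    return errors.most_common(top_n)
```

Each entry of the result has the form ((true_class, predicted_class), count), so the first entry is the confusion that occurs most often.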

Contributing

We welcome all pull requests — from small fixes to big new features.
If you’d like to help improve this project, please check out our CONTRIBUTING.md for guidelines and the contribution workflow.

Citation

If you use this project in your research, please cite it as:

@software{walter2025planktonflow,
  author       = {Walter, Hugo},
  title        = {PlanktonFlow - Deep Learning Classification Pipeline for Automatic Plankton Classification},
  version      = {1.0.0},
  date         = {2025-08-01},
  publisher    = {GitHub},
  url          = {https://github.com/ziraax/PlanktonFlow},
  license      = {MIT}
}

License

This project is open source and distributed under the MIT License.
See the LICENSE.md file for details.


Thanks for checking out this project! I hope it’s useful for the scientific community. A special shout-out to the INRAE team (Rennes, France): this internship was intense but incredibly rewarding, and I learned a ton while developing this tool! 🌱
