On the generalization of probabilistic neural networks for hyperspectral remote sensing of absorption properties in optically complex waters
This repository contains the Python code used in our paper, submitted to Remote Sensing of Environment. The following sections describe the repository and codebase in detail. Text written in italics is work-in-progress. The last section provides a summary containing just the information necessary to reproduce the paper.
Abstract
Machine learning models have steadily improved in estimating inherent optical properties (IOPs) from remote sensing observations. Yet, their generalization ability when applied to new water bodies, beyond those they were trained on, is not well understood. We present a novel approach for assessing model generalization across various scenarios, including interpolation within in situ observation datasets, extrapolation beyond the training scope, and application to hyperspectral observations from the PRecursore IperSpettrale della Missione Applicativa (PRISMA) satellite involving atmospheric correction. We evaluate five probabilistic neural networks (PNNs), including novel architectures like recurrent neural networks, for their ability to estimate absorption at 443 and 675 nm from hyperspectral reflectance. The median symmetric accuracy (MdSA) worsens from ≥25% in interpolation scenarios to ≥50% in extrapolation scenarios, and reaches ≥80% when applied to PRISMA satellite imagery. Across all scenarios, models produce uncertainty estimates exceeding 40%, often reflecting systematic underconfidence. PNNs show better calibration during extrapolation, suggesting an intrinsic awareness of retrieval constraints. To address this miscalibration, we introduce an uncertainty recalibration method that withholds only 10% of the training dataset, yet improves model calibration in 86% of PRISMA evaluations with minimal accuracy trade-offs. The resulting well-calibrated uncertainty estimates enable reliable uncertainty propagation for downstream applications. IOP retrieval uncertainty is predominantly aleatoric (inherent to the observations). Therefore, increasing the number of measurements from the same distribution or selecting a different neural network architecture trained on the same dataset does not enhance model accuracy. Our findings indicate that we have reached a predictability limit in retrieving IOPs using purely data-driven approaches. We therefore advocate embedding physical principles of IOPs into model architectures, creating physics-informed neural networks capable of surpassing current limitations.
Most of the functionality related to data handling, model training, and analysis has been refactored into the `pnn` module. This module handles:
- constructing, training, testing, recalibrating, and applying the neural network models;
- analysing and visualising the model estimates;
- generating outputs for the paper.

A more detailed overview is provided in its documentation.
The code used to (re-)produce the results in the paper is organised into multiple scripts within the top-level folder. Detailed descriptions of the various scripts are provided in the following sections. All scripts use argparse, meaning the help flag `-h` is available, e.g.:

```
python train_nn.py -h
```

will return:
```
usage: train_nn.py [-h] [-o OUTPUT_FOLDER] [-e ESTIMATES_FOLDER] [-p] [-c]
                   [-n N_MODELS]
                   pnn_type

Script for loading data and training a probabilistic neural network.
Trains N networks, evaluates them, and saves their outputs.
Selects the type of network from the first argument: [bnn_dc, bnn_mcd, ens_nn, mdn, rnn].
Note that model files and estimates are saved to default folders and will overwrite existing files; custom locations can be specified (-o, -e).

Example:
    python train_nn.py bnn_mcd
    python train_nn.py bnn_mcd -p
    python train_nn.py bnn_mcd -c
    python train_nn.py bnn_mcd -pc -o path/to/models/ -e path/to/estimates/ -n 10

positional arguments:
  pnn_type              PNN architecture to use

options:
  -h, --help            show this help message and exit
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Folder to save models to.
  -e ESTIMATES_FOLDER, --estimates_folder ESTIMATES_FOLDER
                        Folder to save model estimates to.
  -p, --prisma          Use PRISMA data.
  -c, --recalibrate     Apply recalibration.
  -n N_MODELS, --n_models N_MODELS
                        Number of models to train per scenario (default: 25).
```
The core in situ datasets used in our study originate from GLORIA and SeaBASS. The combined in situ dataset is available from Zenodo *(link to be added)*.
By default, the in situ dataset is located at `datasets_train_test/insitu_data.csv`. The same dataset resampled to the PRISMA spectral bands is also available, at `datasets_train_test/insitu_data_resampled.csv`.
A data file in CSV format with column headers can be split using the `dataset_split.py` script. This script does not require installation of the wider `pnn` module and can be used by itself; it can also be safely imported if you want to re-use its functionality elsewhere. Its dependencies are NumPy, pandas, SciPy, and scikit-learn.
The within-distribution and out-of-distribution splits are implemented through general functions called `system_data_split` and `objective`, which can easily be built upon for new data split types by introducing new scoring functions, as sketched below.
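As an illustration of the scoring-function approach, the sketch below scores a candidate train/test split by the distance between summary statistics of the two subsets. This is not the repository's implementation: the function name, the use of a Euclidean distance, and the choice of summary columns are all assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical summary columns; the real `summary_cols` are specified
# at the start of dataset_split.py and may differ.
summary_cols = ["aph_443", "aph_675"]

def split_score(train: pd.DataFrame, test: pd.DataFrame) -> float:
    """Euclidean distance between the median summary statistics of the
    train and test subsets: small for within-distribution splits,
    large for out-of-distribution splits."""
    train_medians = train[summary_cols].median().to_numpy()
    test_medians = test[summary_cols].median().to_numpy()
    return float(np.linalg.norm(train_medians - test_medians))
```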
The script is called as follows:

```
python dataset_split.py path/to/data.csv
```

Several options are available, as shown by the help flag:

```
python dataset_split.py -h
```
```
usage: dataset_split.py [-h] [-o OUTPUT_FOLDER] [-s SYSTEM_COLUMN]
                        [-t TIMEOUT] [-r RNG]
                        filename

Script for splitting a dataset using random, within-distribution, and out-of-distribution splits.
Data are split on one system column, provided with the `-s` flag (default: "lake_name").
(Dis)similarity scores are evaluated on multiple `summary_cols`, specified at the start of the script.
Please note that the script can slightly run past its `timeout`; this is not a bug.

Example:
    python dataset_split.py datasets_train_test/insitu_data.csv
    python dataset_split.py datasets_train_test/insitu_data.csv -o path/to/outputs/ -s site_name -t 10 -r 42

positional arguments:
  filename              File with data to split.

options:
  -h, --help            show this help message and exit
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        Folder to save files with splits to.
  -s SYSTEM_COLUMN, --system_column SYSTEM_COLUMN
                        Column with system names, on which to split the data.
  -t TIMEOUT, --timeout TIMEOUT
                        Maximum time [min] to spend on each split.
  -r RNG, --rng RNG     Seed for random number generator (RNG).
```
The `dataset_split.py` script will save the resulting dataframes to your chosen location (by default, the working directory) in 6 new CSV files: `random_train_set.csv`, `random_test_set.csv`, `wd_train_set.csv`, `wd_test_set.csv`, `ood_train_set.csv`, and `ood_test_set.csv`.
All other steps in the model training, estimation, and analysis for the in situ scenarios use the resulting split data files (random, within-distribution, out-of-distribution). These files are read using the `pnn.data.read_insitu_data` function. This function can load files from any folder on your computer; it will try to load the aforementioned 6 CSV files from the given folder. By default, it uses the `datasets_train_test` folder within the repository folder; this setting can be changed at `pnn.constants.insitu_data_path`. Example:

```python
train_data, test_data = pnn.read_insitu_data()
```
*Explain data format in Python.*
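Pending that documentation, the split files can be inspected directly with pandas. The sketch below assumes the split CSVs are in the default `datasets_train_test` folder; the exact column contents are not documented here.

```python
import pandas as pd

# Inspect one of the split files directly; the path assumes the default
# data folder used by pnn.data.read_insitu_data.
train = pd.read_csv("datasets_train_test/random_train_set.csv")
print(train.shape)
print(train.columns.tolist())  # reflectance bands and IOP columns
```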
`plot_data.py` generates figures showing the IOP distributions in the input data and the train/test sets in each split scenario. Run it as follows; the resulting figures are saved to `manuscript_figures/full_dataset.pdf` and `manuscript_figures/scenarios.pdf`:

```
python plot_data.py
```
In situ match-up data from PRISMA validation campaigns are used to evaluate the performance of PNN models in realistic scenarios. The in situ match-up data are read using the `pnn.data.read_prisma_insitu` function; by default, this reads `datasets_train_test/prisma_insitu_data.csv`.
Matching measurements from the PRISMA satellite, atmospherically corrected using ACOLITE or the PRISMA L2C processor, are also used. By default, these files are located at `datasets_train_test/prisma_acolite.csv` and `datasets_train_test/prisma_l2c.csv`, respectively.
The `pnn.data.read_prisma_matchups` function is used to read PRISMA match-up data. This function returns two `DataScenario` objects: the first represents the general case (train on the resampled in situ dataset; test on the PRISMA in situ data or the ACOLITE/L2C-processed satellite data), and the second represents the local-knowledge case (train on the resampled in situ and PRISMA in situ match-up data together; test on the ACOLITE/L2C-processed satellite data).
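For illustration, loading the two scenarios might look as follows; the unpacking shown is an assumption based on the description above, not a documented signature.

```python
import pnn

# read_prisma_matchups returns two DataScenario objects (general case and
# local-knowledge case); the variable names here are for illustration only.
general, local_knowledge = pnn.data.read_prisma_matchups()
```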
Trained models can be applied to PRISMA scenes using the `apply_to_prisma.py` script. This script takes all atmospherically corrected (L2C or ACOLITE) NetCDF4 (`.nc`) files in a given folder (`prisma_map_data` by default) and applies a given model type (`bnn_mcd`, `bnn_dc`, etc.) to each. By default, the average-performing instance of that architecture is loaded from `pnn_tf_models`; a custom model file can be specified through the `-m` flag.
A water mask is generated using a normalised difference water index (NDWI). If available, level-1 HDF5 (`.h5`) data for the same scene are used to generate a greyscale background, using an approximated luminance function.
The script saves the results to a new NetCDF4 file so they can be re-used without having to re-run the analysis. It also saves a figure showing the Rrs and IOPs at 443-446 nm.
Please note that due to the number of pixels (up to a million per scene), PNN application can be very slow, especially for RNNs.
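An invocation might look like the following; the positional model-type argument mirrors `train_nn.py` and is an assumption here, while the `-m` flag is described above.

```
python apply_to_prisma.py bnn_mcd
python apply_to_prisma.py bnn_mcd -m path/to/custom_model.zip
```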
*Plot script*
Neural network models are trained using the `train_nn.py` script. This script uses the architectures defined in the `pnn.nn` submodule, currently:
- `bnn_dc` (Bayesian Neural Network with DropConnect),
- `bnn_mcd` (Bayesian Neural Network with Monte Carlo Dropout),
- `ens_nn` (Ensemble Neural Network),
- `mdn` (Mixture Density Network),
- `rnn` (Recurrent Neural Network).

These architectures share the same backbone, namely the `BasePNN` class, ensuring consistency and making it simple to add new architectures in the future. The base class and its descendants are all implemented using TensorFlow/Keras; a sketch of a new subclass is given below.
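As an illustration of extending the backbone, a new architecture might be added roughly as follows. The `BasePNN` interface shown here (the import path and the overridden method name) is an assumption for this sketch, not the actual class contract.

```python
import tensorflow as tf
from pnn.nn import BasePNN  # assumed import path for the backbone class

class MyPNN(BasePNN):
    """Hypothetical new probabilistic architecture; the method to override
    is assumed for illustration and may differ in the real BasePNN."""

    def build_model(self, n_inputs: int, n_outputs: int) -> tf.keras.Model:
        # A minimal Keras model; a real PNN would add a probabilistic head.
        return tf.keras.Sequential([
            tf.keras.layers.Input(shape=(n_inputs,)),
            tf.keras.layers.Dense(64, activation="relu"),
            tf.keras.layers.Dense(n_outputs),
        ])
```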
To train one of these architectures, call the `train_nn.py` script with the architecture abbreviation as a command-line argument, e.g. to train a BNN-DC:

```
python train_nn.py bnn_dc
```

By default, the script will load the in situ data and train/evaluate on its corresponding scenarios in sequence. To use the PRISMA data scenarios (general or local knowledge), use the `-p` flag, e.g.:

```
python train_nn.py bnn_dc -p
```

To train models with recalibration, simply add the `-c` flag, e.g.:

```
python train_nn.py bnn_dc -c
```

or

```
python train_nn.py bnn_dc -pc
```

Lastly, you can specify the number of models to train/evaluate for each scenario using the `-n` argument; the default value is 25. For example, to train only 5 models, use:

```
python train_nn.py bnn_dc -n 5
```
Trained models are saved to the location given by the `pnn.model_path` constant; by default, this is the `pnn_tf_models` folder. Models are saved in ZIP files containing TensorFlow/Keras (`.keras`) files, re-scaling functions for X and y, and recalibration functions (where appropriate).
A saved model can be loaded using the `.load` function of the relevant class, e.g. `BNN_DC.load` or `RNN.load`. The ENS-NN (ensemble neural network) is saved as a ZIP file containing its constituent networks within the general ZIP file, but otherwise works the same; the `Ensemble.save` and `Ensemble.load` functions automatically take care of filename extensions. Recalibrated models are saved and loaded with their corresponding recalibration functions. An example of loading a saved model is given below.
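For example, loading a model might look as follows; the import path and the ZIP filename are assumptions for this sketch.

```python
from pnn.nn import BNN_DC  # assumed import path

# Load a saved BNN-DC from the default models folder; the exact filename
# below is hypothetical.
model = BNN_DC.load("pnn_tf_models/bnn_dc.zip")
```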
- `analyze_estimates.py` - Analyze PNN model outputs to generate figures and statistics.
- `apply_to_prisma.py` - Apply PNNs to PRISMA scenes and plot the results.
Run `dataset_split.py` as follows:
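(This is the example invocation from the data-splitting section above; optional flags are described there.)

```
python dataset_split.py datasets_train_test/insitu_data.csv
```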
Run `train_nn.py` as follows (with/without flags), once per architecture:
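(These are the example invocations from the training section above, shown for `bnn_mcd`; repeat for the other architectures: `bnn_dc`, `ens_nn`, `mdn`, `rnn`.)

```
python train_nn.py bnn_mcd
python train_nn.py bnn_mcd -p
python train_nn.py bnn_mcd -c
python train_nn.py bnn_mcd -pc
```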
Figures 1 and 2 were generated using `plot_data.py`, with no command-line arguments.
Figure 3 was made by hand.
Figure 4 was generated using `plot_calibration_example.py`, with no command-line arguments.
Run `analyze_estimates.py` as follows (with/without flags):
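(The flags for this script are not documented above; as with all scripts in this repository, they can be listed with `-h`. The bare invocation below is an assumption.)

```
python analyze_estimates.py
```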
Run `apply_to_prisma.py` as follows (with/without flags):
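(As in the PRISMA application section above, the positional model-type argument shown is an assumption; flags can be listed with `-h`.)

```
python apply_to_prisma.py bnn_mcd
```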