Official Benchmark Implementation
The original paper establishing the FD-Shifts benchmark was presented as an Oral at ICLR 2023 (top 5%).
→ project page → paper link
Our follow-up study on Failure Detection in Medical Image Classification was presented at MICCAI 2023.
→ project page → paper link → interactive tool SF-Visuals
Our paper on a revised evaluation protocol for Selective Classification Systems was accepted as a Spotlight paper at NeurIPS 2024.
→ project page → paper link → AUGRC implementation
Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all claim to address the same eventual goal of detecting failures of a classifier upon real-life application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study for the first time enabling benchmarking confidence scoring functions w.r.t. all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation in the abundance of publicized research on confidence scoring.
Holistic perspective on failure detection. Detecting failures should be seen in the context of the overarching goal of preventing silent failures of a classifier, which includes two tasks: preventing failures in the first place as measured by the "robustness" of a classifier (Task 1), and detecting the non-prevented failures by means of CSFs (Task 2, focus of this work). For failure prevention across distribution shifts, a consistent task formulation exists (featuring accuracy as the primary evaluation metric) and various benchmarks have been released covering a large variety of realistic shifts (e.g. image corruption shifts, sub-class shifts, or domain shifts). In contrast, progress in the subsequent task of detecting the non-prevented failures by means of CSFs is currently obstructed by three pitfalls: 1) A diverse and inconsistent set of evaluation protocols for CSFs exists (MisD, SC, PUQ, OoD-D) impeding comprehensive competition. 2) Only a fraction of the spectrum of realistic distribution shifts and thus potential failure sources is covered diminishing the practical relevance of evaluation. 3) The task formulation in OoD-D fundamentally deviates from the stated purpose of detecting classification failures. Overall, the holistic perspective on failure detection reveals an obvious need for a unified and comprehensive evaluation protocol, in analogy to current robustness benchmarks, to make classifiers fit for safety-critical applications. Abbreviations: CSF: Confidence Scoring Function, OoD-D: Out-of-Distribution Detection, MisD: Misclassification Detection, PUQ: Predictive Uncertainty Quantification, SC: Selective Classification
If you use FD-Shifts, please cite our paper:
@inproceedings{
jaeger2023a,
title={A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification},
author={Paul F Jaeger and Carsten Tim L{\"u}th and Lukas Klein and Till J. Bungert},
booktitle={International Conference on Learning Representations},
year={2023},
url={https://openreview.net/forum?id=YnkGMIh0gvX}
}
- Citing This Work
- Table Of Contents
- Installation
- How to Integrate Your Own Usecase
- Reproducing our results
- Working with FD-Shifts
- Acknowledgements
FD-Shifts requires Python version 3.10 or later. It is recommended to install FD-Shifts in its own environment (venv, conda environment, ...).
- Install an appropriate version of PyTorch. Check that CUDA is available and that the CUDA toolkit version is compatible with your hardware. The minimum required PyTorch version is currently v1.11.0. Testing and development were done with the PyTorch build for CUDA 11.3.
- Install FD-Shifts. This will pull in all dependencies, including some version of PyTorch, so it is strongly recommended that you install a compatible version of PyTorch beforehand. This will also make the fd-shifts CLI available to you.
pip install git+https://github.com/iml-dkfz/fd-shifts.git
To learn about extending FD-Shifts with your own models, datasets, and confidence scoring functions, check out the tutorial on extending FD-Shifts.
The following section on working with FD-Shifts describes general usage. Instructions for reproducing specific publications are documented on the respective project pages:
- "A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification"
- "Understanding Silent Failures in Medical Image Classification"
- "Overcoming Common Flaws in the Evaluation of Selective Classification Systems"
To use fd-shifts you need to set the following environment variables:
export EXPERIMENT_ROOT_DIR=/absolute/path/to/your/experiments
export DATASET_ROOT_DIR=/absolute/path/to/datasets
Alternatively, you may write them to a file and source that before running fd-shifts, e.g.
mv example.env .env
Then edit .env to your needs and run
source .env
To get an overview of available subcommands, run fd-shifts --help.
For the predefined experiments we expect the data to be in the following folder structure relative to the folder you set for $DATASET_ROOT_DIR.
<$DATASET_ROOT_DIR>
├── breeds
│   └── ILSVRC ⇒ ../imagenet/ILSVRC
├── imagenet
│   └── ILSVRC
├── cifar10
├── cifar100
├── corrupt_cifar10
├── corrupt_cifar100
├── svhn
├── tinyimagenet
├── tinyimagenet_resize
├── wilds_animals
│   └── iwildcam_v2.0
└── wilds_camelyon
    └── camelyon17_v1.0
For information regarding where to download these datasets from and what you have to do with them please check out the dataset documentation.
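Before launching experiments, it can help to check that the dataset folder matches the layout above. The following is a minimal sketch, not part of FD-Shifts: the helper name missing_dataset_dirs and the list EXPECTED_DATASET_DIRS are hypothetical, and the list simply mirrors the folder tree shown above.

```python
import os
from pathlib import Path

# Sub-folders copied from the folder tree above (hypothetical helper list,
# not an FD-Shifts API). Nested entries are given as relative paths.
EXPECTED_DATASET_DIRS = [
    "breeds/ILSVRC",
    "imagenet/ILSVRC",
    "cifar10",
    "cifar100",
    "corrupt_cifar10",
    "corrupt_cifar100",
    "svhn",
    "tinyimagenet",
    "tinyimagenet_resize",
    "wilds_animals/iwildcam_v2.0",
    "wilds_camelyon/camelyon17_v1.0",
]


def missing_dataset_dirs(root):
    """Return the expected sub-folders that do not exist under root.

    Path.is_dir() follows symlinks, so a breeds/ILSVRC link that points
    to ../imagenet/ILSVRC counts as present if its target exists.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_DATASET_DIRS if not (root / rel).is_dir()]


if __name__ == "__main__":
    root = os.environ.get("DATASET_ROOT_DIR")
    if root is None:
        raise SystemExit("DATASET_ROOT_DIR is not set")
    missing = missing_dataset_dirs(root)
    if missing:
        print("Missing dataset folders:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All expected dataset folders are present.")
```

Only the datasets needed for the experiments you plan to run have to be present, so a non-empty result is not necessarily an error.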
To get a list of all fully qualified names for all experiments in the paper, use
fd-shifts list-experiments
To run training for a specific experiment:
fd-shifts train --experiment=svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2
Alternatively, run training from a custom configuration file:
fd-shifts train --config=path/to/config/file
Check out fd-shifts train --help for more training options.
The launch subcommand allows for running multiple experiments, e.g. filtered by dataset:
fd-shifts launch --mode=train --dataset=cifar10
Check out fd-shifts launch --help for more filtering options. You can add custom experiment filters via the register_filter decorator. See experiments/launcher.py for an example.
All pretrained model weights used for "A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification" can be found on Zenodo under the following links:
- iWildCam-2020-Wilds
- iWildCam-2020-Wilds (OpenSet Training)
- BREEDS-ENTITY-13
- CAMELYON-17-Wilds
- CIFAR-100
- CIFAR-100 (superclasses)
- CIFAR-10
- SVHN
- SVHN (OpenSet Training)
To run inference for one of the experiments:
fd-shifts test --experiment=svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2
Analogously, with the launch subcommand:
fd-shifts launch --mode=test --dataset=cifar10
To run analysis for one of the experiments:
fd-shifts analysis --experiment=svhn_paper_sweep/devries_bbsvhn_small_conv_do1_run1_rew2.2
To run analysis over an already available set of inference outputs, the outputs have to be in the following format:
For a classifier with d outputs, N samples in total (over all tested datasets), and M dropout samples:
raw_logits.npz
Nx(d+2)
  0, 1, ...                 d−1,   d,      d+1
┌───────────────────────────────┬───────┬─────────────┐
|           logits_1            | label | dataset_idx |
├───────────────────────────────┼───────┼─────────────┤
|           logits_2            | label | dataset_idx |
├───────────────────────────────┼───────┼─────────────┤
|           logits_3            | label | dataset_idx |
└───────────────────────────────┴───────┴─────────────┘
                           .
                           .
                           .
┌───────────────────────────────┬───────┬─────────────┐
|           logits_N            | label | dataset_idx |
└───────────────────────────────┴───────┴─────────────┘
external_confids.npz
Nx1
raw_logits_dist.npz
NxdxM
  0, 1, ...                  d−1
┌───────────────────────────────┐
|  logits_1 (Dropout Sample 1)  |
|  logits_1 (Dropout Sample 2)  |
|               .               |
|               .               |
|               .               |
|  logits_1 (Dropout Sample M)  |
├───────────────────────────────┤
|  logits_2 (Dropout Sample 1)  |
|  logits_2 (Dropout Sample 2)  |
|               .               |
|               .               |
|               .               |
|  logits_2 (Dropout Sample M)  |
├───────────────────────────────┤
|  logits_3 (Dropout Sample 1)  |
|  logits_3 (Dropout Sample 2)  |
|               .               |
|               .               |
|               .               |
|  logits_3 (Dropout Sample M)  |
└───────────────────────────────┘
                .
                .
                .
┌───────────────────────────────┐
|  logits_N (Dropout Sample 1)  |
|  logits_N (Dropout Sample 2)  |
|               .               |
|               .               |
|               .               |
|  logits_N (Dropout Sample M)  |
└───────────────────────────────┘
external_confids_dist.npz
NxM
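The file layouts above can be produced from in-memory arrays roughly as follows. This is a sketch under assumptions, not the FD-Shifts writer: the function names pack_raw_logits and save_outputs are hypothetical, and the arrays are stored under NumPy's default key arr_0 (what np.savez_compressed uses for a positional argument), which FD-Shifts may or may not expect. The shapes follow the sizes stated above, in particular NxdxM for the dropout logits.

```python
from pathlib import Path

import numpy as np


def pack_raw_logits(logits, labels, dataset_idx):
    """Build the N x (d+2) array: d logit columns, then label, then dataset_idx."""
    logits = np.asarray(logits, dtype=np.float64)
    if logits.ndim != 2:
        raise ValueError("logits must have shape (N, d)")
    labels = np.asarray(labels, dtype=np.float64).reshape(-1, 1)
    dataset_idx = np.asarray(dataset_idx, dtype=np.float64).reshape(-1, 1)
    if not (logits.shape[0] == labels.shape[0] == dataset_idx.shape[0]):
        raise ValueError("logits, labels and dataset_idx must all have length N")
    return np.concatenate([logits, labels, dataset_idx], axis=1)


def save_outputs(out_dir, logits, labels, dataset_idx, confids,
                 logits_dist=None, confids_dist=None):
    """Write the .npz files described above into out_dir.

    Assumption: each file holds a single array under the default key "arr_0".
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    raw = pack_raw_logits(logits, labels, dataset_idx)
    n, d = raw.shape[0], raw.shape[1] - 2

    np.savez_compressed(out_dir / "raw_logits.npz", raw)
    # external_confids.npz: one confidence score per sample, shape N x 1.
    np.savez_compressed(out_dir / "external_confids.npz",
                        np.asarray(confids, dtype=np.float64).reshape(n, 1))

    if logits_dist is not None:
        logits_dist = np.asarray(logits_dist, dtype=np.float64)
        if logits_dist.ndim != 3 or logits_dist.shape[:2] != (n, d):
            raise ValueError("logits_dist must have shape (N, d, M)")
        np.savez_compressed(out_dir / "raw_logits_dist.npz", logits_dist)

    if confids_dist is not None:
        confids_dist = np.asarray(confids_dist, dtype=np.float64)
        if confids_dist.ndim != 2 or confids_dist.shape[0] != n:
            raise ValueError("confids_dist must have shape (N, M)")
        np.savez_compressed(out_dir / "external_confids_dist.npz", confids_dist)
```

The two dropout files are optional in this sketch, so the same helper also covers deterministic models without dropout samples.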
To load inference output from different locations than $EXPERIMENT_ROOT_DIR, you can specify one or multiple directories in the FD_SHIFTS_STORE_PATH environment variable (multiple paths are separated by :):
export FD_SHIFTS_STORE_PATH=/absolute/path/to/fd-shifts/inference/output
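To illustrate the colon-separated convention, the sketch below splits such a variable into individual directories. It is only an illustration of the format: store_paths is a hypothetical helper, and FD-Shifts' own parsing may differ in details such as handling of empty entries.

```python
import os
from pathlib import Path


def store_paths(environ=None):
    """Split FD_SHIFTS_STORE_PATH on ':' and return the entries as Path objects.

    Hypothetical helper for illustration; empty entries are skipped.
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("FD_SHIFTS_STORE_PATH", "")
    return [Path(entry) for entry in raw.split(":") if entry]


if __name__ == "__main__":
    for path in store_paths():
        status = "exists" if path.is_dir() else "missing"
        print(f"{path} ({status})")
```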
You may also use the ExperimentData class to load your data in another way.
You also have to provide an adequate config, where all test datasets and query parameters are set. Check out the config files in fd_shifts/configs including the dataclasses. Importantly, the dataset_idx has to match up with the list of datasets you provide and whether or not val_tuning is set. If val_tuning is set, the validation set takes over dataset_idx=0.
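A quick way to check the dataset_idx column against your config is to count the samples per index in the raw_logits array. The helper below is a hypothetical sketch, not an FD-Shifts function; it relies only on the layout described above, where dataset_idx is the last column.

```python
import numpy as np


def samples_per_dataset_idx(raw_logits):
    """Count samples per dataset_idx, read from the last column of raw_logits."""
    raw_logits = np.asarray(raw_logits)
    if raw_logits.ndim != 2 or raw_logits.shape[1] < 3:
        raise ValueError("raw_logits must have shape (N, d+2) with d >= 1")
    dataset_idx = raw_logits[:, -1].astype(int)
    values, counts = np.unique(dataset_idx, return_counts=True)
    return {int(v): int(c) for v, c in zip(values, counts)}
```

With val_tuning set, the count reported for index 0 should equal the size of your validation set, and the remaining indices should line up with the test datasets listed in the config.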