Evaluation, Reproducibility, Benchmarks Meeting 23
Date: 27th September 2023
- What happens after selecting metrics
- 3-step approach:
  - Work on experimental evidence
  - Perform Delphi process
  - Have implementation in MONAI/scikit-learn (no conflict between the two)
  - Comment from Nicola: Keep the MONAI Core group in the loop to decide whether the implementation should be done in MONAI Core directly or as an independent package (Nicola will present the project to them)
- Project scope:
  - Assumptions:
    - Performance varies across tasks, organs, algorithms focused on the same task and/or organ, seeds, data sampling, test sample size, training sample size, internal validation methods (e.g. k-fold CV, type of train/test splits), hyperparameter tuning at the learning stage
    - Extras: number of folds during k-fold CV, data augmentation
  - Idea: Evaluate sources of variance at testing (pre-trained models + whole benchmarking process)
- Assumptions:
  - Two tracks: Pre-trained models and models that require training (“learners”)
- [Bouthillier et al., 2021]:
  - Findings: differences due to arbitrary factors (e.g. data order) are major contributors to uncontrolled variation in model performance
- Idea: evaluation within the context of medical imaging. Explore variance across tasks, organs, and methods (following the whole benchmarking process, incl. model training)
- Experiments: sources of variance at testing
- Focus on segmentation: Decathlon data (2018)
- By exploring a wide range of experiments, derive estimates (approx. “rules of thumb”):
  - Expected variance depending on sample size (see the sketch after this list)
    - Challenge such rules of thumb with learning curves using only test data (pre-trained models) and both test and training data (training from scratch, i.e. “learners”)
  - Expected uncertainty (as expressed by variance) across tasks, organs, and modalities; variance of a metric across a test set
  - => Comment on sample size: if we leave out samples, we may exclude difficult cases/outliers
  - => Experiments should contain two tracks: 1) preserving prevalences in the data and 2) disregarding them to highlight effects
  - Take many different approaches into account to get evidence on our rules of thumb
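As a concrete illustration of the first rule of thumb (expected variance as a function of test sample size), the following is a minimal sketch, not the project's experiment code: the per-case Dice scores are simulated placeholders standing in for a pre-trained model evaluated on real test cases.

```python
# Minimal sketch: how the spread of the mean test-set metric shrinks with test
# sample size, estimated by repeated subsampling of per-case scores.
# `dice_per_case` is a simulated placeholder for real per-image Dice scores.
import numpy as np

rng = np.random.default_rng(seed=0)
dice_per_case = rng.beta(8, 2, size=500)  # hypothetical per-case Dice scores

def mean_metric_spread(scores, sample_size, n_repeats=1000):
    """Empirical std of the mean metric when only `sample_size` test cases are used."""
    means = [
        rng.choice(scores, size=sample_size, replace=False).mean()
        for _ in range(n_repeats)
    ]
    return np.std(means, ddof=1)

for n in (10, 25, 50, 100, 250):
    print(f"test set size {n:>3}: std of mean Dice ≈ {mean_metric_spread(dice_per_case, n):.4f}")
```

The same loop run on real per-case scores from the Decathlon experiments would give the empirical basis for a sample-size rule of thumb.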
- Further idea: “Can I trust the dataset I have tested on?” => Is the dataset representative?
  - For now: assume that the data we have is representative
- Sources of variance at testing:
  - Test set sample size
  - Type of organ
  - Type of algorithm
  - Heterogeneity of images within the test set
  - Think of extending the list with dataset points
- Moving beyond the test set:
  - Learning phase: seeds, data sampling, training sample size, internal validation methods (e.g. k-fold CV, type of train/test splits), hyperparameter tuning, number of folds during k-fold CV, data augmentation
  - Extra: effect of the empirical standard deviation across folds of a k-fold CV (see the sketch below)
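Since scikit-learn is one of the planned implementation targets, here is a minimal sketch of the "empirical standard deviation across folds" idea; the dataset and classifier are synthetic placeholders, not the benchmarking setup itself.

```python
# Minimal sketch: empirical standard deviation of a metric across k-fold CV folds,
# for several choices of k (the number of folds is itself a source of variance).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

for k in (3, 5, 10):
    scores = cross_val_score(model, X, y, cv=k, scoring="accuracy")
    print(f"k={k:>2}: mean={scores.mean():.3f}, "
          f"empirical std across folds={scores.std(ddof=1):.3f}")
```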
- Show how model selection (here: ranking) changes as a function of variability parameters (e.g., sample size)
- Eva did initial experiments with the Decathlon data and plotted mean performance vs. CI width (across all tasks, for each algorithm). The trend shows: the higher the mean, the smaller the CI
  - For the future: stratify per task
  - Connect algorithm points across the two metrics (color-coding?) => see how the order is preserved across metrics
  - Use the same scale on both axes
  - Recalculate rankings (Annika will provide the code; a generic sketch of the idea follows below)
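Annika's ranking code is not included in these notes; the block below is only a generic sketch, assuming a hypothetical table of per-case metric values per algorithm: bootstrap the test cases, recompute the mean performance and the resulting ranking per replicate, and report rank stability alongside mean performance vs. 95% CI width.

```python
# Minimal sketch (not Annika's ranking code): ranking stability under
# bootstrapping of the test set, plus mean performance vs. CI width.
# `scores` is a hypothetical (n_cases x n_algorithms) array of per-case values.
import numpy as np

rng = np.random.default_rng(seed=0)
algorithms = ["algo_A", "algo_B", "algo_C"]          # hypothetical algorithms
scores = rng.beta([8, 7, 6], 2, size=(100, 3))       # placeholder per-case scores

n_boot = 2000
boot_means = np.empty((n_boot, len(algorithms)))
for b in range(n_boot):
    idx = rng.integers(0, scores.shape[0], size=scores.shape[0])  # resample cases
    boot_means[b] = scores[idx].mean(axis=0)

# Rank 1 = best (highest bootstrap mean) in each replicate
ranks = (-boot_means).argsort(axis=1).argsort(axis=1) + 1

for j, name in enumerate(algorithms):
    lo, hi = np.percentile(boot_means[:, j], [2.5, 97.5])
    print(f"{name}: mean={scores[:, j].mean():.3f}, "
          f"95% CI width={hi - lo:.3f}, P(rank 1)={np.mean(ranks[:, j] == 1):.2f}")
```

Repeating this with subsampled test sets of different sizes would show directly how the ranking changes as a function of sample size, as discussed above.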
- End goal: Provide concrete guidance and recommendations on how to best evaluate an algorithm
- Trusting algorithm results:
  - Model A vs. Model B
  - Acceptable variability in model performance at testing
  - Dataset size
  - Task
  - Trade-off: clinical usefulness vs. the computationally best model (predictive performance) => this point was decided to be out of scope
  - Link to reproducibility actions (at least statistical reproducibility)
- Open question: how would we support users in following the recommendations?
- Missing/further thoughts:
  - Hierarchical data aggregation
  - How to construct confidence intervals
  - Can we predict CIs based on limited data? (see the sketch at the end of these notes)
  - Keep reality in mind (large models, resources, carbon footprint)
- Consortium: Identify 3-4 most common use cases on fingerprint level that we can implement
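For the open questions on constructing confidence intervals and predicting them from limited data, a minimal sketch assuming a hypothetical vector of per-case scores: it builds a percentile-bootstrap CI on a small pilot set and uses a normal (1/√n) approximation to predict the CI width for a larger planned test set.

```python
# Minimal sketch: percentile-bootstrap CI from a small pilot test set, plus a
# normal-approximation prediction of the CI width for a larger test set.
# Per-case scores are simulated placeholders.
import numpy as np

rng = np.random.default_rng(seed=0)
pilot = rng.beta(8, 2, size=30)   # hypothetical per-case scores on a pilot set
n_planned = 200                   # planned size of the full test set

# Percentile-bootstrap 95% CI for the mean on the pilot data
boot_means = np.array([
    rng.choice(pilot, size=pilot.size, replace=True).mean()
    for _ in range(5000)
])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"pilot (n={pilot.size}): mean={pilot.mean():.3f}, 95% CI=({lo:.3f}, {hi:.3f})")

# Predicted 95% CI width at the planned test set size (normal approximation)
predicted_width = 2 * 1.96 * pilot.std(ddof=1) / np.sqrt(n_planned)
print(f"predicted 95% CI width at n={n_planned}: {predicted_width:.3f}")
```

Whether such a prediction holds for heterogeneous medical imaging test sets is exactly what the planned experiments would need to check.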