Minutes Data Working Group 9 Jul 2020
- Update to MONAI Board on Data Working Group activities
- Discuss synergy with Evaluation, Reproducibility & Benchmarking workgroup
- Breadth of data: https://arxiv.org/ftp/arxiv/papers/2005/2005.03501.pdf
- Bias of data (patient population, equipment): https://arxiv.org/ftp/arxiv/papers/1910/1910.04071.pdf (see Methods 24a)
- Use of ontologies: https://arxiv.org/pdf/2003.10299.pdf
- Discuss desired data structure pipeline from MONAI 0.2 onward
- Check in on additional scope to define for the Data Working Group
- Brad to prepare content for next week’s MONAI board meeting on activities / motion to adopt
- Update from joint discussion with Evaluation, Reproducibility & Benchmarking workgroup
- Effort to get challenge data structured properly for sharing
- The Data Working Group looks at the “what” of the data coming in
- Difficult to isolate pure data properties related to the experiments
- E.g., although the benchmarking group is looking at this problem with cross-validation, some data fields like “type of scanner” might not be in that workgroup’s scope
- Can DICOM help?
- Potentially - there are fields that could be populated - but this isn’t a solution for all kinds of data - what about when we leave DICOM?
- E.g., what about when it is NIfTI - no metadata is included, and there’s only a free-text header field limited to 80 characters (see the sketch below)
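A minimal sketch of the metadata gap discussed above, assuming pydicom and nibabel (neither library was named in the meeting); the file paths and copied fields are illustrative:

```python
import pydicom
import nibabel as nib

# DICOM carries acquisition metadata as typed fields.
ds = pydicom.dcmread("slice_0001.dcm")
print(ds.Manufacturer)            # scanner vendor
print(ds.ManufacturerModelName)   # scanner model
print(ds.MagneticFieldStrength)   # (0018,0087), MR only

# NIfTI-1 keeps almost none of this: the only free-text slot is the
# 80-byte `descrip` header field.
img = nib.load("volume.nii.gz")
note = f"{ds.Manufacturer} {ds.ManufacturerModelName}"
img.header["descrip"] = note.encode()[:80]  # anything past 80 bytes is lost
nib.save(img, "volume_annotated.nii.gz")
```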
- The compilation task should give researchers the ability to extract this metadata painlessly
- Reproducible I/O
- I/O Working Group is looking at writing out a file that can be used to repeat a training session
- Look at MLflow (https://mlflow.org/) - it captures the entire environment and flow, down to the network architecture (see the sketch below)
- Some networks may not need to be reproducible
- Also consider factors that affect reproducibility like hardware and drivers
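A minimal sketch of capturing the environment with MLflow’s tracking API; the run name and logged keys are illustrative, and PyTorch is assumed as the framework:

```python
import platform
import sys

import mlflow
import torch

with mlflow.start_run(run_name="training-session"):
    # Record the software environment needed to repeat the session.
    mlflow.log_params({
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "torch": torch.__version__,
        "cuda": str(torch.version.cuda),
        "cudnn": str(torch.backends.cudnn.version()),
    })
    # Hardware and drivers also affect reproducibility (see the note above).
    if torch.cuda.is_available():
        mlflow.log_param("gpu", torch.cuda.get_device_name(0))
    # ... training loop; metrics via mlflow.log_metric(...) ...
```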
- Compilation pipeline - is it possible to detect when data was transformed differently than how someone else transformed it? (see the sketch below)
- E.g., a warning on "damaging the data"
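One possible shape for such a warning, as a hedged sketch: fingerprint the transform parameters together with the transformed array, and warn when the fingerprint differs from a previously recorded reference. All names here are hypothetical:

```python
import hashlib
import json
import warnings

import numpy as np

def fingerprint(array: np.ndarray, transform_params: dict) -> str:
    """Hash the transform description together with the transformed voxels."""
    h = hashlib.sha256()
    h.update(json.dumps(transform_params, sort_keys=True).encode())
    h.update(np.ascontiguousarray(array).tobytes())
    return h.hexdigest()

def check_against_reference(array, transform_params, reference_fp):
    """Warn if this preprocessing diverges from the reference pipeline."""
    fp = fingerprint(array, transform_params)
    if fp != reference_fp:
        warnings.warn("data was transformed differently than the reference "
                      "pipeline; results may not be comparable")
    return fp
```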
- MSD (Medical Segmentation Decathlon) in MONAI 0.2
- Random seed feature - training data is split into training / validation folds for cross-validation
- Implemented “import MSD dataset” - this pulls the data and parses the dataset JSON (see the sketch below)
- Automatic / grid search over the learning rate - compute cross-validation experiments
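A minimal sketch of the MSD import, assuming monai.apps.DecathlonDataset as it shipped around MONAI 0.2 (with LoadNiftid as the loading transform of that era); root_dir and task are illustrative:

```python
from monai.apps import DecathlonDataset
from monai.transforms import LoadNiftid

transform = LoadNiftid(keys=["image", "label"])

# `seed` makes the training/validation split reproducible, `val_frac` sets
# the held-out fraction; `download=True` pulls the MSD archive and parses
# its dataset.json.
train_ds = DecathlonDataset(root_dir="./data", task="Task04_Hippocampus",
                            section="training", transform=transform,
                            download=True, seed=0, val_frac=0.2)
val_ds = DecathlonDataset(root_dir="./data", task="Task04_Hippocampus",
                          section="validation", transform=transform,
                          download=False, seed=0, val_frac=0.2)
```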
- “Read-only” Superset of Data
- Experiments derive the data they run on, which could be a subset of that superset
- Is the read-only version of that data converted ahead of every experiment, or is it computed at run time?
- Boundary between super-optimized and super-flexible code
- Representations that can be regenerated - should they be cached? (if so, consider checksums to make sure the cache is still valid; see the sketch below)
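A minimal sketch of a checksum-guarded cache for a regenerable representation; the paths and the regeneration function are hypothetical:

```python
import hashlib
import json
from pathlib import Path

import numpy as np

def source_checksum(path: str) -> str:
    """Checksum of the source file the cached representation derives from."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def load_or_regenerate(src: str, cache: str, regenerate) -> np.ndarray:
    """Return the cached array if its source is unchanged, else rebuild it."""
    meta = Path(cache + ".json")
    checksum = source_checksum(src)
    if Path(cache).exists() and meta.exists():
        if json.loads(meta.read_text())["checksum"] == checksum:
            return np.load(cache)           # cache still valid
    arr = regenerate(src)                   # expensive conversion
    np.save(cache, arr)                     # pass a path ending in .npy
    meta.write_text(json.dumps({"checksum": checksum}))
    return arr
```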