Minutes Data Working Group 9 Jul 2020


Minutes

  • Brad to prepare content for next week’s MONAI board meeting on working group activities and a motion to adopt
  • Update from joint discussion with Evaluation, Reproducibility & Benchmarking workgroup
    • Effort to get challenge data structured properly for sharing
    • The Data Working Group looks at the “what” of the data coming in
    • Difficult to isolate pure data properties related to the experiments
      • E.g., although the benchmarking group is looking at this problem via cross-validation, some data fields like “type of scanner” might fall outside the workgroup’s scope
    • Can DICOM help?
      • Potentially: there are fields that could be populated, but this isn’t a solution for all kinds of data. What about when we leave DICOM?
        • E.g., when the data is NIfTI, no structured metadata is included, and there’s only a free-text header field limited to 80 characters (see the sketch below)
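To make the NIfTI limitation above concrete, here is a minimal sketch using nibabel (not discussed in the meeting; the file path is a placeholder). The NIfTI-1 header’s only free-text slot is the fixed 80-byte `descrip` field:

```python
# Minimal sketch: the only free-text metadata slot in a NIfTI-1 header.
# Assumes nibabel is installed; "scan.nii.gz" is a hypothetical path.
import nibabel as nib

img = nib.load("scan.nii.gz")

# 'descrip' is a fixed 80-byte buffer -- the 80-character limit noted above.
print(img.header["descrip"])

# Anything longer than 80 bytes is cut down to fit; structured metadata
# (scanner type, protocol, etc.) simply has nowhere to live.
img.header["descrip"] = b"scanner=ExampleVendor; field=3T"
```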
  • The compilation task should give researchers the ability to extract data painlessly
  • Reproducible IO
    • The I/O Working Group is looking at emitting a file that can be used to repeat a training session
    • Look at MLflow (https://mlflow.org/): it captures the entire environment and flow, down to the network architecture (see the sketch after this list)
    • Some networks may not need to be reproducible
    • Also consider factors that affect reproducibility like hardware and drivers
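For reference, a minimal sketch of the kind of run capture MLflow’s tracking API provides (parameter names and values here are illustrative, not from the meeting):

```python
# Illustrative MLflow tracking sketch (all values hypothetical).
# Hardware/driver factors are not captured automatically, so they are
# recorded explicitly as tags here.
import mlflow
import torch

with mlflow.start_run():
    mlflow.log_param("learning_rate", 1e-4)                  # hyperparameter
    mlflow.log_param("seed", 42)                             # for reproducibility
    mlflow.set_tag("cuda_version", str(torch.version.cuda))  # driver/stack info
    mlflow.log_metric("val_dice", 0.85)                      # hypothetical result
    mlflow.log_artifact("transforms.json")                   # hypothetical pipeline spec
```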
  • Compilation pipeline: is it possible to detect when data was transformed differently than how someone else transformed it?
    • E.g., a warning about "damaging the data" (one possible mechanism is sketched below)
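One way such a warning could work (purely a sketch; none of these names come from the meeting) is to checksum a canonical description of the transform pipeline and compare it against the checksum recorded with the shared data:

```python
# Hypothetical transform-pipeline checksum sketch.
import hashlib
import json
import warnings

def pipeline_checksum(transform_spec: dict) -> str:
    """Hash a canonical (sorted-key) JSON serialization of the pipeline."""
    canonical = json.dumps(transform_spec, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Checksum shipped alongside the shared data (placeholder value).
recorded = "<checksum recorded by the original preparer>"
mine = pipeline_checksum({"orientation": "RAS", "spacing": [1.0, 1.0, 1.0]})

if mine != recorded:
    warnings.warn("Transform pipeline differs from the one used to prepare this data")
```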
  • MSD in MONAI 0.2
    • Random seed feature: data is split into training/validation sets for cross-validation
    • Implemented "import MSD dataset": this pulls the data and parses the JSON (see the sketch after this list)
    • Automatic/grid search over the learning rate to compute cross-validation experiments
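For context, a rough sketch of the MONAI 0.2 feature described above (exact arguments may differ): `DecathlonDataset` downloads an MSD task archive, parses its dataset JSON, and uses the seed to control the training/validation split:

```python
# Rough sketch of MONAI 0.2's "import MSD dataset" support.
from monai.apps import DecathlonDataset
from monai.transforms import LoadNiftid

dataset = DecathlonDataset(
    root_dir="./data",                  # hypothetical local path
    task="Task04_Hippocampus",          # one of the MSD tasks
    section="training",                 # seed-controlled train/validation split
    transform=LoadNiftid(keys=["image", "label"]),
    download=True,                      # pull the data if not present
    seed=0,                             # the random-seed feature noted above
)
```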
  • “Read-only” Superset of Data
    • Experiments derive the data they are to run on, which could be a subset of that superset
    • Is the read-only version of that data converted ahead of every experiment, or computed on the fly at run-time?
      • Boundary between super-optimized and super-flexible code
    • Representations that can be regenerated: should they be cached? (If so, consider using checksums to make sure the cache is still valid; see the sketch below)
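A sketch of the caching pattern from the last point (all function and path names hypothetical): the cache entry is keyed by a checksum of the source data, so a changed source automatically invalidates the cached representation:

```python
# Hypothetical sketch: cache a regenerable representation, keyed by a
# checksum of the source file so stale entries are never reused.
import hashlib
from pathlib import Path

def file_checksum(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def cached_representation(source: Path, cache_dir: Path, regenerate) -> bytes:
    """Return the cached representation if still valid, else rebuild it."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = cache_dir / f"{file_checksum(source)}.bin"
    if entry.exists():            # checksum key still matches the source data
        return entry.read_bytes()
    data = regenerate(source)     # the expensive recomputation
    entry.write_bytes(data)
    return data
```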