This folder contains the code for data organization, splitting, preprocessing, and checkpoint converting.
Since the original data formats and folder structures vary greatly across datasets, we need to organize them into a unified structure so that the same functions can be used for data preprocessing. The expected folder structures are as follows:
3D nii data
----dataset_name
--------images
------------xxxx_0000.nii.gz
------------xxxx_0000.nii.gz
--------labels
------------xxxx.nii.gz
------------xxxx.nii.gz
Note: you can also use different suffixes for images and labels. Please change them in the following preprocessing scripts as well.
2D data
----dataset_name
--------images
------------xxxx.jpg/png
------------xxxx.jpg/png
--------labels
------------xxxx.png
------------xxxx.png
video data
----dataset_name
--------images
------------video1
----------------xxxx.png
------------video2
----------------xxxx.png
--------labels
------------video1
----------------xxxx.png
------------video2
----------------xxxx.png
Unfortunately, no single script can handle all the data organization. We organized the datasets manually with commonly used format-conversion functions, including dcm2nii, mhd2nii, nii2nii, nrrd2nii, jpg2png, tif2png, and rle_decode. These functions are available in format_convert.py
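As an illustration, the rle_decode helper could look like the sketch below. This is a minimal version assuming the common Kaggle-style run-length encoding (1-based, column-major pixel indices); the actual implementation in format_convert.py may differ.

```python
import numpy as np

def rle_decode(rle: str, shape: tuple) -> np.ndarray:
    """Decode a run-length-encoded mask string into a binary mask.

    `rle` is a space-separated list of (start, length) pairs with
    1-based, column-major (Fortran-order) pixel indices.
    """
    nums = list(map(int, rle.split()))
    starts, lengths = nums[0::2], nums[1::2]
    mask = np.zeros(shape[0] * shape[1], dtype=np.uint8)
    for start, length in zip(starts, lengths):
        mask[start - 1 : start - 1 + length] = 1
    # column-major reshape matches the encoding convention
    return mask.reshape(shape, order="F")
```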
For common 2D images (e.g., skin cancer dermoscopy, chest X-ray), the data can be directly separated into 80%/10%/10% for training, parameter tuning, and internal validation, respectively. For 3D images (e.g., all the MRI/CT scans) and video data, the split should be done at the case/video level rather than the 2D slice/frame level. For 2D whole-slide images, the split is at the whole-slide level. Since whole-slide images cannot be directly fed to the model because of their high resolution, we divided them into patches with a fixed size of 1024x1024 after data splitting.
After finishing the data organization, the data splitting can be easily done by running
python split.py
Please set the proper data path in the script. The expected folder structures (e.g., 3D images) are
----dataset_name
--------images
------------xxxx_0000.nii.gz
------------xxxx_0000.nii.gz
--------labels
------------xxxx.nii.gz
------------xxxx.nii.gz
--------validation
------------images
----------------xxxx_0000.nii.gz
----------------xxxx_0000.nii.gz
------------labels
----------------xxxx.nii.gz
----------------xxxx.nii.gz
--------testing
------------images
----------------xxxx_0000.nii.gz
----------------xxxx_0000.nii.gz
------------labels
----------------xxxx.nii.gz
----------------xxxx.nii.gz
All the images will be preprocessed as npy files. There are two main reasons for choosing this format. First, it allows fast data loading (the main reason); we learned this point from nnU-Net. Second, NumPy files are a universal data interface that unifies all the different data formats. For the convenience of debugging and inference, we also saved the original images and labels as npz files. Spacing information is also saved for CT and MR images.
The following steps are applied to all images:
- max-min normalization
- resample image size to 1024x1024
- save the pre-processed images and labels as npy files
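These shared steps can be sketched as below. The real scripts may use a different resampling method; nearest-neighbour indexing is used here only to keep the sketch dependency-free.

```python
import numpy as np

def minmax_normalize(img: np.ndarray) -> np.ndarray:
    """Rescale intensities to [0, 255] with max-min normalization."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    if hi > lo:  # guard against constant images (division by zero)
        return (img - lo) / (hi - lo) * 255.0
    return np.zeros_like(img)

def resize_nearest(img: np.ndarray, size: int = 1024) -> np.ndarray:
    """Nearest-neighbour resize of a 2D slice to size x size."""
    rows = (np.arange(size) * img.shape[0] / size).astype(int)
    cols = (np.arange(size) * img.shape[1] / size).astype(int)
    return img[rows][:, cols]

# preprocess one slice and save it as an npy file, e.g.:
# np.save("case_slice.npy", resize_nearest(minmax_normalize(slice2d)))
```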
Different modalities also have their own additional pre-process steps based on the data features.
For CT images, we first adjust the window level and width following common practice.
- Soft tissue window level (40) and width (400)
- Chest window level (-600) and width (1500)
- Brain window level (40) and width (80)
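Windowing clips intensities to [level - width/2, level + width/2] and rescales them; a minimal sketch is shown below (pre_CT_MR.py may implement this differently).

```python
import numpy as np

def apply_ct_window(img: np.ndarray, level: float, width: float) -> np.ndarray:
    """Clip CT intensities to the window [level - width/2, level + width/2]
    and rescale the result to [0, 255]."""
    lo, hi = level - width / 2, level + width / 2
    img = np.clip(img.astype(np.float32), lo, hi)
    return (img - lo) / (hi - lo) * 255.0

# soft-tissue window (level 40, width 400) clips to [-160, 240] HU
```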
For MR, ultrasound, mammography, and Optical Coherence Tomography (OCT) images, we apply an intensity cut-off at the 0.5 and 99.5 percentiles of the foreground voxels. Regarding RGB images (e.g., endoscopy, dermoscopy, fundus, and pathology images), if they are already within the expected intensity range of [0, 255], their intensities remain unchanged. However, if they fall outside this range, max-min normalization is applied to rescale the intensity values to [0, 255].
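The percentile cut-off can be sketched as below. Foreground is assumed here to be the non-zero voxels; the actual script may define it differently.

```python
import numpy as np

def percentile_clip(img: np.ndarray, lower: float = 0.5,
                    upper: float = 99.5) -> np.ndarray:
    """Clip intensities to the [lower, upper] percentile range of the
    foreground (non-zero) voxels, then rescale to [0, 255]."""
    fg = img[img > 0]
    lo, hi = np.percentile(fg, lower), np.percentile(fg, upper)
    img = np.clip(img.astype(np.float32), lo, hi)
    return (img - lo) / (hi - lo) * 255.0
```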
Preprocess for CT/MR images:
python pre_CT_MR.py
Preprocess for grey and RGB images:
python pre_grey_rgb.py
Note: Please set the corresponding folder path and modality information. We provided an example in the script.
Ensembling different training datasets is very simple. Since all the training data are converted into npy files during preprocessing, you just need to merge them into one folder.
If the model is trained with multiple GPUs, please use the script ckpt_convert.py to convert the checkpoint format, since users typically use only one GPU for model inference in practice.
Set the paths sam_ckpt_path, medsam_ckpt_path, and save_path, and run
python ckpt_convert.py
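Multi-GPU training with torch.nn.DataParallel or DistributedDataParallel prefixes every parameter name with "module.", which a single-GPU model cannot load directly. The sketch below shows that key-stripping step; what ckpt_convert.py actually does may differ.

```python
def strip_module_prefix(state_dict: dict) -> dict:
    """Remove the 'module.' prefix that DataParallel /
    DistributedDataParallel adds to parameter names, so the
    checkpoint can be loaded on a single GPU."""
    return {
        (k[len("module."):] if k.startswith("module.") else k): v
        for k, v in state_dict.items()
    }

# typical PyTorch usage (not executed here):
# ckpt = torch.load(medsam_ckpt_path, map_location="cpu")
# torch.save(strip_module_prefix(ckpt), save_path)
```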