CLI Manual
Usage: qctool csv <options> <csv file> <schema json>
This command produces a validation report for <csv file>.
The report file is stored in the same folder where <csv file> is located.
The <schema json> file MUST be compliant with the frictionless data
table-schema specs (https://specs.frictionlessdata.io/table-schema/) or
with the Data Catalogue json format.
Options:
--clean                 Flag for performing data cleaning. The cleaned file
                        will be saved in the report folder.
-m, --metadata [dc|qc] Select "dc" for Data Catalogue spec json
or "qc" for frictionless spec json.
-r, --report [xls|pdf] Select the report file format.
-o, --outlier FLOAT outlier threshold in standard deviations.
--help                  Show this message and exit.

The outlier option relates to outlier detection for the numerical variables of the incoming dataset. For a given numerical variable, the Data Quality Control tool first calculates the mean and the standard deviation from the valid values of that column, and then derives the upper and lower limits by the formula:

upper_limit = mean + outlier_threshold * standard_deviation
lower_limit = mean - outlier_threshold * standard_deviation

Any value outside those limits is considered an outlier.
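The outlier rule above can be sketched in a few lines of Python. This is an illustration of the formula, not the tool's actual implementation; in particular, whether the tool uses the population or the sample standard deviation is not documented here, so that choice is an assumption.

```python
import statistics

def outlier_limits(values, threshold):
    """Upper/lower limits as described above:
    mean +/- threshold * standard deviation of the valid values."""
    mean = statistics.mean(values)
    # Assumption: population standard deviation; the manual does not
    # specify which flavor the tool uses.
    std = statistics.pstdev(values)
    return mean - threshold * std, mean + threshold * std

values = [10, 12, 11, 13, 12, 50]
low, high = outlier_limits(values, 2.0)  # e.g. run with -o 2.0
outliers = [v for v in values if v < low or v > high]  # 50 falls outside the limits
```

With a tighter threshold more values are flagged; with a looser one, fewer. The `-o` option sets this threshold in standard deviations.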
The report file will be saved in the folder where the incoming dataset file is located.
After reviewing the Data Validation report created in the previous step (please refer to the Data Validation Report wiki section for further details), we can proceed with the data cleaning operation. The cleaned dataset file will be saved in the same folder as the incoming dataset, using the original dataset name with the suffix '_corrected' appended.
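The naming convention for the cleaned file can be sketched as follows; the exact logic is an assumption beyond the stated '_corrected' suffix.

```python
from pathlib import Path

def corrected_path(csv_path):
    """Derive the cleaned-file name: original name plus the
    '_corrected' suffix, in the same folder as the incoming dataset."""
    p = Path(csv_path)
    return p.with_name(p.stem + "_corrected" + p.suffix)

# e.g. data/dataset.csv -> data/dataset_corrected.csv
cleaned = corrected_path("data/dataset.csv")
```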
Usage: qctool infercsv <options> <csv file>
This command infers the schema of the <csv file> and stores it in <output
file>.
The <output file> is either a json file following the frictionless data
specs (https://specs.frictionlessdata.io/table-schema/) or an xlsx file
following the MIP Data Catalogue's format.
Options:
--max_levels INTEGER    Maximum number of unique values a text variable
                        may have to be inferred as nominal
                        [default: 10]
--sample_rows INTEGER   Number of rows to be used as a sample for
                        inferring the dataset metadata (schema)
                        [default: 100]
--schema_spec [dc|qc] Select "dc" for Data Catalogue spec xlsx file
or "qc" for frictionless spec json.
--cde_file PATH CDE dictionary Excel file (xlsx)
-t, --threshold FLOAT RANGE CDE similarity threshold.
--help                  Show this message and exit.

If we choose the Data Catalogue's Excel file as output, the tool offers the option of suggesting CDE variables for each column of the incoming dataset. This option is available only when a CDE dictionary is provided. This dictionary is an Excel file that contains information on all the CDE variables that are included, or will be included, in the MIP (this dictionary will be available in the Data Catalogue in the near future).
The tool calculates a similarity measure for each column, based on column name similarity (weighted 80%) and value range similarity (weighted 20%). The similarity measure takes values between 0 and 1. With the similarity threshold option we can define the minimum similarity measure between an incoming column and a CDE variable that must be met for the tool to suggest that CDE variable as a possible correspondence. The tool stores those CDE suggestions in the Excel file in the column named CDE, and stores the corresponding concept path in the column conceptPath.
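The 80/20 weighting can be sketched like this. The string metric used here (`difflib.SequenceMatcher`) is an assumption for illustration; the manual does not specify how the tool computes name or value-range similarity, only the weights and the [0, 1] range.

```python
from difflib import SequenceMatcher

def cde_similarity(column_name, cde_name, value_range_sim):
    """Weighted similarity in [0, 1]: 80% column-name similarity,
    20% value-range similarity. The name metric below is illustrative,
    not the tool's documented algorithm."""
    name_sim = SequenceMatcher(None, column_name.lower(), cde_name.lower()).ratio()
    return 0.8 * name_sim + 0.2 * value_range_sim

score = cde_similarity("gender", "gender", 1.0)  # identical name, matching range
suggest = score >= 0.7  # -t / --threshold: suggest the CDE only above this value
```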
The schema can be saved in two formats:
- Frictionless spec json
- Data Catalogue's spec Excel (xlsx) file, which can be used for creating a new CDE pathology version.
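The `--max_levels` rule for text columns can be sketched as below. Whether the boundary is inclusive is an assumption; the help text ("below that") suggests a strict comparison, which is what this sketch uses.

```python
def infer_text_type(sample_values, max_levels=10):
    """Sketch of the --max_levels rule: a text column whose sample has
    fewer than max_levels unique values is inferred as nominal,
    otherwise as plain text. (Strict vs inclusive boundary is an
    assumption, not documented.)"""
    return "nominal" if len(set(sample_values)) < max_levels else "text"

# A column sampled as ["m", "f", "m", "f"] has 2 unique values -> nominal
kind = infer_text_type(["m", "f", "m", "f"])
```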
The metadata validation is performed following the HBP MIP's minimum metadata requirements for MRIs, which can be found here.
Usage: qctool dicom <options> <dicom folder> <report folder>
This command produces a validation report for MRIs in <dicom folder>.
All MRI dcm files belonging to the same Patient MUST be in the same
subfolder in <dicom folder>.
The validation report files are stored in <report folder>.
Options:
--loris_folder <loris input folder>
LORIS input folder where the dcm files in
<dicom folder> will be reorganized
--help                  Show this message and exit.

<dicom folder> is the root folder where all the DICOM files are stored. It is assumed that each subfolder corresponds to one patient.
<report folder> is the folder where the report files will be placed. If the folder does not exist, the tool will create it.
<loris input folder> is the folder path where the dcm files are reorganized for the LORIS pipeline.
For the LORIS pipeline the dcm files are reorganized and stored in a folder structure <loris_folder>/<patientid>/<patientid_visitcount>.
All the dcm sequence files that belong to the same scanning session (visit) are stored in the common folder <patientid_visitcount>.
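The target layout described above can be sketched as a path builder. The '_' separator between patient id and visit count is an assumption read off the <patientid_visitcount> placeholder.

```python
from pathlib import Path

def loris_session_dir(loris_folder, patient_id, visit_count):
    """Build the per-session directory following the structure
    <loris_folder>/<patientid>/<patientid_visitcount>.
    Assumption: '_' joins patient id and visit count."""
    return Path(loris_folder) / patient_id / f"{patient_id}_{visit_count}"

# All dcm files of patient P001's first visit would go here
session = loris_session_dir("loris_input", "P001", 1)
```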
The tool creates in the <report folder> the pdf report file dicom_report.pdf and, depending on the results, also creates the following csv files:
- validsequences.csv
- invalidsequences.csv
- invaliddicoms.csv
- notprocessed.csv
- mri_visits.csv
The above files are created even if no valid/invalid sequences or dicoms have been found; in that case, the files will be empty. A detailed description of the content of these files can be found in the Report Files - Description and Details wiki section.