
CLI Manual

Iosif Spartalis edited this page Sep 30, 2021 · 17 revisions

Profiling and Validating a CSV dataset:

Usage: qctool csv <options> <csv file> <schema json>

  This command produces a validation report for <csv file>.

  The report file is stored in the same folder where <csv file> is located.

  <schema json> file MUST be compliant with the frictionless data table-schema
  specs (https://specs.frictionlessdata.io/table-schema/) or with the
  Data Catalogue json format.

Options:
  --clean                 Flag for performing data cleaning. The cleaned file will
                          be saved in the report folder.

  -m, --metadata [dc|qc]  Select "dc" for Data Catalogue spec json
                          or "qc" for frictionless spec json.

  -r, --report [xls|pdf]  Select the report file format.
  -o, --outlier FLOAT     Outlier threshold in standard deviations.
  --help                  Show this message and exit.

Options further explanation

-o, --outlier, Outlier Threshold

This option controls outlier detection for the numerical variables of the incoming dataset. For each numerical variable, the Data Quality Control tool first calculates the mean and the standard deviation from the valid values of that column, and then calculates the upper and lower limits with the formula:

upper_limit = mean + outlier_threshold * standard_deviation
lower_limit = mean - outlier_threshold * standard_deviation

Any value outside those limits is considered an outlier.
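The limit calculation can be illustrated with a minimal Python sketch. The function names `outlier_limits` and `find_outliers` are hypothetical and not part of qctool, and the sketch assumes the sample standard deviation; the tool itself may use the population form.

```python
from statistics import mean, stdev


def outlier_limits(values, threshold):
    """Return (lower_limit, upper_limit) for the valid values of one column."""
    mu = mean(values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigma = stdev(values)
    return mu - threshold * sigma, mu + threshold * sigma


def find_outliers(values, threshold):
    """Return the values that fall outside the computed limits."""
    lower, upper = outlier_limits(values, threshold)
    return [v for v in values if v < lower or v > upper]


if __name__ == "__main__":
    column = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
    print(outlier_limits(column, 2.0))
    print(find_outliers(column, 2.0))
```

A lower threshold narrows the limits and flags more values; a higher threshold widens them and flags fewer.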

The report file will be saved in the folder where the incoming dataset file is located.

Data Cleaning

After reviewing the Data Validation report created in the previous step (please refer to the Data Validation Report wiki section for further details), we can proceed with the data cleaning operation. The cleaned dataset file will be saved in the same folder as the incoming dataset, under the original dataset name with the suffix '_corrected' added.
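As a rough illustration of the naming rule, the following Python sketch derives the expected path of the cleaned file. The helper `corrected_path` is hypothetical, and it assumes the suffix is inserted before the file extension, which the text above does not state explicitly.

```python
from pathlib import Path


def corrected_path(dataset_path):
    """Build the path of the cleaned file next to the original dataset."""
    p = Path(dataset_path)
    # Assumption: '_corrected' goes between the file name and its extension.
    return p.with_name(f"{p.stem}_corrected{p.suffix}")


if __name__ == "__main__":
    print(corrected_path("data/patients.csv"))
```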

Inference of a CSV dataset's schema

Usage: qctool infercsv <options> <csv file>

  This command infers the schema of <csv file> and stores it in <output
  file>.

  The <output file> is either a json file following the frictionless data
  specs (https://specs.frictionlessdata.io/table-schema/) or an xlsx file
  following MIP Data Catalogue's format.

Options:
  --max_levels INTEGER         Max number of unique values for a text
                               variable to be inferred as nominal
                               [default: 10]

  --sample_rows INTEGER        Number of rows that are going to be used as sample
                               for inferring the dataset metadata (schema)
                               [default: 100]

  --schema_spec [dc|qc]        Select "dc" for Data Catalogue spec xlsx file
                               or "qc" for frictionless spec json.

  --cde_file PATH              CDE dictionary Excel file (xlsx)
  -t, --threshold FLOAT RANGE  CDE similarity threshold.
  --help                       Show this message and exit.
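The interplay of --max_levels and --sample_rows can be pictured with a small Python sketch. This is not the tool's actual inference code: the function `infer_text_type` is hypothetical, and it assumes that the sample is taken from the first rows, that a column with at most max_levels unique values counts as nominal, and that the fallback type is called "text".

```python
def infer_text_type(values, max_levels=10, sample_rows=100):
    """Classify a text column as 'nominal' or 'text' from a sample of its values."""
    # Assumption: the sample is simply the first `sample_rows` values.
    sample = values[:sample_rows]
    # Ignore missing entries when counting distinct levels.
    levels = {v for v in sample if v not in (None, "")}
    # Assumption: the boundary is inclusive (<= max_levels means nominal).
    return "nominal" if len(levels) <= max_levels else "text"


if __name__ == "__main__":
    print(infer_text_type(["M", "F", "F", "M", ""]))
    print(infer_text_type([f"note {i}" for i in range(50)]))
```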

Options further explanation

--cde_file, --threshold, CDE Dictionary support

If we choose the Data Catalogue's Excel file as the output, the tool can suggest CDE variables for each column of the incoming dataset. This option is available only when a CDE dictionary is provided. The dictionary is an Excel file that contains information for all the CDE variables that are included or will be included in the MIP (this dictionary will be available in the Data Catalogue in the near future).

The tool calculates a similarity measure for each column, based on column name similarity (80%) and value range similarity (20%). The similarity measure takes values between 0 and 1. With the similarity threshold option we can define the minimum similarity between an incoming column and a CDE variable that must be met for the tool to suggest that CDE variable as a possible correspondence. The tool stores those CDE suggestions in the Excel file in the column named CDE, and stores the corresponding concept path in the column conceptPath.
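The 80/20 weighting and the threshold filter can be sketched in Python as follows. Only the weights and the 0 to 1 range come from the description above; everything else is an assumption made for illustration. In particular, the name similarity uses `difflib.SequenceMatcher` and the range similarity uses the overlap of two (min, max) intervals, neither of which is confirmed to be what qctool does, and all function names are hypothetical.

```python
from difflib import SequenceMatcher

NAME_WEIGHT = 0.8   # weight of column name similarity (from the text above)
RANGE_WEIGHT = 0.2  # weight of value range similarity (from the text above)


def name_similarity(a, b):
    """Assumed metric: case-insensitive SequenceMatcher ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def range_similarity(r1, r2):
    """Assumed metric: interval overlap divided by the combined span."""
    lo1, hi1 = r1
    lo2, hi2 = r2
    overlap = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    span = max(hi1, hi2) - min(lo1, lo2)
    return overlap / span if span > 0 else 1.0


def combined_similarity(col_name, col_range, cde_name, cde_range):
    """Weighted sum of the two partial similarities."""
    return (NAME_WEIGHT * name_similarity(col_name, cde_name)
            + RANGE_WEIGHT * range_similarity(col_range, cde_range))


def suggest_cdes(col_name, col_range, cdes, threshold):
    """Return (cde_name, score) pairs with score >= threshold, best first.

    `cdes` maps a CDE variable name to its (min, max) value range.
    """
    scored = [(name, combined_similarity(col_name, col_range, name, rng))
              for name, rng in cdes.items()]
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda s: s[1], reverse=True)


if __name__ == "__main__":
    cdes = {"age": (0, 100), "weight": (0, 300)}
    print(suggest_cdes("Age", (18, 95), cdes, threshold=0.7))
```

Because the name carries most of the weight, a column whose name matches a CDE closely can pass a moderate threshold even when the value ranges differ.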

Supported schema formats

The schema can be saved in two formats:

  1. Frictionless spec json
  2. Data Catalogue's spec Excel (xlsx) file, that can be used for creating a new CDE pathology version.

DICOM MRI metadata validation

The metadata validation is performed following the HBP MIP's minimum metadata requirements for MRIs, which can be found here.

Usage: qctool dicom <options> <dicom folder> <report folder>

  This command produces a validation report for MRIs in <dicom folder>.

  All MRI dcm files belonging to the same Patient MUST be in the same
  subfolder in <dicom folder>.

  The validation report files are stored in <report folder>.

Options:
  --loris_folder <loris input folder>
                                  LORIS input folder where the dcm files in
                                  <dicom folder> will be reorganized
  --help                          Show this message and exit.

Options further explanation

<dicom folder>

The root folder where all DICOM files are stored. It is assumed that each subfolder corresponds to one patient.

<report folder>

The folder where the report files will be placed. If the folder does not exist, the tool will create it.

--loris_folder

The folder path where the dcm files are reorganized for the LORIS pipeline.

For the LORIS pipeline, the dcm files are reorganized and stored in the folder structure <loris_folder>/<patientid>/<patientid_visitcount>. All the dcm sequence files that belong to the same scanning session (visit) are stored in the common folder <patientid_visitcount>.
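The folder layout can be sketched with a short Python helper. The function `loris_session_dir` is hypothetical, and it assumes that the session folder name is the patient id and the visit count joined by an underscore, as the placeholder <patientid_visitcount> suggests.

```python
from pathlib import Path


def loris_session_dir(loris_folder, patient_id, visit_count):
    """Build <loris_folder>/<patientid>/<patientid_visitcount> for one visit."""
    # Assumption: session folder is named "<patientid>_<visitcount>".
    session = f"{patient_id}_{visit_count}"
    return Path(loris_folder) / patient_id / session


if __name__ == "__main__":
    print(loris_session_dir("loris_input", "P001", 2))
```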

Output files

In the <report folder>, the tool creates the pdf report file dicom_report.pdf and, depending on the results, the following csv files:

  • validsequences.csv
  • invalidsequences.csv
  • invaliddicoms.csv
  • notprocessed.csv
  • mri_visits.csv

The above files are created even if no valid or invalid sequences or dicom files have been found. In such a case, the files will be empty. A detailed description of the content of these files can be found in the Report Files - Description and Details wiki section.
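After a run, a quick check of the report folder might look like the Python sketch below. The helper `summarize_report_folder` is hypothetical and not part of qctool; it assumes that an "empty" file is a zero-byte file, whereas the tool may instead write a file containing only a header row.

```python
from pathlib import Path

# File names as listed in this section.
REPORT_FILES = [
    "dicom_report.pdf",
    "validsequences.csv",
    "invalidsequences.csv",
    "invaliddicoms.csv",
    "notprocessed.csv",
    "mri_visits.csv",
]


def summarize_report_folder(report_folder):
    """Map each expected report file to 'missing', 'empty' or 'has content'."""
    folder = Path(report_folder)
    summary = {}
    for name in REPORT_FILES:
        path = folder / name
        if not path.exists():
            summary[name] = "missing"
        elif path.stat().st_size == 0:  # assumption: empty means zero bytes
            summary[name] = "empty"
        else:
            summary[name] = "has content"
    return summary


if __name__ == "__main__":
    for name, status in summarize_report_folder("report").items():
        print(f"{name}: {status}")
```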
