GUI Manual

GUI User Guide

The tool has one tab per data type (tabular, DICOM), and using the GUI is straightforward.

Data Validation and Cleaning

tabular tab

Data Validation

First we have to load the dataset csv file.

To validate the data, we need a json file containing the table schema (metadata) of the dataset csv file. This json must follow either a modified frictionless data table-schema specification or MIP's Data-Catalogue schema specification.

For the frictionless option, the modified schema specification of that json file can be found here. In short, it is a simple modification that adds a MIPType property to the field json object. The acceptable values for this property are text, numerical, integer, nominal and date, depending on the value of the original field object's type property.
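As a rough illustration of the modification described above, the sketch below builds one field object with the extra MIPType property and checks it against the list of acceptable values. The column name `age` and the helper `has_valid_miptype` are hypothetical and are not part of the tool; only the MIPType property and its five values come from this guide.

```python
import json

# Acceptable MIPType values, as listed in this guide.
ALLOWED_MIPTYPES = {"text", "numerical", "integer", "nominal", "date"}

# Hypothetical field object: "name" and "type" are standard frictionless
# field properties, "MIPType" is the property added by the modified schema.
field = {
    "name": "age",         # hypothetical column name
    "type": "integer",     # original frictionless field type
    "MIPType": "integer",  # added property
}


def has_valid_miptype(field_obj):
    """Return True if the field carries one of the acceptable MIPType values."""
    return field_obj.get("MIPType") in ALLOWED_MIPTYPES


print(json.dumps(field, indent=2))
print(has_valid_miptype(field))
```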

Also, if the dataset belongs to a certain MIP pathology, there is the option to download the dataset's CDE schema from the MIP Data Catalogue. The user may save the schema file to a local drive as a json file.

For the report, there are two file format options:

pdf
excel (xlsx)

For more details about the content of the above files, please refer to the Report Files Descriptions and Details wiki section.

The outlier threshold input field controls the outlier detection for the numerical variables of the incoming dataset. For a given numerical variable, the Data Quality Control tool first calculates the mean and the standard deviation from the valid values of that column, and then calculates the upper and lower limits with the formulas: upper_limit = mean + outlier threshold * standard deviation, lower_limit = mean - outlier threshold * standard deviation. Any value outside those limits is considered an outlier.
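The limit formulas can be sketched in a few lines of Python. This is a minimal illustration, not the tool's own code: the function names are hypothetical, and the use of the sample standard deviation (`statistics.stdev`) is an assumption, since this guide does not say which estimator the tool uses.

```python
from statistics import mean, stdev


def outlier_limits(valid_values, outlier_threshold):
    """Return (lower_limit, upper_limit) for the valid values of one column."""
    mu = mean(valid_values)
    sigma = stdev(valid_values)  # assumption: sample standard deviation
    lower_limit = mu - outlier_threshold * sigma
    upper_limit = mu + outlier_threshold * sigma
    return lower_limit, upper_limit


def find_outliers(valid_values, outlier_threshold):
    """Return the values that fall outside the lower and upper limits."""
    lower_limit, upper_limit = outlier_limits(valid_values, outlier_threshold)
    return [v for v in valid_values if v < lower_limit or v > upper_limit]


# Made-up sample column with one extreme value.
sample = [10, 11, 9, 10, 12, 10, 11, 9, 10, 100]
print(find_outliers(sample, outlier_threshold=2))
```

A larger threshold widens the limits, so fewer values are reported as outliers.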

Clicking the Create Report button saves the report file in the given output folder. The name of the output file will be:

<dataset>_report.pdf

Data Cleaning

cleaning suggestions window

We can review the Cleaning Suggestions either by clicking the Show cleaning suggestions button or by referring to the Suggested corrections section of the Data Validation report created in the previous step (please refer to the Report Files Descriptions and Details wiki section for further details). We can then proceed with the data cleaning operation by clicking the Perform Cleaning button. The cleaned dataset file will be saved in the report folder under the original dataset name with the suffix '_corrected' added.

For further details about the validation and cleaning procedure of the DQC tool, please refer to the Validation & Cleaning functionality per datatype wiki section.

Dataset Schema Inference

Inference tab

In this tab we can infer a dataset's schema and save it to the local disk. The schema can be saved in two formats:

Frictionless spec json
Data Catalogue's spec Excel (xlsx) file, that can be used for creating a new CDE pathology version.

In the infer options section, we give the number of rows on which the tool will base the schema inference. We also declare the maximum number of categories that a nominal MIPType variable can have.

If we choose the Data Catalogue's excel file as the output, the tool offers the option of suggesting CDE variables for each column of the incoming dataset. This option is available only when a CDE dictionary is provided. This dictionary is an excel file that contains information about all the CDE variables that are included or will be included in the MIP (this dictionary will be available in the Data Catalogue in the near future). The tool calculates a similarity measure for each column based on the column name similarity (80%) and the value range similarity (20%). The similarity measure takes values between 0 and 1. In the similarity threshold field we can define the minimum similarity measure between an incoming column and a CDE variable that must be met for the tool to suggest that CDE variable as a possible correspondence. The tool stores those CDE suggestions in the excel file in the column named CDE, and stores the corresponding concept path in the column named conceptPath.
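The 80% / 20% weighting and the threshold filter can be sketched as follows. This guide does not specify how the tool measures name similarity or value range similarity, so both measures below are stand-in assumptions (a `difflib` string ratio and an interval-overlap ratio), and all function names and sample data are hypothetical.

```python
from difflib import SequenceMatcher

NAME_WEIGHT = 0.8   # column name similarity (80%)
RANGE_WEIGHT = 0.2  # value range similarity (20%)


def name_similarity(name_a, name_b):
    """Assumed stand-in: case-insensitive difflib ratio in [0, 1]."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def range_similarity(range_a, range_b):
    """Assumed stand-in: overlap of two (min, max) ranges over their total span."""
    low_a, high_a = range_a
    low_b, high_b = range_b
    overlap = max(0.0, min(high_a, high_b) - max(low_a, low_b))
    span = max(high_a, high_b) - min(low_a, low_b)
    return overlap / span if span > 0 else 1.0


def combined_similarity(col_name, col_range, cde_name, cde_range):
    """Weighted similarity measure between a column and a CDE variable."""
    return (NAME_WEIGHT * name_similarity(col_name, cde_name)
            + RANGE_WEIGHT * range_similarity(col_range, cde_range))


def suggest_cdes(col_name, col_range, cde_dictionary, similarity_threshold):
    """Return (cde_name, score) pairs that meet the threshold, best first."""
    scored = [
        (cde_name, combined_similarity(col_name, col_range, cde_name, cde_range))
        for cde_name, cde_range in cde_dictionary.items()
    ]
    kept = [pair for pair in scored if pair[1] >= similarity_threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Hypothetical CDE dictionary: variable name -> (min, max) value range.
cde_dictionary = {"age": (0, 120), "gender_code": (0, 1)}
print(suggest_cdes("Age", (0, 100), cde_dictionary, similarity_threshold=0.7))
```

Because the name similarity carries most of the weight, a column whose name differs strongly from a CDE name is unlikely to pass a high threshold even when the value ranges match.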

TODO: add an explanation of the "infer only empty strings as NAs" option

Data Mapping

Data mapping tab

If a hospital wants to participate in a MIP Federation for a certain pathology, the hospital's data must be harmonized accordingly. The goal of data mapping is to transform the hospital's local variables into the set of variables of a target CDE pathology.

Step 1 - Target Schema Selection

step 1 image

First, we select the target CDE schema (Pathology). It can be loaded from a json file stored on a local disk or downloaded through the MIP Data Catalogue API, which is the more convenient option.

step 1b image

If we choose the latter, we tick the From DataCatalogue box and click the Get pathologies button to retrieve the metadata of all Pathologies currently stored in the Data Catalogue. Then, we first select a Pathology from the drop-down menu, and then a CDE version for the selected pathology from the drop-down menu next to it. Optionally, we can save the selected CDE schema in a Data Catalogue's spec json file.

Step 2 - Source csv infer options

step 2 image

Currently, the tool does not support loading an existing source csv schema from a file, so the source schema must be inferred first. As in the Dataset Schema Inference tab described above, we give the number of rows on which the tool will base the source csv schema inference. We also declare the maximum number of categories that a nominal MIPType variable can have.

There is also the option to load a CDE Dictionary, as in the Dataset Schema Inference. With a CDE Dictionary, the tool can suggest mappings between the incoming hospital variables and CDE variables. Note: the suggestions may include CDE variables that are not part of the Pathology selected in the previous step. In that case, a warning message will appear in the console output.

Step 3 - Data Mapping Configuration

This step is the longest of the three, and it is where the actual mapping design takes place.

step 3 image

First, we select the Source CSV file by clicking the Select and Create button. If we have loaded a CDE Dictionary in the previous step, the Suggest CDE correspondences button will be activated. If we click that button, the tool will try to find CDE correspondences for the incoming dataset variables, as described in the previous step.

step 3b image

If we have run the CDE suggestions and some CDE correspondences (or mappings) have been found, they will appear in the CDE selection box. In the Source Columns box we can see the source variables of each mapping.

Next to these selection boxes there are three buttons:

Add, for creating a new mapping
Edit, for editing an existing mapping
Remove, for deleting an existing mapping

In the last two cases, to select a mapping we MUST select the CDE name from the CDE select box and NOT from the Source Column select box.

Please note the labels below the buttons:

CDEs mapped is the number of CDEs that have been mapped to one or more source variables.
CDEs not mapped is the number of CDEs that are not mapped.

The new mapping window

create mapping1

In the CDE section we select the CDE variable that we are going to map. Then, in the Source Column section we can select a variable from the source schema and, by pressing the plus button, add it to the expression box.

source columns section

Below, we can see an example expression involving 3 source columns from the source dataset named 'adni'.

expression example

Example of creating 1 to 1 mapping with nominal variables

Let's try to create a new mapping between a nominal source variable and a nominal CDE variable. In our example, the target schema is the Dementia pathology version 5, retrieved from the Data Catalogue.

create mapping1

The CDE variable that we are going to map is the DIAG_etiology_1, so we select it from the CDE drop-down box.

create mapping functions

Then we select the nominal source variable rs190982_g. Please note that the mapping between these two variables has no scientific meaning; we have just picked two nominal variables for the needs of this example.

In the Mipmap Function section we can select a function and, by clicking the plus button, insert the function expression into the expression box.

Because this is a mapping between two nominal variables, we are going to use the Category Replacement Function.

create mapping step1

create mapping step2

create mapping step3

create mapping step4
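Conceptually, a category replacement maps each category of the source variable to a category of the target CDE variable. The sketch below illustrates that idea only. It is not the syntax of the Mipmap function, and the helper name and all category labels are hypothetical placeholders, not the actual categories of rs190982_g or DIAG_etiology_1.

```python
def replace_categories(values, replacements, default=None):
    """Map each source category to its target category.

    Values without a replacement rule become `default`.
    """
    return [replacements.get(value, default) for value in values]


# Hypothetical replacement rules: source category -> target CDE category.
replacements = {"source_a": "target_x", "source_b": "target_y"}

source_column = ["source_a", "source_b", "source_a", "unknown"]
print(replace_categories(source_column, replacements))
```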

Example of editing an existing mapping

edit mapping

edit mapping2

edit mapping3

edit mapping4

Final step - Executing the mapping / Data Transformation

DICOM Tab

image dicom GUI

Data Validation

We select the Dicom Root Folder where all the DICOM files are stored. It is assumed that each patient has a subfolder containing all of that patient's MRI dcm files; note that a patient may have more than one MRI. Then, we select the Output Report Folder where the report files will be placed. If the folder does not exist, the tool will create it. Finally, we press the Create Report button.

In the <report folder>, the tool creates the pdf report file dicom_report.pdf and, depending on the results, also the following csv files:

validsequences.csv
invalidsequences.csv
invaliddicoms.csv
notprocessed.csv
mri_visits.csv

The above files are created even if no valid/invalid sequences or dicom files have been found. In that case, the files will be empty. A detailed description of the content of these files can be found in the Report Files - Description and Details wiki section.

Data Cleaning

If we want to filter out the invalid MRI sequences and reorganize the dcm files of the valid MRIs into a folder structure suitable for importing them into LORIS-for-MIP, we repeat the previous step and select the Reorganize files for Loris pipeline check button.

For the LORIS pipeline, the dcm files are reorganized and stored in the folder structure <output_report_folder>/<patientid>/<patientid_visitcount>. All the dcm sequence files that belong to the same scanning session (visit) are stored in the common folder <patientid_visitcount>.
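The target layout can be sketched with `pathlib`. This is an illustration under assumptions: the function name is hypothetical, and the session folder name is assumed to be the patient id and the visit count joined by an underscore, as the placeholder <patientid_visitcount> suggests.

```python
from pathlib import Path


def loris_session_folder(output_report_folder, patient_id, visit_count):
    """Build <output_report_folder>/<patientid>/<patientid_visitcount>."""
    patient = str(patient_id)
    # Assumption: the session folder is "<patientid>_<visitcount>".
    session = f"{patient}_{visit_count}"
    return Path(output_report_folder) / patient / session


# Hypothetical patient id and visit counter.
print(loris_session_folder("report", "P001", 1))
```

All dcm files from the same visit would then be placed under the single folder returned for that patient and visit.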
