
GUI User Guide

The tool has a tab for each data type (tabular, DICOM), and using the GUI is a straightforward procedure.

Data Validation and Cleaning

tabular tab

Data Validation

First we have to load the dataset csv file.

If we need to validate the data, we need a json file containing the table schema (metadata) of the dataset csv file. This json must follow either a modified Frictionless Data table-schema specification or MIP's Data Catalogue specification schema.

Regarding the frictionless schema, the modified specification of that json file can be found here. In short, it is a simple modification that adds a MIPType property to each field json object. The acceptable values for this property are text, numerical, integer, nominal and date, depending on the value of the original field object's type property.
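As an illustration, the sketch below loads such a schema and prints the MIPType of every field. The file name and the example field shown in the comment are hypothetical; only the MIPType addition itself comes from the modified specification described above.

    import json

    # Load the table schema (hypothetical file name).
    with open("dataset_schema.json") as f:
        schema = json.load(f)

    # Each entry in "fields" is a frictionless field object extended with MIPType, e.g.:
    # {"name": "gender", "type": "string", "MIPType": "nominal",
    #  "constraints": {"enum": ["M", "F"]}}
    for field in schema["fields"]:
        print(field["name"], field["type"], field["MIPType"])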

Also, if the dataset belongs to a certain MIP pathology, there is the option to download the dataset's CDE schema from the MIP Data Catalogue. The user may save the schema file to a local drive as a json file.

For the report, there are two file format options:

  • pdf
  • excel (xlsx)

For more details about the content of the above files, please refer to the Report Files Descriptions and Details wiki section.

The outlier threshold input field relates to outlier detection for the numerical variables of the incoming dataset. For a given numerical variable, the Data Quality Control tool first calculates the mean and the standard deviation based on the valid values of that column, and then calculates the upper and lower limits with the formulas upper_limit = mean + outlier threshold * standard deviation and lower_limit = mean - outlier threshold * standard deviation. Any value outside those limits is considered an outlier.
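A minimal Python sketch of this rule follows; the column values and the threshold are made up, and the tool's internal implementation may differ.

    import statistics

    values = [2.1, 2.2, 2.3, 2.2, 2.1, 2.4, 2.3, 2.2, 9.8]  # valid values of one column (made up)
    outlier_threshold = 2.0                                 # value typed into the outlier threshold field

    mean = statistics.mean(values)
    std = statistics.stdev(values)
    upper_limit = mean + outlier_threshold * std
    lower_limit = mean - outlier_threshold * std

    # Values outside [lower_limit, upper_limit] are reported as outliers.
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    print(outliers)  # [9.8]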

By clicking the Create Report button, the report file will be saved in the given output folder. The name of the output file will be:

  • <dataset>_report.pdf

Data Cleaning

cleaning suggestions window

The DQC tool makes cleaning suggestions for invalid values. Those suggestions can be edited or deleted by clicking the Show cleaning suggestions button. After reviewing the cleaning suggestions, either with the previously mentioned method or by referring to the Suggested corrections section of the Data Validation report created in the previous step (please refer to the Report Files Descriptions and Details wiki section for further details), we can proceed with the data cleaning operation by clicking the Perform Cleaning button. The cleaned dataset file will be saved in the report folder using the original dataset name with the suffix '_corrected'.

For further details about the validation and cleaning procedure of the DQC tool, please refer to the Validation & Cleaning functionality per datatype wiki section.

Dataset Schema Inference

Inference tab

In this tab we can infer a dataset's schema and save it to the local disk. The schema can be saved in two formats:

  1. Frictionless spec json
  2. Data Catalogue's spec Excel (xlsx) file, which can be used for creating a new CDE pathology version.

In the infer options section, we give the number of rows on which the tool will base the schema inference. We also declare the maximum number of categories that a nominal MIPType variable can have.
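Purely as an illustration of what these two parameters control, one plausible reading is sketched below; the tool's actual inference heuristic is not documented here, so treat every detail as an assumption.

    # Assumed illustration: sample the first N rows and treat a column as nominal
    # when it has at most max_categories distinct values.
    rows_to_scan = 1000      # "number of rows" given in the GUI (made up)
    max_categories = 10      # "maximum number of categories" (made up)

    sampled_values = ["M", "F", "M", "F"]   # values of one column within the sampled rows
    distinct = set(sampled_values)
    if len(distinct) <= max_categories:
        print("infer MIPType = nominal with enumerations:", sorted(distinct))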

If we choose the Data Catalogue's excel as an output, the tool offers the option of suggesting CDE variables for each column of the incoming dataset. This option is available only when a CDE dictionary is provided. This dictionary is an excel file that contains information about all the CDE variables that are included, or will be included, in the MIP (this dictionary will be available in the Data Catalogue in the near future). The tool calculates a similarity measure for each column based on column name similarity (80%) and value range similarity (20%). The similarity measure takes values between 0 and 1. In the similarity threshold field we define the minimum similarity measure between an incoming column and a CDE variable that must be met in order for the tool to suggest that CDE variable as a possible correspondence. The tool stores those CDE suggestions in the excel file in the column named CDE and stores the corresponding concept path in the column conceptPath.
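A rough sketch of how such a weighted score could be combined and thresholded is shown below; the two similarity inputs are placeholders, since the tool's actual name and value-range metrics are not documented here.

    def suggestion_score(name_similarity: float, range_similarity: float) -> float:
        # Weighted combination described above: 80% column name, 20% value range.
        return 0.8 * name_similarity + 0.2 * range_similarity

    similarity_threshold = 0.6            # value entered in the similarity threshold field (made up)
    score = suggestion_score(0.7, 0.4)    # placeholder similarities for one column/CDE pair
    if score >= similarity_threshold:
        print("suggest this CDE as a possible correspondence, score =", round(score, 2))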

NOTE The DQC tool handles the below strings as null (NAs):

  • '' (empty string)
  • #N/A
  • #N/A N/A
  • #NA
  • -1.#IND
  • -1.#QNAN
  • -NaN
  • -na
  • 1.#IND
  • 1.#QNAN
  • N/A
  • NA
  • NULL
  • NaN
  • n/a
  • nan
  • null

If we want the DQC tool to handle those values as normal strings, we select the 'Only infer empty strings as NAs' box. In that case only '' (the empty string) will be handled as a null value.
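For reference, this behaviour corresponds roughly to how a csv can be read with pandas; this is a sketch under that assumption, not the tool's actual code, and the file name is hypothetical.

    import pandas as pd

    # Default behaviour: the strings listed above are parsed as missing values.
    df_default = pd.read_csv("dataset.csv")

    # 'Only infer empty strings as NAs': keep those strings as ordinary text
    # and treat only the empty string as missing.
    df_strict = pd.read_csv("dataset.csv", keep_default_na=False, na_values=[""])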

Data Catalogue's Excel validation

DC validation

In this tab the user can validate a DC excel file before uploading it to the Data Catalogue for creating a new pathology version. Currently, the DQC tool checks whether the nominal variables have invalid enumerations and whether there are duplicate concept paths. Invalid enumerations are all the code values (in the DC excel's values column the enumerations are stored as {"code", "label"} pairs separated by commas) that start with a float number (e.g. "1.5T", "2,4") or that are keywords of the SQL language listed in the table below; a rough sketch of this check is given after the table. For example, {"1.5T", "One and a half tonne"} is invalid, but {"one_and_half_tonne", "1.5T"} is valid.

Also, if the DC excel validation goes well, the user can save the validated data in a Data Catalogue json format via the DC-excel 2 DC-json Conversion menu.

SQL keywords:
"ADD" "ALL" "ALTER" "AND" "ANY"
"AS" "ASC" "BACKUP" "BETWEEN" "BY"
"CASE" "CHECK" "COLUMN" "CONSTRAINT" "CREATE"
"DATABASE" "DEFAULT" "DELETE" "DESC" "DISTINCT"
"DROP" "EXEC" "EXISTS" "FOREIGN" "FROM"
"FULL" "JOIN" "GROUP" "HAVING" "IN"
"INDEX" "INNER" "INSERT" "INT" "INTO"
"IS" "LEFT" "LIKE" "LIMIT" "NOT"
"NULL" "OR" "ORDER" "OUTER" "PRIMARY"
"KEY" "PROCEDURE" "REPLACE" "RIGHT" "ROWNUM"
"SELECT" "TOP" "SET" "TABLE" "TRUNCATE"
"UNION" "ALL" "UNIQUE" "UPDATE" "VALUES"
"VIEW" "WHERE" "ASCII" "CHAR_LENGTH" "CHARACTER_LENGTH"
"CONCAT" "CONCAT_WS" "FIELD" "FIND_IN_SET" "FORMAT"
"INSTR" "LCASE" "LENGTH" "LOCATE" "LOWER"
"LPAD" "LTRIM" "MID" "POSITION" "REPEAT"
"REVERSE" "RPAD" "RTRIM" "SPACE" "SUBSTR"
"SUBSTRING" "SUBSTRING_INDEX" "TRIM" "UCASE" "UPPER"
"ABS" "ACOS" "ASIN" "ATAN" "ATAN2"
"AVG" "CEIL" "CEILING" "COS" "COT"
"COUNT" "DEGREES" "DIV" "EXP" "FLOOR"
"GREATEST" "LEAST" "LN" "LOG" "LOG10"
"LOG2" "MAX" "MIN" "MOD" "PI"
"POW" "POWER" "RADIANS" "RAND" "ROUND"
"SIGN" "SIN" "SQRT" "SUM" "TAN"
"TRUNCATE" "ADDDATE" "ADDTIME" "CURDATE" "CURRENT_DATE"
"CURRENT_TIME" "CURRENT_TIMESTAMP" "CURTIME" "DATE" "DATEDIFF"
"DATE_ADD" "DATE_FORMAT" "DATE_SUB" "DAY" "DAYNAME"
"DAYOFMONTH" "DAYOFWEEK" "DAYOFYEAR" "FROM_DAYS" "HOUR"
"LAST_DAY" "LOCALTIME" "LOCALTIMESTAMP" "MAKEDATE" "MAKETIME"
"MICROSECOND" "MINUTE" "MONTH" "MONTHNAME" "NOW"
"PERIOD_ADD" "PERIOD_DIFF" "QUARTER" "SECOND"
"SEC_TO_TIME" "STR_TO_DATE" "SUBDATE" "SUBTIME" "SYSDATE"
"TIME" "TIME_FORMAT" "TIME_TO_SEC" "TIMEDIFF" "TIMESTAMP"
"TO_DAYS" "WEEK" "WEEKDAY" "WEEKOFYEAR" "YEAR"
"YEARWEEK" "BIN" "BINARY" "CASE" "CAST"
"COALESCE" "CONNECTION_ID" "CONV" "CONVERT" "CURRENT_USER"
"IF" "IFNULL" "ISNULL" "LAST_INSERT_ID" "NULLIF"
"SESSION_USER" "SYSTEM_USER" "USER" "VERSION" /

Data Mapping

Data mapping tab

If a hospital wants to participate in a MIP Federation for a certain pathology, the hospital's data must be harmonized accordingly. The goal of data mapping is to transform the hospital's local variables into the set of variables of a target CDE pathology.

Step 1 - Target Schema Selection

step 1 image

First, we select the target CDE schema (Pathology). This can be loaded from a json file stored on a local disk or downloaded from the MIP Data Catalogue API, which is the more convenient option.

step 1b image

If we choose the latter, we select the From DataCatalogue tick box and then click the Get pathologies button to retrieve the metadata of all pathologies currently stored in the Data Catalogue. Then, we select a Pathology from the drop-down menu, and a CDE version for the selected pathology from the drop-down menu next to it. Optionally, we can save the selected CDE schema in a Data Catalogue's spec json file.

Step 2 - Source csv infer options

step 2 image

Currently, the tool does not support loading an existing source csv schema from a file, so the source schema must be inferred first. As in the Dataset Schema Inference tab mentioned above, here we give the number of rows on which the tool will base the source csv schema inference. We also declare the maximum number of categories that a nominal MIPType variable can have.

There is also the option to load a CDE Dictionary, as in the Dataset Schema Inference tab. Here, with the use of a CDE Dictionary, the tool can suggest mappings between the incoming hospital variables and CDE variables. Note: among the suggestions there may be CDE variables that are not included in the Pathology selected in the previous step; in that case a warning message will appear in the console output.

Step 3 - Data Mapping Configuration

This step is the longest of the three, and this is where the actual mapping design occurs.

step 3 image

First, we select the Source CSV file by clicking the Select and Create button. If we have loaded a CDE Dictionary in the previous step, the Suggest CDE correspondences button will be activated. If we click that button, the tool will try to find CDE correspondences for the incoming dataset variables, as described in the previous step.

step 3b image

If we have run the CDE suggestions and some CDE correspondences (or mappings) have been found, they will appear in the CDE selection box. In the Source Columns box we can see the source variables for each mapping.

Next to these selection boxes there are three buttons:

  • Add, for creating a new mapping
  • Edit, for editing an existing mapping
  • Remove, for deleting an existing mapping

In the last two cases, to select a mapping we MUST select the CDE name from the CDE select box and NOT from the Source Column select box.

NOTE Due to GUI limitations, scrolling the mapping boxes with the mouse wheel is not synchronized. For synchronized scrolling, please use the vertical scroll bar.

Please note the labels below the buttons:

  • CDEs mapped is the number of CDEs that have been mapped to one or more source variables.
  • CDEs not mapped is the number of CDEs that are not mapped.

The new mapping window

create mapping1

In the CDE section, we select the CDE variable that we are going to map. Then, in the Source Column section we can select variables from the source schema, and by pressing the plus button we add the selected variable to the expression box. Please note how the source column names are represented in the expression box below.

source columns section

Below, we can see an example expression involving 3 source columns from a source dataset named 'adni'. The transformation is quite simple: we add the values of the variables leftventaldc, leftaccumensarea and rightpallidum together, divide the sum by 2, and then subtract the value of the variable leftgregyrusrectus.

expression example
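In plain text, the expression shown above corresponds roughly to:

(leftventaldc + leftaccumensarea + rightpallidum) / 2 - leftgregyrusrectus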

We can create more complicated expressions by inserting functions supported by MIPMap from the Functions section. By pressing the plus button, the selected function expression is inserted into the expression box with some dummy arguments. We MUST replace those arguments with the ones we want to use, for example a source variable or a number. Caution: each argument MUST be of a data type (str, number, date) compatible with those supported by the selected function. Please refer to the MIPMap supported functions wiki section for further details.

create mapping functions

The Category Replacement Function is a special case of function that is used in a mapping between two nominal variables (variables that have enumerated values). Let's look at the following example to understand how it works.

Example of creating a 1-to-1 mapping with nominal variables

Let's try to create a new mapping between a nominal source variable and a nominal CDE variable. In our example we use the Dementia pathology, version 5, retrieved from the Data Catalogue, as the target schema.

create mapping step1

The CDE variable that we are going to map is DIAG_etiology_1, so we select it from the CDE drop-down box. Then we select the nominal source variable rs190982_g. Please note that the mapping between these two variables does not have any scientific meaning; we have just picked two nominal variables for the needs of this example. In the text boxes below the variable names, we can see the enumerations of each variable. Because we have a mapping between two nominal variables, we are going to use the Category Replacement Function.

create mapping step2

First, we select from the left box the source variable enumeration that we want to replace (the number 1 in our case), then we type the target enumeration in the Replace with box. In our example the target enumeration is Pending_diagnosis. Then we click the -> button. After that we can proceed with the second enumeration (the number 2), which is going to be replaced with the target enumeration Not_applicable. Note: if we want to delete an enumeration replacement, we select it in the right box and then click the <- button.

create mapping step3

After creating all the required enumeration replacements, we click the + button to produce the MIPMap expression. The outcome of the Category Replacement Function consists of multiple 'if' statements. Caution: it is not recommended to change it manually. After that, we press the save button in order to save the current mapping.
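The generated MIPMap expression itself is not reproduced here, but its logic is equivalent to the following Python sketch, using the enumeration codes from the example above; the behaviour for unmapped codes is an assumption.

    def replace_category(source_value: str) -> str:
        # Equivalent of the chained 'if' statements produced by the
        # Category Replacement Function for this example.
        if source_value == "1":
            return "Pending_diagnosis"
        if source_value == "2":
            return "Not_applicable"
        return source_value  # assumption: codes without a replacement pass through unchanged

    print(replace_category("1"))  # Pending_diagnosis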

create mapping step4

Example of editing an existing mapping

Now, let's edit an existing mapping that was created automatically by the Suggest CDE correspondences functionality. We select the rightfofrontaloperculum CDE mapping and click the edit button.

selecting mapping to edit

In the editing window, we observe that the CDE section is greyed out.

edit mapping

At the moment, only one source variable participates in the mapping and its value will be assigned to the target CDE unchanged. Let's assume that the final value for the CDE rightfofrontaloperculum should be the result of two source variables. To do so, we select the rightventaldc variable from the Source Column section without clicking the + button yet.

edit mapping3

Now, let's say that we want the final value of the CDE to be the result of the following formula:

(rightfofrontaloperculum + cos(rightventaldc)) / 2

To do so, we select the cosine function from the MIPMap Functions section and press the + button to add the function expression to the Expression box. Note that the expression is inserted at the current position of the text cursor. After that, we delete the default argument x from the cosine expression and leave the text cursor inside the parentheses. Then, we click the + button in the Source Column section, where we previously selected the rightventaldc variable.

edit mapping2

After making the necessary edits in the Expression box, the final MIPMap expression should look like this:

edit mapping4

We click the save button to save the modified mapping.

Final step - Executing the mapping / Data Transformation

If we are satisfied with the mappings, we can proceed to the actual data transformation. We select the local folder in which we want to save the transformed csv file. NOTE: we must have WRITE rights on the selected folder. Then, we click the Run Mapping Task button to execute the transformation.

There is also the option to save the mappings to an XML file compatible with the MIPMap format, in case the user wants to edit the mappings with the MIPMapGUI application and execute the data transformation there.

NOTE The data transformation is performed in the background by a dockerized MIPMap, a piece of software written in Java.


DICOM Tab

image dicom GUI

Data Validation

We select the Dicom Root Folder where all the DICOM files are stored. It is assumed that for each patient there is a subfolder containing all of that patient's MRI dcm files; note that a patient can have more than one MRI. Then, we select the Output Report Folder where the report files will be placed. If the folder does not exist, the tool will create it. Then, we press the Create Report button.
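For illustration, the expected input layout might look like the following; the patient folder and file names are hypothetical.

    <Dicom Root Folder>/
        patient_001/
            scan_0001.dcm
            scan_0002.dcm
            ...
        patient_002/
            ...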

The tool creates in the <report folder> the pdf report file dicom_report.pdf and, depending on the results, also creates the following csv files:

  • validsequences.csv
  • invalidsequences.csv
  • invaliddicoms.csv
  • notprocessed.csv
  • mri_visits.csv

The above files are created even if no valid/invalid sequences or dicom files have been found; in that case, the files will be empty. A detailed description of the content of these files can be found in the Report Files - Description and Details wiki section.

Data Cleaning

If we want to filter out the invalid MRI sequences and reorganize the dcm files of the valid MRIs into a folder structure suitable for importing them into LORIS-for-MIP, we repeat the previous step and select the Reorganize files for Loris pipeline check button.

For the LORIS pipeline, the dcm files are reorganized and stored in the folder structure <output_report_folder>/<patientid>/<patientid_visitcount>. All the dcm sequence files that belong to the same scanning session (visit) are stored in the common folder <patientid_visitcount>.
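As an illustration of that structure (patient id and visit numbering are hypothetical):

    <output_report_folder>/
        patient_001/
            patient_001_1/    # all dcm files of the first scanning session (visit)
            patient_001_2/    # all dcm files of the second visit
        patient_002/
            patient_002_1/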