Report Files Descriptions and Details
The Data Validation report comes in two major types:
- A report with the dataset's overall statistics and row percentages for data completeness and validation.
- A report for each dataset column, with statistics about its values, validation results and suggestions for correcting invalid data.
If Data Cleaning has been performed, both reports differ slightly from the original ones and contain additional information about the data cleaning operation.
Here are the templates of those two report types, with explanations for each subsection.
This QC Report is created on: the date the data validation was run
Version of Data Quality Control Tool: version
File path: the path of the dataset CSV file
Total number of rows: integer, Total number of columns: integer
Is metadata JSON file with table schema provided? Yes/No
Data Cleaning performed? Yes/No
Number of rows with invalid values: the total number of rows having at least one column with invalid data
Rows Data Completeness Overall Statistics: table with the distribution of the number of rows against the number of filled columns per row
Rows Data Validation Overall Statistics: table with the distribution of the number of rows against the number of columns per row that are filled with valid data
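The completeness distribution described above can be sketched as follows. This is only an illustration, not the tool's actual code: the function name `filled_columns_distribution` and the rule that empty strings and `None` count as unfilled are assumptions. The same counting, applied only to values that pass validation, would give the validation distribution.

```python
from collections import Counter

def filled_columns_distribution(rows, missing=("", None)):
    """Map 'number of filled columns' to 'number of rows with that many filled columns'.

    Hypothetical helper: a value counts as filled when it is not in `missing`.
    """
    counts = Counter(
        sum(1 for value in row if value not in missing) for row in rows
    )
    return dict(sorted(counts.items()))

# Four rows with three columns each; empty strings mark missing values.
rows = [
    ["p1", "34", "M"],
    ["p2", "", "F"],
    ["p3", "", ""],
    ["p4", "51", "M"],
]
distribution = filled_columns_distribution(rows)
print(distribution)
```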
| field | explanation |
|---|---|
| Type | Text/Integer/Numerical/Nominal/Date |
| Total number of rows | column size in rows |
| Number of rows with data | how many filled values the column has |
| Completion percentage | The % ratio of (filled values / column size) |
| Number of rows with constraint violation | how many rows contain a value that violates a value restriction (such as min, max, enum) described in the dataset schema |
| Number of rows with datatype violation | how many rows contain a value whose type differs from the type specified for this column in the dataset's schema |
| Data Cleansing applied? | Yes/No |
The statistics vary depending on the column datatype.
| statistic field | explanation |
|---|---|
| Mode | most frequent value in the column |
| Number of occurrences for the most frequent value (mode) | |
| Minimum value | |
| Maximum value | |
| 25% of records are below this value (limit value of the first quartile) | the middle number between the smallest number and the median of the set of values |
| 50% of records are below this value (median) | the median of the set of values / 50% of the data lies below this point |
| 75% of records are below this value (limit value of the third quartile) | the middle value between the median and the highest value of the set of values |
| statistic field | explanation |
|---|---|
| Mean | the expected value or average |
| Standard deviation | measure of the amount of variation of a set of values |
| Minimum value | |
| Maximum value | |
| 25% of records are below this value (limit value of the first quartile) | the middle number between the smallest number and the median of the set of values |
| 50% of records are below this value (median) | the median of the set of values / 50% of the data lies below this point |
| 75% of records are below this value (limit value of the third quartile) | the middle value between the median and the highest value of the set of values |
| Outlier upper bound | mean + 3 * standard deviation |
| Outlier lower bound | mean - 3 * standard deviation |
| Total number of outliers (outside 3 std.dev) | |
| Rows with outliers | list of [row number, value] |
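
The outlier rows of the table above (bounds at mean ± 3 standard deviations and a list of [row number, value]) can be illustrated with a short sketch. It assumes the sample standard deviation and 1-based row numbers; the report does not state which convention the tool uses, and the name `three_sigma_outliers` is hypothetical.

```python
import statistics

def three_sigma_outliers(values):
    """Return (lower bound, upper bound, outliers) using mean +/- 3 * std.

    Assumptions: sample standard deviation, row numbers starting at 1.
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    lower = mean - 3 * std
    upper = mean + 3 * std
    outliers = [
        [row, value]
        for row, value in enumerate(values, start=1)
        if value < lower or value > upper
    ]
    return lower, upper, outliers

# Twenty ordinary values followed by one extreme value.
values = [10.0] * 20 + [100.0]
lower, upper, outliers = three_sigma_outliers(values)
print(lower, upper, outliers)
```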
| statistic field | explanation |
|---|---|
| Mode | most frequent value in the column |
| Number of occurrences for the most frequent value (mode) | |
| Minimum value | |
| Maximum value | |
| statistic field | explanation |
|---|---|
| Count of unique values (for text variables) | |
| Most frequent value | most frequent value in the column |
| Number of occurrences for most frequent value | |
| 5 most frequent values | |
| 5 least frequent values | |
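
As a rough sketch of how the text-column statistics above could be computed, the example below uses `collections.Counter` from the standard library. The function name `text_column_statistics`, the dictionary keys and the treatment of empty strings and `None` as missing are assumptions made for illustration. When the column has fewer than six distinct values, the two frequency lists overlap.

```python
from collections import Counter

def text_column_statistics(values, top_n=5):
    """Frequency statistics for a text column (hypothetical helper)."""
    counts = Counter(v for v in values if v not in ("", None))
    ranked = counts.most_common()  # sorted from most to least frequent
    return {
        "unique_values": len(counts),
        "most_frequent": ranked[0] if ranked else None,  # (value, occurrences)
        "top": ranked[:top_n],
        "bottom": ranked[-top_n:][::-1],  # least frequent value first
    }

column = ["a", "a", "a", "b", "b", "c", ""]
stats = text_column_statistics(column)
print(stats)
```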
| statistic field | explanation |
|---|---|
| Most frequent value | most frequent value in the column |
| Number of occurrences for most frequent value | |
| List of category values | |
| Number of categories | |
If there are datatype violations and the tool manages to find corrections for some of them, the suggested corrections are presented in a table with the following structure.
| invalid value | proposed correction |
|---|---|
If there are constraint violations and the tool manages to find corrections for some of them, the suggested corrections are presented in a table with the following structure.
| invalid value | proposed correction |
|---|---|
Invalid values for which the tool could not propose a correction will be replaced with null. Those values are presented in this section as a list.
The report has 3 major sections:
- Section with general information about the tool and execution parameters.
- Section with general and validation statistics.
- Section with the list of MRI protocols.
Here is the template of the MRI Sequences Report along with some comments and explanations.
This Report is created on: date the report is created
Version of Data Quality Control Tool: version of the tool
Main folder path where DICOM files are stored: folder path of the DICOM root folder
Total subfolders scanned: number of subfolders that the tool has scanned, reading the .dcm files they contain (one subfolder per patient is assumed)
Total number of patients with valid MRI sequences:
Total number of valid MRI sequences:
Total number of invalid MRI sequences:
Total number of invalid DICOM (.dcm) files:
Here, all the distinct MRI protocols (of all sequences, valid and invalid) are listed in a table. Those protocols are retrieved from the SeriesDescription DICOM tag.
If there are valid sequences, the tool creates this CSV file. It contains all the valid MRI sequences found in the given DICOM folder, with the following headers describing each sequence:
PatientID, StudyID, SeriesNumber, SeriesDescription, SeriesDate
The values of the sequence tags SeriesDescription and SeriesDate are derived from the headers of the DICOM files: the value of a sequence tag is the most frequent value of that tag across the sequence's DICOM files.
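The "most frequent value" rule can be sketched in a few lines. In this illustration the DICOM headers are represented as plain dictionaries rather than being read from .dcm files, and the function name `sequence_tag_value` is hypothetical; how the tool breaks ties between equally frequent values is not stated in this document.

```python
from collections import Counter

def sequence_tag_value(headers, tag):
    """Return the most frequent non-empty value of `tag` across a sequence's headers."""
    values = [header[tag] for header in headers if header.get(tag)]
    if not values:
        return None
    return Counter(values).most_common(1)[0][0]

# Headers of three DICOM files belonging to the same sequence (made-up values).
headers = [
    {"SeriesDescription": "T1_MPRAGE", "SeriesDate": "20190102"},
    {"SeriesDescription": "T1_MPRAGE", "SeriesDate": "20190102"},
    {"SeriesDescription": "t1 mprage", "SeriesDate": ""},
]
description = sequence_tag_value(headers, "SeriesDescription")
series_date = sequence_tag_value(headers, "SeriesDate")
print(description, series_date)
```

Taking the most frequent value keeps a single stray or mistyped header from changing the value reported for the whole sequence.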
If there are invalid sequences, the tool creates this CSV file with the following headers:
PatientID, StudyID, SeriesNumber, Slices, Invalid_dicoms, SeriesDescription, Error1, Error2, Error3, Error4, Error5, Error6
- Slices is the number of DICOM files that the sequence consists of (valid and invalid DICOM files combined).
- Invalid_dicoms is the number of invalid DICOM files in the sequence.
- Error1-Error6 are error descriptions that explain why the sequence is characterized as 'invalid'.
If a DICOM file is missing at least one of the mandatory tags described in the MIP specification found here, it is characterized as 'invalid'. If there are invalid DICOM files in the dataset, the tool creates this CSV file with the following headers:
Folder, File, PatientID, StudyID, SeriesNumber, InstanceNumber, MissingTags
- MissingTags is a list of the missing mandatory DICOM tags.
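A minimal sketch of the missing-tag check follows. The authoritative list of mandatory tags lives in the MIP specification and is not reproduced in this document, so the `MANDATORY_TAGS` tuple below is a hypothetical placeholder; headers are again modelled as plain dictionaries, and `missing_tags` is an illustrative name.

```python
# Hypothetical placeholder; the real list is defined in the MIP specification.
MANDATORY_TAGS = ("PatientID", "StudyID", "SeriesNumber", "InstanceNumber")

def missing_tags(header, mandatory=MANDATORY_TAGS):
    """Return the mandatory tags that are absent or empty in a DICOM header."""
    return [tag for tag in mandatory if header.get(tag) in (None, "")]

header = {"PatientID": "p1", "StudyID": "s1", "InstanceNumber": 7}
missing = missing_tags(header)
is_invalid = bool(missing)  # a file with any missing mandatory tag is 'invalid'
print(missing, is_invalid)
```

The returned list corresponds to what the MissingTags column would hold for that file.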
If the given root folder contains files that the QC tool cannot process (non-DICOM files, corrupted DICOM files, etc.), the tool creates this CSV file with the following headers describing the location of those files:
Folder, File
This file contains MRI visit information for each patient. It is necessary for the HBP MIP DataFactory's Step3_B (no longer in use) and has the following headers:
PATIENT_ID, VISIT_ID, VISIT_DATE