
Report Files Descriptions and Details

Iosif Spartalis edited this page Sep 30, 2021 · 1 revision

Data Validation Report of a CSV file (PDF)

The Data Validation report has 2 major report types:

  1. Report with the dataset's overall statistics and row percentages about data completion and validation.
  2. Report for each dataset column with statistics about the values, validation results and suggestions for correcting the invalid data.

If Data Cleaning has been performed, both reports differ slightly from the original ones and contain additional information about the data cleaning operation.

Here are the templates of those 2 types of reports, with explanations for each subsection.

Dataset's Report

Section with dataset's overall statistics

This QC Report is created on: Date we run the data validation

Version of Data Quality Control Tool: version

File path: the file path location of the dataset csv file

Total number of rows: integer, Total number of columns: integer

Is metadata JSON file with table schema provided? Yes/No

Section with data validation and data completeness information

Data Cleaning performed? Yes/No

Number of rows with invalid values: the total number of rows having at least one column with invalid data

Rows Data Completeness Overall Statistics: table with the distribution of the number of rows against the number of filled columns per row

Rows Data Validation Overall Statistics: table with the distribution of the number of rows against the number of columns per row that are filled with valid data
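
As an illustration of what the two distribution tables summarize, here is a minimal sketch that counts how many rows have a given number of filled columns. It is not the tool's implementation: the function name `filled_columns_distribution`, the rule that an empty string counts as "not filled", and the sample data are all assumptions made for this example.

```python
import csv
import io
from collections import Counter


def is_filled(value):
    # Assumption: a cell is "filled" when it is not empty after trimming whitespace.
    return value is not None and value.strip() != ""


def filled_columns_distribution(rows, predicate=is_filled):
    """Map 'number of filled columns' to 'number of rows with that many filled columns'."""
    return Counter(sum(1 for value in row if predicate(value)) for row in rows)


# Small in-memory CSV standing in for a real dataset file (illustrative data only).
sample = io.StringIO("id,age,sex\n1,34,F\n2,,M\n3,,\n4,51,F\n")
reader = csv.reader(sample)
header = next(reader)
distribution = filled_columns_distribution(reader)

for filled in sorted(distribution, reverse=True):
    print(f"{filled} of {len(header)} columns filled: {distribution[filled]} row(s)")
```

Passing a different `predicate` (for example one that also checks a value against the schema) would give the corresponding distribution for columns filled with valid data.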

Column's Report

Table with basic information about the column

Type: Text/Integer/Numerical/Nominal/Date
Total number of rows: column size in rows
Number of rows with data: how many filled values the column has
Completion percentage: the percentage ratio of filled values to column size
Number of rows with constraint violation: how many rows contain a value that violates a value restriction (such as min, max or enum) specified in the dataset schema
Number of rows with datatype violation: how many rows contain a value whose type differs from the type specified for this particular column in the dataset schema
Data Cleansing applied? Yes/No
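
The difference between a datatype violation and a constraint violation can be sketched as follows for an integer column. This is only an illustration under stated assumptions: the function `summarize_integer_column`, its `minimum` and `maximum` parameters and the sample values are hypothetical and do not reflect how the tool reads the schema.

```python
def summarize_integer_column(values, minimum=None, maximum=None):
    """Count filled values, datatype violations and constraint violations for one column."""
    total = len(values)
    # Assumption: empty strings and None count as missing values.
    filled = [v for v in values if v is not None and v.strip() != ""]
    datatype_violations = 0
    constraint_violations = 0
    for raw in filled:
        try:
            number = int(raw)
        except ValueError:
            # The value cannot be read as the declared type at all.
            datatype_violations += 1
            continue
        too_small = minimum is not None and number < minimum
        too_large = maximum is not None and number > maximum
        if too_small or too_large:
            # Right type, but outside the allowed range.
            constraint_violations += 1
    completion = 100.0 * len(filled) / total if total else 0.0
    return {
        "total_rows": total,
        "rows_with_data": len(filled),
        "completion_percentage": completion,
        "datatype_violations": datatype_violations,
        "constraint_violations": constraint_violations,
    }


# Illustrative column: one missing value, one non-integer, one out-of-range value.
summary = summarize_integer_column(["34", "", "abc", "150", "51"], minimum=0, maximum=120)
print(summary)
```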

Table with statistics about the column values

The statistics vary depending on the column datatype.

Statistic table for Integer Datatype

statistic field: explanation
Mode: most frequent value in the column
Number of occurrences for the most frequent value (mode)
Minimum value
Maximum value
25% of records are below this value (limit value of the first quartile): the middle number between the smallest number and the median of the set of values
50% of records are below this value (median): the median of the set of values; 50% of the data lies below this point
75% of records are below this value (limit value of the third quartile): the middle value between the median and the highest value of the set of values

Statistic table for Numerical Datatype

statistic field: explanation
Mean: the expected value or average
Standard deviation: measure of the amount of variation of a set of values
Minimum value
Maximum value
25% of records are below this value (limit value of the first quartile): the middle number between the smallest number and the median of the set of values
50% of records are below this value (median): the median of the set of values; 50% of the data lies below this point
75% of records are below this value (limit value of the third quartile): the middle value between the median and the highest value of the set of values
Outlier upper bound: mean + 3 * standard deviation
Outlier lower bound: mean - 3 * standard deviation
Total number of outliers (outside 3 standard deviations)
Rows with outliers: list of [row number, value]
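
The outlier bounds above (mean ± 3 * standard deviation) can be illustrated with Python's standard `statistics` module. This is a sketch, not the tool's code: the function `numerical_summary` is hypothetical, the sample standard deviation and 1-based row numbers are assumptions, and quartile values depend on the interpolation method, so the tool's figures may differ slightly.

```python
import statistics


def numerical_summary(values):
    """Summary statistics and 3-sigma outliers for a list of numbers."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation (assumption)
    q1, median, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    upper = mean + 3 * std
    lower = mean - 3 * std
    # Row numbers are 1-based here (assumption).
    outliers = [
        [row, value]
        for row, value in enumerate(values, start=1)
        if value < lower or value > upper
    ]
    return {
        "mean": mean,
        "std": std,
        "min": min(values),
        "max": max(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "upper_bound": upper,
        "lower_bound": lower,
        "outliers": outliers,
    }


# Illustrative data: nineteen identical readings and one extreme value.
values = [10.0] * 19 + [100.0]
summary = numerical_summary(values)
print(summary["lower_bound"], summary["upper_bound"], summary["outliers"])
```

Note that with very small columns a single extreme value inflates the standard deviation enough that it may not fall outside the 3-sigma bounds.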

Statistic table for Date Datatype

statistic field: explanation
Mode: most frequent value in the column
Number of occurrences for the most frequent value (mode)
Minimum value
Maximum value

Statistic table for Text Datatype

statistic field: explanation
Count of unique values (for text variables)
Most frequent value: most frequent value in the column
Number of occurrences for most frequent value
5 most frequent values
5 least frequent values
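
The text statistics above are all frequency counts, which can be sketched with `collections.Counter`. The function `text_summary`, the treatment of empty strings as missing, and the sample values are assumptions for illustration only.

```python
from collections import Counter


def text_summary(values, top=5):
    """Frequency-based statistics for a text column."""
    # Assumption: empty strings and None are missing values and are not counted.
    counts = Counter(v for v in values if v)
    ranked = counts.most_common()  # sorted from most to least frequent
    most_frequent, occurrences = ranked[0] if ranked else (None, 0)
    return {
        "unique_values": len(counts),
        "most_frequent": most_frequent,
        "occurrences": occurrences,
        "most_frequent_values": [value for value, _ in ranked[:top]],
        "least_frequent_values": [value for value, _ in ranked[-top:][::-1]],
    }


# Illustrative column with one missing value.
summary = text_summary(["a", "b", "a", "c", "a", "b", ""])
print(summary)
```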

Statistic table for Nominal Datatype

statistic field: explanation
Most frequent value: most frequent value in the column
Number of occurrences for most frequent value
List of category values
Number of categories

Suggested corrections for datatype violations

If there are datatype violations and the tool succeeds in finding corrections for some of them, the suggested corrections are presented in a table with the following columns:

invalid value, proposed correction

Suggested corrections for constraint violations

If there are constraint violations and the tool succeeds in finding corrections for some of them, the suggested corrections are presented in a table with the following columns:

invalid value, proposed correction

Values that will be replaced with null

Invalid values for which the tool could not propose a correction will be replaced with null. Those values are listed in this section.

DICOM MRI Metadata report files

dicom_report.pdf - DICOM Metadata Validation Report

The report has 3 major sections:

  1. Section with general information about the tool and execution parameters.
  2. Section with general and validation statistics.
  3. Section with the MRI protocols.

Here is the template of the MRI Sequences Report along with some comments and explanations.

General Information Section

This Report is created on: date the report is created

Version of Data Quality Control Tool: version of the tool

Main folder path where DICOM files are stored: folder path of the DICOM root folder

Total subfolders scanned: number of subfolders that the tool has scanned and whose .dcm files it has read. Note that the tool assumes one subfolder per patient.

General and Validation Statistics Section

Total number of patients with valid MRI sequences:

Total number of valid MRI sequences:

Total number of invalid MRI sequences:

Total number of invalid DICOM (.dcm) files:

MRI protocols Section

Here, all the distinct MRI protocols (of all valid and invalid sequences) are listed in a table. Those protocols are retrieved from the SeriesDescription DICOM tag.

validsequences.csv

If there are valid sequences, the tool will create this csv file. The file contains all the valid MRI sequences found in the given DICOM folder, with the following headers describing each sequence:

PatientID, StudyID, SeriesNumber, SeriesDescription, SeriesDate

The values of the sequence tags SeriesDescription and SeriesDate are derived from the headers in the dicom files. More specifically, the value of a sequence tag is the most frequent value of this particular tag found in the sequence's dicom files.
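
The "most frequent value" rule can be sketched in a few lines. The function `sequence_tag_value`, the decision to ignore empty values, and the tag values below are made up for illustration; the tool itself reads these values from the DICOM headers.

```python
from collections import Counter


def sequence_tag_value(values):
    """Return the most frequent non-empty value of one tag across a sequence's files."""
    # Assumption: files where the tag is empty or absent do not take part in the vote.
    counts = Counter(v for v in values if v)
    return counts.most_common(1)[0][0] if counts else None


# Hypothetical per-file values of two tags within a single sequence.
descriptions = ["T1_MPRAGE", "T1_MPRAGE", "t1 mprage", "T1_MPRAGE"]
dates = ["20170312", "20170312", "", "20170312"]

series_description = sequence_tag_value(descriptions)
series_date = sequence_tag_value(dates)
print(series_description, series_date)
```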

invalidsequences.csv

If there are invalid sequences the tool will create this csv file with the following headers:

PatientID, StudyID, SeriesNumber, Slices, Invalid_dicoms, SeriesDescription, Error1, Error2, Error3, Error4, Error5, Error6

  • Slices is the number of dicom files that the current sequence consists of (sum of valid and invalid dicoms).
  • Invalid_dicoms is the number of invalid dicom files in the current sequence.
  • Error1 - Error6 are error descriptions that explain why the sequence is characterized as 'invalid'.

invaliddicoms.csv

If a dicom file is missing at least one of the mandatory tags described in the MIP specification, it will be characterized as 'invalid'. If there are invalid dicoms in the DICOM dataset, the tool will create this csv file with the following headers:

Folder, File, PatientID, StudyID, SeriesNumber, InstanceNumber, MissingTags

  • MissingTags is a list of the missing mandatory DICOM tags.
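
A minimal sketch of the mandatory-tag check follows. The tag list `MANDATORY_TAGS` is a placeholder chosen from the csv headers above and is not the list from the MIP specification; the function `missing_tags`, the dictionary representation of a header, and the rule that an empty value counts as missing are likewise assumptions.

```python
# Placeholder list for illustration only; the real list is defined by the MIP specification.
MANDATORY_TAGS = ("PatientID", "StudyID", "SeriesNumber", "InstanceNumber")


def missing_tags(header, mandatory=MANDATORY_TAGS):
    """Return the mandatory tags that are absent or empty in a DICOM header."""
    # The header is modelled as a plain dict of tag name -> value.
    return [tag for tag in mandatory if header.get(tag) in (None, "")]


# Hypothetical header of one .dcm file, lacking InstanceNumber.
header = {"PatientID": "P001", "StudyID": "1", "SeriesNumber": "3"}
missing = missing_tags(header)
is_invalid = bool(missing)  # a file missing any mandatory tag is 'invalid'
print(is_invalid, missing)
```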

notprocessed.csv

If the given root folder contains files that the QC tool cannot process (non-dicom files, corrupted dicom files etc.), the tool will create this csv file with the following headers describing the location of those files:

Folder, File

mri_visits.csv

This file contains MRI visit information for each patient. It is needed for the HBP MIP DataFactory's Step3_B (no longer in use) and has the following headers:

PATIENT_ID, VISIT_ID, VISIT_DATE
