Data Validation & Cleaning functionality (CSV)

Iosif Spartalis edited this page Sep 30, 2021 · 3 revisions

Validation & Cleaning functionality per datatype

The tool can perform data validation and cleaning on the following datatypes:

  • numerical
  • integer
  • date
  • nominal

We categorize the violations into two types:

  1. Constraint violations
  2. Datatype violations

The tool currently supports the following types of constraint violations:

  1. minimum (for date, integer, numerical)
  2. maximum (for date, integer, numerical)
  3. enum (list of enumerations for nominal datatypes)
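
As a rough illustration of the three constraint types listed above, the following Python sketch checks a single, already-parsed value against optional bounds and an optional enumeration list. The function name `violated_constraint` and its parameters are hypothetical and are not taken from the tool itself.

```python
from datetime import date


def violated_constraint(value, minimum=None, maximum=None, enum=None):
    """Return the name of the first violated constraint, or None if there is none.

    Hypothetical helper: `minimum`/`maximum` apply to date, integer and
    numerical values, `enum` applies to nominal values.
    """
    if minimum is not None and value < minimum:
        return "minimum"
    if maximum is not None and value > maximum:
        return "maximum"
    if enum is not None and value not in enum:
        return "enum"
    return None


# Example calls with made-up constraints
print(violated_constraint(150, minimum=0, maximum=120))
print(violated_constraint(date(1890, 1, 1), minimum=date(1900, 1, 1)))
print(violated_constraint("mael", enum=["male", "female"]))
```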

A datatype violation occurs when a value in a column has a different datatype from the one declared for that column in the dataset's schema JSON file.

The tables below list, for each datatype, the constraint violations and the datatype violations that the tool currently supports, together with the suggested replacement for the data cleaning operation.

Numerical Type column

Constraint Violation

Constraint -> Suggested replacement
Minimum -> Null
Maximum -> Null

Datatype Violation

Wrong datatype -> Suggested replacement
date -> Null
text -> Null
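
The numerical rules above could be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `clean_numerical` is hypothetical, the raw CSV cell is assumed to arrive as a string, and `None` stands in for Null.

```python
import math


def clean_numerical(raw, minimum=None, maximum=None):
    """Suggest a replacement for a raw CSV cell in a numerical column.

    Hypothetical sketch: returns the parsed float when the cell is valid,
    otherwise None (Null).
    """
    try:
        value = float(raw)
    except (TypeError, ValueError):
        # Datatype violation: the cell holds a date or free text.
        return None
    if not math.isfinite(value):
        # Assumption: strings such as "nan" or "inf" are treated as text.
        return None
    if minimum is not None and value < minimum:
        return None  # minimum constraint violation
    if maximum is not None and value > maximum:
        return None  # maximum constraint violation
    return value
```

For example, `clean_numerical("2021-09-30")` and `clean_numerical("150", maximum=120)` both return `None`, while `clean_numerical("3.5")` returns `3.5`.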

Integer Type column

Constraint Violation

Constraint -> Suggested replacement
Minimum -> Null
Maximum -> Null

Datatype Violation

Wrong datatype -> Suggested replacement
numerical(float) -> integer(numerical)
date -> Null
text -> Null
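
A minimal Python sketch of the integer datatype rules above is shown below. The page does not say how a float is converted to an integer, so rounding to the nearest integer is an assumption here, as are the function name `clean_integer` and the use of `None` for Null.

```python
import math


def clean_integer(raw):
    """Suggest a replacement for a raw CSV cell (a string) in an integer column.

    Hypothetical sketch: integers pass through, floats are converted to an
    integer, and dates or text become None (Null).
    """
    try:
        return int(raw)  # already a valid integer literal
    except (TypeError, ValueError):
        pass
    try:
        value = float(raw)
    except (TypeError, ValueError):
        return None  # date or text
    if not math.isfinite(value):
        return None  # "nan" / "inf" are treated as text here
    # Assumption: round to the nearest integer (the tool may truncate instead).
    return round(value)
```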

Date Type column

Constraint Violation

Constraint -> Suggested replacement
Minimum -> Null
Maximum -> Null

Datatype Violation

Wrong datatype -> Suggested replacement
numerical -> Null
integer -> Null
text -> try infer(Date) else Null
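
One way to picture the "try infer(Date) else Null" rule is the Python sketch below. The list of candidate formats is purely an assumption for illustration; the page does not state which formats the tool recognises, and the name `clean_date` is hypothetical.

```python
from datetime import datetime

# Assumed candidate formats, tried in order; not taken from the tool.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%Y/%m/%d")


def clean_date(raw):
    """Suggest a replacement for a raw CSV cell (a string) in a date column.

    Returns a datetime.date when one of the candidate formats matches,
    otherwise None (Null). Plain numbers such as "42" match no format,
    so numerical and integer cells also end up as None.
    """
    text = str(raw).strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None
```

With these assumed formats, `clean_date("30/09/2021")` yields the date 2021-09-30, while `clean_date("42")` and `clean_date("hello")` yield `None`.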

Nominal Type column

Constraint Violation

Constraint -> Suggested replacement
enum -> try spell-correction* else Null

*We calculate the Levenshtein distance between the given misspelled value and each of the enumerations declared in the data schema. We suggest as the corrected value an enumeration whose distance is smaller than or equal to 3.
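
The spell-correction step can be sketched in Python as below. The threshold of 3 comes from the description above. Picking the closest enumeration when several fall within the threshold is an assumption, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic dynamic-programming Levenshtein (edit) distance."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def suggest_enum(value, enumerations, max_distance=3):
    """Return the closest enumeration within max_distance, else None (Null).

    Assumption: when several enumerations qualify, the one with the
    smallest distance is suggested.
    """
    best, best_distance = None, max_distance + 1
    for candidate in enumerations:
        distance = levenshtein(value, candidate)
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best


print(suggest_enum("femael", ["male", "female"]))  # female
print(suggest_enum("xyzxyz", ["male", "female"]))  # None
```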

Datatype Violation

Wrong datatype -> Suggested replacement
numerical -> Null
integer -> Null