Data Validation & Cleaning functionality (CSV)
The tool can validate and clean data of the following datatypes:
- numerical
- integer
- date
- nominal
We categorize the violations into two types:
- Constraint violations
- Datatype violations
The tool currently supports the following types of constraint violations:
- minimum (for date, integer, numerical)
- maximum (for date, integer, numerical)
- enum (list of enumerations for nominal datatypes)
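
The three constraint types listed above can be sketched as a single check. This is a minimal illustration, not the tool's own code: the function name `check_constraints` and its keyword arguments are hypothetical, and the value is assumed to be already parsed into its declared datatype.

```python
from datetime import date


def check_constraints(value, minimum=None, maximum=None, enum=None):
    """Return the names of the constraints that `value` violates.

    Hypothetical helper: a constraint left as None is simply not checked.
    """
    violations = []
    if minimum is not None and value < minimum:
        violations.append("minimum")
    if maximum is not None and value > maximum:
        violations.append("maximum")
    if enum is not None and value not in enum:
        violations.append("enum")
    return violations


print(check_constraints(150, minimum=0, maximum=120))                 # ['maximum']
print(check_constraints(date(1890, 1, 1), minimum=date(1900, 1, 1)))  # ['minimum']
print(check_constraints("femal", enum=["male", "female"]))            # ['enum']
```

Because Python compares `date` objects the same way it compares numbers, one function covers the minimum and maximum checks for date, integer and numerical columns.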
A datatype violation occurs when a value in a column has a different datatype from the one declared for that column in the dataset's schema JSON file.
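
A datatype check of this kind could look roughly like the sketch below. It assumes that every CSV cell arrives as text and that the declared datatype is one of the four names used in this document; the helper names and the default date format are assumptions made for illustration, and the layout of the schema JSON file is not modelled here.

```python
from datetime import datetime


def _parses(raw, parser):
    """Return True if `parser` accepts the text `raw` without a ValueError."""
    try:
        parser(raw)
        return True
    except ValueError:
        return False


def matches_datatype(raw, declared, date_format="%Y-%m-%d"):
    """Check whether the CSV cell text `raw` fits the declared datatype.

    Hypothetical helper; `date_format` is an assumed default.
    """
    if declared == "integer":
        return _parses(raw, int)
    if declared == "numerical":
        return _parses(raw, float)
    if declared == "date":
        return _parses(raw, lambda s: datetime.strptime(s, date_format))
    if declared == "nominal":
        # A nominal cell that reads as a number counts as a wrong datatype.
        return not _parses(raw, float)
    raise ValueError(f"unknown datatype: {declared}")


print(matches_datatype("4.2", "integer"))    # False
print(matches_datatype("4.2", "numerical"))  # True
print(matches_datatype("12", "nominal"))     # False
```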
The tables below list, for each datatype, the constraint violations and the datatype violations that the tool currently supports, together with the suggested data cleaning operation for each.

For numerical columns:

| Constraint violation | Suggested replacement |
|---|---|
| Minimum | -> Null |
| Maximum | -> Null |

| Wrong datatype | Suggested replacement |
|---|---|
| date | -> Null |
| text | -> Null |


For integer columns:

| Constraint violation | Suggested replacement |
|---|---|
| Minimum | -> Null |
| Maximum | -> Null |

| Wrong datatype | Suggested replacement |
|---|---|
| numerical(float) | -> integer(numerical) |
| date | -> Null |
| text | -> Null |


For date columns:

| Constraint violation | Suggested replacement |
|---|---|
| Minimum | -> Null |
| Maximum | -> Null |

| Wrong datatype | Suggested replacement |
|---|---|
| numerical | -> Null |
| integer | -> Null |
| text | try infer(Date) else Null |

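
The "try infer(Date) else Null" suggestion for text found in a date column can be illustrated as follows. The source does not say how the tool infers dates, so this sketch simply tries a short list of candidate formats; the format list and the name `infer_date` are assumptions.

```python
from datetime import datetime

# Assumed candidate formats, tried in order; not taken from the tool itself.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d-%m-%Y", "%Y/%m/%d")


def infer_date(text, formats=CANDIDATE_FORMATS):
    """Return a date parsed from `text`, or None (Null) if no format fits."""
    for fmt in formats:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


print(infer_date("31/01/2020"))  # 2020-01-31
print(infer_date("not a date"))  # None
```

A fixed format list keeps ambiguous strings such as `01/02/2020` predictable, since the first matching format always wins.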

For nominal columns:

| Constraint violation | Suggested replacement |
|---|---|
| enum | try spell-correction* else Null |

*We calculate the Levenshtein distance between the misspelled value and each enumeration declared in the data schema. We suggest as the corrected value the enumeration whose distance is smaller than or equal to 3.

| Wrong datatype | Suggested replacement |
|---|---|
| numerical | -> Null |
| integer | -> Null |
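
The spell-correction suggestion for enum violations can be sketched as below. The distance threshold of 3 comes from the description above; choosing the closest enumeration when several fall within the threshold is an assumption of this sketch, as are the function names.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def suggest_enum(value, enumerations, max_distance=3):
    """Return the closest enumeration within max_distance, else None (Null)."""
    best = min(enumerations, key=lambda e: levenshtein(value, e), default=None)
    if best is not None and levenshtein(value, best) <= max_distance:
        return best
    return None


print(levenshtein("kitten", "sitting"))               # 3
print(suggest_enum("femal", ["male", "female"]))      # female
print(suggest_enum("xyzxyzxyz", ["male", "female"]))  # None
```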