Add automatic table verification based on column types #92
Add automatic table verification based on column types.
Description
Comparing changes made to the OSG jobs ingestor was tedious. This change improves the ability to compare table data when making changes to ETL actions. When comparing data, we can't simply compare the number of columns, column names, data types, and exact values: columns may have been added or removed, data types may have changed, values may or may not be nullable, and there may be rounding errors or MySQL may be using an approximate-value numeric. The following options are now available:
--coalesce-column
Forces columns to be coalesced to a value (default 0), which makes it possible to compare columns that may be null in the source and/or destination tables.
--truncate-column
Truncates decimal values (default 0 places), eliminating rounding errors.
--pct-error-column
Compares the values of two columns by calculating the percent error and ensuring it is less than a threshold (default 0.01).
--autodetect-column-comparison
Attempts to automatically determine the right combination of options based on the source and destination column types and whether or not they are nullable. (1) If either column is nullable, the values are coalesced to a non-null value before comparing. (2) If at least one column is a double/float/decimal and differs from the other column, the columns are truncated before comparison and the percent-difference calculation is added. Often, the value of a double that has been calculated in an aggregate function differs after several digits in the mantissa, or MySQL uses scientific notation to show an approximate-value numeric literal.
For example, the following command will compare federated_osg_baseline.jobfact_by_day to federated_osg_etltest.jobfact_by_day and will only compare columns present in the source table. It will ignore the fact that there are additional columns in the destination table and that some column types were changed from double to decimal(36,4), but will take measures so exact decimal numbers are not needed for the comparison to succeed.
The following query is generated
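Separately from the generated SQL, the autodetection rules described above can be sketched in Python. This is a hypothetical illustration, not the tool's implementation: the names Column, detect_options, and values_match are invented here. It also assumes that "differs from the other column" means the declared types differ, that the 0.01 threshold is a fraction of the source value, and that a pair of values passes if they are equal after truncation or within the percent-error threshold.

```python
import math
from dataclasses import dataclass

# Type prefixes treated as fractional (assumption: matched on the declared type name).
FRACTIONAL_PREFIXES = ("double", "float", "decimal")


@dataclass
class Column:
    """Hypothetical column description: name, declared SQL type, nullability."""
    name: str
    sql_type: str
    nullable: bool


def is_fractional(col):
    return col.sql_type.lower().startswith(FRACTIONAL_PREFIXES)


def detect_options(src, dst):
    """Pick comparison options for one source/destination column pair."""
    opts = set()
    # Rule 1: a nullable column on either side needs coalescing.
    if src.nullable or dst.nullable:
        opts.add("coalesce")
    # Rule 2: a fractional column whose declared type differs from the other
    # side gets truncation plus the percent-error check.
    types_differ = src.sql_type.lower() != dst.sql_type.lower()
    if (is_fractional(src) or is_fractional(dst)) and types_differ:
        opts.update({"truncate", "pct-error"})
    return opts


def truncate(value, places=0):
    """Drop digits beyond `places` decimal places without rounding."""
    factor = 10 ** places
    return math.trunc(value * factor) / factor


def values_match(a, b, opts, coalesce_to=0, places=0, threshold=0.01):
    """Compare one pair of values under the selected options."""
    if "coalesce" in opts:
        a = coalesce_to if a is None else a
        b = coalesce_to if b is None else b
    if a is None or b is None:
        return a is None and b is None
    if "truncate" in opts:
        a, b = truncate(a, places), truncate(b, places)
    if a == b:
        return True
    if "pct-error" in opts:
        if a == 0:
            return False  # b is non-zero here, so the relative error is unbounded
        return abs(a - b) / abs(a) < threshold
    return False


src = Column("cpu_time", "double", True)
dst = Column("cpu_time", "decimal(36,4)", False)
opts = detect_options(src, dst)
print(sorted(opts))
print(values_match(None, 0, opts))
print(values_match(1234.5678, 1234.5699, opts))
```

Under these assumptions, a nullable double compared against a non-nullable decimal(36,4) selects all three options, so a null on one side matches a 0 on the other and values that differ only in the trailing digits still compare as equal.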
Help
Motivation and Context
Life is too short to have to do table comparisons manually.
Tests performed
Compared 3 months of OSG data ingested using the original OsgJobsIngestor class to the new JSON configuration where double columns were changed to decimal(36,4) and there are known rounding differences and differences in null values vs. 0.
Types of changes
Checklist: