TopNotch is a system for quality controlling large-scale data sets. It addresses the following three problems:
- How to define and measure data quality
- How to efficiently ensure data quality across many data sets
- How to institutionalize existing knowledge of data sets
TopNotch uses rules to verify individual components of a data set. Each rule defines and measures some small component of data quality. Together, the rules provide a complete definition of, and metrics for, quality in a data set. The rules can be reused on other data sets to maximize efficiency. Finally, the clear definitions and reusability of these rules allow users to institutionalize knowledge by documenting a data set.
TopNotch has three requirements:
- Installing SBT, version 0.13.9 or greater
- Installing Spark 1.6
- Setting $SPARK_HOME to the top-level folder of the Spark installation.
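The environment-related requirements can be checked before running anything. The following is a minimal Python sketch and is not part of TopNotch; the function name check_prerequisites is hypothetical. It only checks that an sbt executable is on the PATH and that SPARK_HOME points to an existing directory. It does not verify the SBT or Spark versions.

```python
import os
import shutil


def check_prerequisites(env=None, which=shutil.which):
    """Return a list of missing requirements (empty if none were found).

    Hypothetical helper: `env` and `which` are injectable so the check
    can be exercised without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []
    if which("sbt") is None:
        problems.append("sbt not found on PATH")
    spark_home = env.get("SPARK_HOME")
    if not spark_home:
        problems.append("SPARK_HOME is not set")
    elif not os.path.isdir(spark_home):
        problems.append(f"SPARK_HOME is not a directory: {spark_home}")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("missing requirement:", problem)
```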
Follow the steps below to run TopNotch:
- Clone this repo.
- Get the latest JAR either by building this project or by downloading it from the releases portion of TopNotch's GitHub page. Place it in this project's top level bin folder.
- Create the configuration files to test your data set.
- See the example folder for a sample data set and configuration files.
- Run bin/TopNotchRunner.sh with all the configuration files passed as arguments. Ensure that your plan file is the first argument.
- To try the example, run the commands chmod u+x bin/TopNotchRunner.sh and then bin/TopNotchRunner.sh example/plan.json example/assertions.json. Please note that the example must be run from the root folder of the TopNotch project, as shown in the prior commands.
- View the resulting report and parquet file in the topnotch folder in your home directory on HDFS.
- To view the results of the example, look at the JSON file topnotch/exampleAssertionReport and the Parquet file example/exampleAssertionOutput.parquet.
Please note that you must change bin/TopNotchRunner.sh in order to run TopNotch with a master other than local. It is currently recommended that you run TopNotch in local or client mode.
At the top level there are plans. These plans group together commands into a series that is run on one or more data sets. TopNotch quality controls data using three commands: assertions, diffs, and views. Assertions encode rules. Diffs and views transform data sets so that they can be processed by rules.
- Assertion: define and measure metrics of data quality
- Diff: create a new data set by comparing two other data sets with similar schemas
- View: create a new data set by joining or selecting a subset of existing data sets to be used in diffs and assertions
The commands' reusable configuration settings are defined in files separate from the plan. The setting externalParamsFile in the command references that file. See the section Components for a more complete description.
Here is a basic example of TopNotch. It is a plan which contains one command, an assertion.
{
  "topnotch": [ {
    "command": "assertion",
    "externalParamsFile": "testAssertion.json",
    "input": {
      "ref": "viewKey",
      "onDisk": false
    },
    "outputKey": "exampleAssertionKey",
    "outputPath": "/user/durst/topnotch/exampleAssertionOutput.parquet"
  } ]
}
The file testAssertion.json, referenced by the plan's externalParamsFile setting, contains the assertion itself:
{
  "topnotch" : {
    "assertions" : [ {
      "query": "loanBal > 0",
      "description": "Loan balances are positive",
      "threshold": 0.01
    } ]
  }
}
Go to http://blackrock.github.io/TopNotch to see the Scaladocs.
- Dataframe: A nested, tabular data set
- Assertion: A rule command. It defines a measure of quality and collects all rows that are invalid according to that measure
- Invalid: A row is considered "invalid" if it does not pass the query clause of an assertion
- Failure: An assertion "fails" if the fraction of rows that it declares to be "invalid" in a data set is greater than a user-specified threshold.
- View: A command that transforms one or more data sets into a single data set against which assertions can be run
- Diff: A command that transforms two data sets into one by joining them on a unique key and then comparing user-specified columns
- Plan: A user-defined combination of assertions, views, and diffs
- Row: A single data point in a dataframe
There are four components to TopNotch. Each component is persisted in JSON format. The following list describes each component in depth and demonstrates how to write its JSON file:
- The Plan: This runs a series of commands that quality control data. Subsequent commands in the plan can depend on the outputs of previous ones. A plan can contain any number of views, diffs, and assertions, including zero of a given type, but it must contain at least one command.
- Each command takes at least one input. The view command takes a list of inputs in the JSON field "inputs". The diff command takes two inputs in the fields "input1" and "input2". The assertion command takes one input in the field "input". Each input is either a character-delimited or Parquet file on HDFS or a dataframe in memory that is the output of a previous command in the plan.
- Set "onDisk" to "true" if the input is a file on disk and "false" if it is the result of a previous command.
- Set "delimiter" only if the input is a character-delimited file. In that case, set the option to the file's delimiter.
- Each command outputs a dataframe that can be referred to by later commands. Set the name of this with the "outputKey" value. If the user wants this value to be cached for faster access, set "cache" to "true". The "cache" flag is optional and defaults to "false" if not specified.
- Each command can write its output to disk. Set "outputPath" to a path on HDFS if the result of a command is to be persisted. Relative paths will be relative to the user's home directory on HDFS. If "outputPath" is not set, the command's result will not be persisted.
- For each "externalParamsFile" entry, enter the path to the file relative to the plan. Currently, it is recommended that users place the plan and all associated command files in the same directory.
- For each assertion command, a report is written to the ~/topnotch folder on HDFS in JSON with the name of the command's output key.
{
  "topnotch": [
    {
      "command": "view",
      "externalParamsFile": "testView.json",
      "inputs": [
        {
          "ref": "topnotch/viewInput.csv",
          "onDisk": true,
          "delimiter": ","
        }
      ],
      "outputKey": "viewKey",
      "cache": true
    },
    {
      "command": "diff",
      "externalParamsFile": "testDiff.json",
      "input1": {
        "ref": "topnotch/currentLoans.parquet",
        "onDisk": true
      },
      "input1Name": "cur",
      "input2": {
        "ref": "topnotch/oldLoans.parquet",
        "onDisk": true
      },
      "input2Name": "old",
      "outputKey": "diffKey",
      "outputPath": "topnotch/diffOutput.parquet"
    },
    {
      "command": "assertion",
      "externalParamsFile": "testAssertion.json",
      "input": {
        "ref": "viewKey",
        "onDisk": false
      },
      "outputKey": "assertionKey",
      "outputPath": "topnotch/assertionOutput.parquet"
    }
  ]
}
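The input rules for a plan can be checked mechanically before a run. The following Python sketch is an illustration only and is not TopNotch's own validation; the names check_plan, command_inputs, and REQUIRED_INPUT_FIELDS are hypothetical. It assumes the plan has already been parsed from JSON, and it checks three things described above: the plan has at least one command, each command has the input fields for its type, and every in-memory input names the outputKey of an earlier command.

```python
# Input fields each command type must provide, as described in the plan rules.
REQUIRED_INPUT_FIELDS = {
    "view": ["inputs"],
    "diff": ["input1", "input2"],
    "assertion": ["input"],
}


def command_inputs(command):
    """Return the input objects of one plan command as a flat list."""
    if command["command"] == "view":
        return list(command["inputs"])
    return [command[field] for field in REQUIRED_INPUT_FIELDS[command["command"]]]


def check_plan(plan):
    """Return a list of problems found in a parsed plan (empty if none)."""
    problems = []
    commands = plan.get("topnotch", [])
    if not commands:
        problems.append("plan must contain at least one command")
    known_keys = set()  # outputKeys produced by earlier commands
    for position, command in enumerate(commands):
        kind = command.get("command")
        if kind not in REQUIRED_INPUT_FIELDS:
            problems.append(f"command {position}: unknown command type {kind!r}")
            continue
        missing = [f for f in REQUIRED_INPUT_FIELDS[kind] if f not in command]
        if missing:
            problems.append(f"command {position}: missing field(s) {missing}")
            continue
        for item in command_inputs(command):
            # An in-memory input must name the outputKey of an earlier command.
            if not item.get("onDisk") and item.get("ref") not in known_keys:
                problems.append(
                    f"command {position}: {item.get('ref')!r} is not an earlier outputKey"
                )
        if "outputKey" in command:
            known_keys.add(command["outputKey"])
    return problems


plan = {
    "topnotch": [
        {"command": "view", "externalParamsFile": "testView.json",
         "inputs": [{"ref": "topnotch/viewInput.csv", "onDisk": True, "delimiter": ","}],
         "outputKey": "viewKey", "cache": True},
        {"command": "assertion", "externalParamsFile": "testAssertion.json",
         "input": {"ref": "viewKey", "onDisk": False},
         "outputKey": "assertionKey"},
    ]
}
print(check_plan(plan))  # prints []
```

In practice the plan would be loaded with json.load from the plan file. The check does not look at the files on HDFS or at the external parameter files.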
- The Assertion Runner: For each assertion command in a plan, this applies a number of assertions to a data set, produces a dataframe containing all the rows declared invalid by any assertion, and creates a summary of how well the data set abides by the assertions run against it.
- The query uses syntax from the where clause of a HiveQL query. Each query (where clause) defines rows that are declared valid. Those not selected by the where clause are declared invalid.
- The format for an assertion JSON file is:
{
  "topnotch" : {
    "assertions" : [
      {
        "query": "loanBal > 0",
        "description": "Loan balances are positive",
        "threshold": 0.01
      },
      {
        "query": "loanBal > 1",
        "description": "Loan balances are greater than 1",
        "threshold": 0.02
      }
    ]
  }
}
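The semantics of an assertion (rows not selected by the query are invalid, and the assertion fails when the invalid fraction exceeds the threshold) can be illustrated with a small Python sketch. This is not TopNotch's implementation: it uses an in-memory SQLite table as a stand-in for a Spark dataframe and SQLite's WHERE clause as a stand-in for HiveQL, and the function name run_assertions and the report fields invalidFraction and failed are hypothetical.

```python
import sqlite3


def run_assertions(rows, assertions):
    """Apply each assertion to rows; return (report, invalid_rows).

    rows: non-empty list of dicts that all share the same keys.
    assertions: list of dicts with "query", "description", "threshold".
    """
    if not rows:
        raise ValueError("at least one row is required")
    columns = list(rows[0])
    conn = sqlite3.connect(":memory:")
    conn.execute(f"CREATE TABLE data ({', '.join(columns)})")
    placeholders = ", ".join("?" for _ in columns)
    conn.executemany(
        f"INSERT INTO data VALUES ({placeholders})",
        [tuple(row[c] for c in columns) for row in rows],
    )
    all_ids = {r[0] for r in conn.execute("SELECT rowid FROM data")}
    invalid_anywhere = set()
    report = []
    for assertion in assertions:
        # The query selects the valid rows; everything else is invalid.
        valid = {r[0] for r in conn.execute(
            f"SELECT rowid FROM data WHERE {assertion['query']}")}
        invalid = all_ids - valid
        invalid_anywhere |= invalid
        fraction = len(invalid) / len(rows)
        report.append({
            "description": assertion["description"],
            "invalidFraction": fraction,
            "failed": fraction > assertion["threshold"],
        })
    conn.close()
    # SQLite assigns rowids 1..n in insertion order for this fresh table.
    invalid_rows = [rows[i - 1] for i in sorted(invalid_anywhere)]
    return report, invalid_rows


rows = [
    {"loanID": 1, "loanBal": 100},
    {"loanID": 2, "loanBal": 0},
    {"loanID": 3, "loanBal": 250},
    {"loanID": 4, "loanBal": 1},
]
assertions = [
    {"query": "loanBal > 0", "description": "Loan balances are positive",
     "threshold": 0.3},
    {"query": "loanBal > 1", "description": "Loan balances are greater than 1",
     "threshold": 0.02},
]
report, invalid_rows = run_assertions(rows, assertions)
```

With these illustrative rows and thresholds, the first assertion marks one of four rows invalid (0.25, below its threshold of 0.3) and passes, while the second marks two of four invalid (0.5) and fails. The invalid rows returned are those rejected by at least one assertion, here loans 2 and 4.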
- The Diff Creator: For each diff command in a plan, this joins two data sets on columns that form a unique key and then compares the values in other columns of the data sets.
- Columns in equal locations in the "joinColumns" arrays are the keys to join the two data sets.
- Columns in equal locations in the "diffColumns" arrays are the columns to be compared.
- There must be at least one set of join columns and another of diff columns. The "joinColumns" arrays must have the same number of elements and so too must the "diffColumns" arrays.
- Columns can be joined and compared even if they have different names because comparisons are determined by positions in the "joinColumns" and "diffColumns" arrays.
- The format for a diff JSON file is:
{
  "topnotch": {
    "input1Columns": {
      "joinColumns": [ "loanID", "poolNum" ],
      "diffColumns": [ "loanBal" ]
    },
    "input2Columns": {
      "joinColumns": [ "loanIDOld", "poolNumOld" ],
      "diffColumns": [ "loanBalOld" ]
    }
  }
}
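The positional pairing of "joinColumns" and "diffColumns" can be sketched in plain Python. This is an illustration under stated assumptions, not TopNotch's implementation: the function name diff_rows and the output column suffix _equal are hypothetical, rows are plain dicts rather than dataframe rows, and rows without a match in the other data set are dropped (an inner join), which the text above does not specify.

```python
def diff_rows(rows1, rows2, diff_config):
    """Join two row lists on positionally paired key columns and compare
    positionally paired diff columns. Output layout is hypothetical."""
    cols1, cols2 = diff_config["input1Columns"], diff_config["input2Columns"]
    join1, join2 = cols1["joinColumns"], cols2["joinColumns"]
    diff1, diff2 = cols1["diffColumns"], cols2["diffColumns"]
    if not join1 or not diff1:
        raise ValueError("need at least one join column and one diff column")
    if len(join1) != len(join2) or len(diff1) != len(diff2):
        raise ValueError("column arrays of the two inputs must have equal length")

    # Index the second data set by its key; the key is assumed to be unique.
    index2 = {tuple(row[c] for c in join2): row for row in rows2}

    result = []
    for row1 in rows1:
        key = tuple(row1[c] for c in join1)
        row2 = index2.get(key)
        if row2 is None:
            continue  # assumption: unmatched rows are dropped (inner join)
        out = dict(zip(join1, key))
        # Columns are paired by position, so their names may differ.
        for c1, c2 in zip(diff1, diff2):
            out[f"{c1}_equal"] = row1[c1] == row2[c2]
        result.append(out)
    return result


diff_config = {
    "input1Columns": {"joinColumns": ["loanID", "poolNum"], "diffColumns": ["loanBal"]},
    "input2Columns": {"joinColumns": ["loanIDOld", "poolNumOld"], "diffColumns": ["loanBalOld"]},
}
cur = [
    {"loanID": 1, "poolNum": 7, "loanBal": 100.0},
    {"loanID": 2, "poolNum": 7, "loanBal": 50.0},
]
old = [
    {"loanIDOld": 1, "poolNumOld": 7, "loanBalOld": 100.0},
    {"loanIDOld": 2, "poolNumOld": 7, "loanBalOld": 75.0},
]
result = diff_rows(cur, old, diff_config)
```

Here diff_config corresponds to the object under the "topnotch" key of the diff file. Loan 1 has the same balance in both data sets and loan 2 does not, so the sketch reports loanBal_equal as True for the first row and False for the second.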
- The View Creator: For each view command in a plan, this takes in one or more data sets and produces a single dataframe based on a HiveQL query defined by the user. Use this to transform data into a form against which diffs and assertions can be run.
- The inputs specified in the command file will be loaded as tables with names specified in the "tableAliases" array for the query.
- The result of the query is the command's output.
- The format for a view JSON file is:
{
  "topnotch": {
    "tableAliases": [ "loanData" ],
    "query": "select * from loanData"
  }
}
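The way a view exposes its inputs as named tables to a query can be sketched in Python. As an illustration only, SQLite stands in for HiveQL and lists of dicts stand in for dataframes; the function name create_view is hypothetical, and pairing inputs with "tableAliases" by position is an assumption that the text above does not state explicitly.

```python
import sqlite3


def create_view(inputs, view_config):
    """Load each input under its alias, run the query, return result rows.

    inputs: list of non-empty row lists (dicts), one per plan input.
    view_config: the object under the "topnotch" key of a view file.
    """
    aliases = view_config["tableAliases"]
    if len(aliases) != len(inputs):
        raise ValueError("need exactly one table alias per input")
    conn = sqlite3.connect(":memory:")
    conn.row_factory = sqlite3.Row  # lets result rows convert to dicts
    for alias, rows in zip(aliases, inputs):  # assumption: paired by position
        columns = list(rows[0])
        conn.execute(f"CREATE TABLE {alias} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO {alias} VALUES ({placeholders})",
            [tuple(row[c] for c in columns) for row in rows],
        )
    result = [dict(row) for row in conn.execute(view_config["query"])]
    conn.close()
    return result


view_config = {"tableAliases": ["loanData"], "query": "select * from loanData"}
loans = [{"loanID": 1, "loanBal": 100.0}, {"loanID": 2, "loanBal": 0.0}]
view = create_view([loans], view_config)
```

With the example view file, the query selects every row of the single input, so the view equals the input. A query with a join or a where clause would produce a combined or filtered data set in the same way.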
Copyright © 2016 BlackRock, Inc. All Rights Reserved.