The purpose of this API is to validate CSV files for compliance with established norms such as RFC 4180. Think of it as an easy-to-use CSV linter.
Comma-separated values (CSV) is a common format for exchanging data between systems. While the format is widespread, it can present difficulties. Why? Different tools and export processes often generate outputs that are not CSV files at all, or that have variations that are not considered valid according to RFC 4180.
Invalid CSV files create challenges for those building data pipelines. Pipelines depend on consistency, predictability, and testability to ensure uninterrupted operation from source to target destination.
Imagine pouring a gallon of maple syrup into the gas tank of your car. That is what a bad CSV file does to a data pipeline. The earlier an error can be trapped in the process, the better the operation of every downstream system.
This API helps you determine the quality of CSV data prior to delivery to downstream data pipelines. It also generates a schema for the tested file, which can further aid your validation workflows.
Here is an example of a valid CSV file. It has a header row with `foo`, `bar`, and `buzz`, and a corresponding data row with `aaa`, `bbb`, and `ccc`:
foo | bar | buzz |
---|---|---|
aaa | bbb | ccc |
The raw CSV will look something like this:

```
foo,bar,buzz
aaa,bbb,ccc
```
However, what if one day something changed? The file now looks like this:
foo | bar | buzz | |
---|---|---|---|
aaa | zzz | bbb | ccc |

The raw CSV:

```
foo,bar,buzz
aaa,zzz,bbb,ccc
```
So what is wrong with this? RFC 4180 says:

> Within the header and each record, there may be one or more fields, separated by commas. Each line should contain the same number of fields throughout the file. Spaces are considered part of a field and should not be ignored. The last field in the record must not be followed by a comma.
Notice the additional `zzz` is now between `aaa` and `bbb`. This file would be marked as invalid because of this misalignment. Is `zzz` correct? What about `ccc`? Maybe there is a missing header value for `ccc`? Regardless, the file has issues.
For other examples, please take a look at the RFC 4180 document for guidance on the proper format of comma-separated values (CSV) files.
There are two steps to the validation process. The first step is to post the file(s) to the API. If a file is accepted, the API returns a polling endpoint; the second step is to query that endpoint for the results of the validation process.
In our example, we will assume you have a CSV file called `your.csv` that you want to test.
The API has a 10 MB limit per file posted. See our client code samples for how to manage this limit (hint: split large files). Also, a header row is required in your CSV for validation to work correctly.
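As a rough sketch of that hint (an illustration, not one of the provided clients), the following Python snippet splits a large CSV into row-based chunks, repeating the header row in each chunk so every piece remains a valid, header-bearing file. The `max_rows` value is an arbitrary assumption; tune it so each chunk stays under the 10 MB limit.

```python
import csv

def split_csv(path, max_rows=50000):
    """Split a CSV into smaller chunk files, repeating the header in each chunk."""
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)  # the API requires a header row in every file
        chunk, chunks = [], []
        for row in reader:
            chunk.append(row)
            if len(chunk) >= max_rows:
                chunks.append(chunk)
                chunk = []
        if chunk:
            chunks.append(chunk)

    paths = []
    for i, rows in enumerate(chunks):
        out_path = f"{path}.part{i}.csv"
        with open(out_path, "w", newline="") as dst:
            writer = csv.writer(dst)
            writer.writerow(header)
            writer.writerows(rows)
        paths.append(out_path)
    return paths
```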
We will run a simple `curl` command that will send the data to the API. It will look like this:
```bash
curl -F 'file=@/path/to/your.csv' 'https://validation.openbridge.io/dryrun' -H 'Content-Type: multipart/form-data' -D -
```
For the sake of this example, we will assume that the file posted successfully. You will see an `HTTP/2 302` response like this:
```
HTTP/2 302
content-type: application/json
content-length: 2
location: https://validation.openbridge.io/dryrun/poll_async/99ax58f2020423v28c6e644cd143cdac
date: Fri, 15 Feb 2019 22:36:57 GMT
x-amzn-requestid: 33316acc-3172-11e9-b824-ddbb1dd27279
x-amz-apigw-id: VKbJWGPtIAMF5Wg=
x-amzn-trace-id: Root=1-5c673f07-252f6e91546d64f019526eb6;Sampled=0
x-cache: Miss from cloudfront
via: 1.1 ab6d050b627b51ed631842034b2e298b.cloudfront.net (CloudFront)
x-amz-cf-id: XLW--LDgeqe7xpLvuoKxaGvKfvNcB4BNkyvx62P99N3qfgeuAvT7EA==
```
Note the `location` URL endpoint:

```
https://validation.openbridge.io/dryrun/poll_async/99ax58f2020423v28c6e644cd143cdac
```
This is your polling endpoint. You use this to obtain the results from the validation process. You can use `curl` again:
```bash
curl -s -w "%{http_code}" "https://validation.openbridge.io/dryrun/poll_async/99ax58f2020423v28c6e644cd143cdac"
```
If the file was properly formatted, with no errors, you will get an `HTTP/2 200` response from your polling endpoint. This indicates success! As a bonus, the response also provides an inferred schema. For our `your.csv` file, the schema looks like this:
{"data": {"rules": {"configuration": {"load": {"prepend_headers": false, "schema": {"fields": [{"default": null, "include_in_checksum": true, "name": "ob_mws_seller_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "ob_mws_marketplace_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_name", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_description", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "listing_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "seller_sku", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "price", "type": "DOUBLE PRECISION"}, {"default": null, "include_in_checksum": true, "name": "quantity", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "open_date", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "image_url", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_is_marketplace", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "product_id_type", "type": "BIGINT"}, {"default": null, "include_in_checksum": true, "name": "zshop_shipping_fee", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_note", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "item_condition", "type": "BIGINT"}, {"default": null, "include_in_checksum": true, "name": "zshop_category1", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "zshop_browse_path", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "zshop_storefront_feature", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "asin1", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "asin2", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "asin3", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "will_ship_internationally", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "expedited_shipping", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "zshop_boldface", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "product_id", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "bid_for_featured_placement", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "add_delete", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "pending_quantity", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "fulfillment_channel", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": true, "name": "merchant_shipping_group", "type": "VARCHAR (1024)"}, {"default": null, "include_in_checksum": false, "name": "ob_transaction_id", "type": "varchar(256)"}, {"default": null, "include_in_checksum": false, "name": "ob_file_name", "type": "varchar(2048)"}, {"default": null, "include_in_checksum": false, "name": "ob_processed_at", "type": "varchar(256)"}, {"default": "getdate()", "include_in_checksum": false, "name": "ob_modified_date", "type": "datetime"}]}}}, "destination": {"tablename": "sample"}, "dialect": {"__doc__": 
null, "__module__": "csv", "_name": "sniffed", "delimiter": ",", "doublequote": true, "encoding": "UTF-8", "lineterminator": "\r\n", "quotechar": "\"", "quoting": 0, "skipinitialspace": false}, "meta": {"creation_date": null, "version": null}}}}
Congratulations, the file passed validation and generated a schema for you!
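If you would rather script the two-step flow than run `curl` by hand, here is a minimal Python sketch using the third-party `requests` library. It is an illustration under a few assumptions (the `file` form field from the `curl` example above, an arbitrary five-second polling interval), not an official client:

```python
import time

import requests  # third-party HTTP client (pip install requests)

API_URL = "https://validation.openbridge.io/dryrun"

def validate(path, interval=5, attempts=60):
    # Step 1: post the file. Redirects are disabled so the polling
    # endpoint can be read from the Location header of the 302 response.
    with open(path, "rb") as f:
        resp = requests.post(API_URL, files={"file": f}, allow_redirects=False)
    resp.raise_for_status()
    poll_url = resp.headers["location"]

    # Step 2: poll until the API stops answering 302 (still processing).
    for _ in range(attempts):
        poll = requests.get(poll_url, allow_redirects=False)
        if poll.status_code != 302:
            break
        time.sleep(interval)

    if poll.status_code == 200:
        # Success: the body contains the inferred schema shown above.
        fields = poll.json()["data"]["rules"]["configuration"]["load"]["schema"]["fields"]
        for field in fields:
            print(field["name"], field["type"])
    else:
        print("Validation failed with status", poll.status_code)

validate("path/to/your.csv")
```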
We have provided a couple of client scripts for you to use as-is or modify as needed.
The first is a `bash` client. It takes a CSV file as an argument, then splits the file into chunks for processing. Each chunk is sent to the API for testing. This makes the process more atomic: rather than testing one large file, it tests smaller chunks. If there is an error, validation fails on a specific chunk, allowing you to more easily isolate the problem.
```bash
bash -c './validation_client.sh path/to/your.csv'
```
The Python client works the same way: pass it a file via `-f` and it will split and send chunks, just like the `bash` client.
```bash
python ./validation_client.py -f path/to/your.csv
```
If your file is comma delimited, confirm that there are no extra unquoted commas in your data. An unquoted comma in your data will be treated as a field delimiter, with the result that your data is 'shifted' and one or more extra columns are created.
In the example below, there is an unquoted comma (`Openbridge, Inc.`) in the second row of the file named `company_sales_20190207.csv`, which is being loaded to a warehouse table named `company_sales` that has been defined with three fields: `comp_name` (string), `comp_no` (integer), and `sales_amt` (decimal).
File: `company_sales_20190207.csv`

```
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge,Inc., 101, 100.00
```
comp_name | comp_no | sales_amt |
---|---|---|
Acme Corp. | 100 | 500.00 |
Openbridge,Inc. | 101 | 100.00 |
A typical error for this use case would be `Error: could not convert string to integer`.
As you can see, the unquoted comma in the 'Openbridge, Inc.' value is treated as a delimiter, and the text after the comma is 'shifted' into the next column (`comp_no`). In this example, the file will fail because there is a field type mismatch between the value 'Inc.' (string) and the table field type (integer).
The first step to resolving this failure is to determine which field value(s) have the unquoted comma. Excel can help with this, as it will place the field values into spreadsheet columns based on the comma delimiter. Once the file is open in Excel, you can filter each column to identify values that do not belong (e.g. a value of 'openbridge' in a field named `sales_amt` that should contain decimal values).
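If you prefer a programmatic check, here is a minimal Python sketch (not one of the provided clients) that flags rows whose field count differs from the header row, which is the usual symptom of an unquoted comma:

```python
import csv

def find_shifted_rows(path):
    """Report rows whose field count differs from the header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        for line_no, row in enumerate(reader, start=2):
            if len(row) != len(header):
                print(f"line {line_no}: expected {len(header)} fields, got {len(row)}: {row}")

find_shifted_rows("company_sales_20190207.csv")
```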
Once identified, there are a couple of options for 'fixing' this data:

- Surround the field value with double quotes (e.g. `"Openbridge, Inc."`). This tells the system that the comma is part of the field value and not a delimiter. As long as the field in question is defined properly (in this case as a string), the data for that row will load successfully. A sketch of this approach follows the list.
- You can also remove the comma from the field value (e.g. change the value `Openbridge,Inc.` to `Openbridge Inc.`). While viable, the correct approach is to use the double quotes described in the first option.
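As a sketch of the first option, Python's built-in `csv` module applies the double-quote fix automatically: its default `QUOTE_MINIMAL` policy wraps any field containing the delimiter in double quotes when writing. The rows below are taken from the example above:

```python
import csv
import sys

# csv.writer's default QUOTE_MINIMAL policy wraps any field that
# contains the delimiter in double quotes, so the comma in the
# company name is preserved as data rather than read as a delimiter.
writer = csv.writer(sys.stdout)
writer.writerow(["comp_name", "comp_no", "sales_amt"])
writer.writerow(["Acme Corp.", "100", "500.00"])
writer.writerow(["Openbridge, Inc.", "101", "100.00"])
# Output:
# comp_name,comp_no,sales_amt
# Acme Corp.,100,500.00
# "Openbridge, Inc.",101,100.00
```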
Once this issue is resolved in your file, re-post the file to the API for testing.
As described in the previous use case, quotation marks can be used to surround values containing commas so the commas are not treated as delimiters. However, a single unmatched quotation mark is problematic: the system will consider everything after that quotation mark, until the end of the row, to be part of the field, and the file will fail to load.
In the example below, there is a single quotation mark in the second row of the file named `company_sales_20190208.csv`, which is being loaded to a warehouse table named `company_sales`.
File: `company_sales_20190208.csv`
Table: `company_sales`

```
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge,Inc.,"101,100.00
```
comp_name (string) | comp_no (integer) | sales_amt (decimal) |
---|---|---|
Acme Corp. | 100 | 500.00 |
Openbridge,Inc. | 101,100.00 | |
As you can see, the single quotation mark before `101` caused the system to treat the rest of the record as the value for the `comp_no` field, and the record will once again fail to load because the data types do not align.
There are a couple of options to resolve this issue, depending on whether the quotation marks were intended to surround a value or were included in error:

- If the quotation marks were intended, add the closing quotation mark at the end of the respective value.
- If the single quotation mark was added in error, delete it.
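Finding the offending row by eye can be tedious. As a rough heuristic (a sketch, and imperfect for quoted fields that span multiple lines), a physical line with an odd number of double-quote characters usually contains an unmatched quotation mark:

```python
def find_unbalanced_quotes(path):
    """Flag physical lines with an odd number of double-quote characters."""
    with open(path) as f:
        for line_no, line in enumerate(f, start=1):
            if line.count('"') % 2 != 0:
                print(f"line {line_no} may contain an unmatched quote: {line.rstrip()}")

find_unbalanced_quotes("company_sales_20190208.csv")
```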
Once this issue is resolved in your file, re-post the file to the API.
Sometimes source systems include leading or trailing spaces in data fields. According to RFC 4180, spaces are considered part of a field and should not be ignored.
As a result, we do not treat spaces as a failure in our validation tests, since they conform to the specification.
However, data containing leading or trailing spaces can still cause problems. Specifically, issues can arise if a file includes empty or null values. If those null values contain spaces, a system may treat them as strings, which may not be compatible with the field type defined in your destination table (e.g. integer).
Another potential issue is that spaces can cause problems with joins between tables in your warehouse. If trimming spaces from your source data is not possible, you can typically remove those spaces by creating views in your warehouse using `TRIM` functions.
In many cases you will want to remove (trim) these spaces prior to posting.
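As an illustration (a minimal sketch using Python's built-in `csv` module and the example file name from below), trimming every field before posting might look like this:

```python
import csv

def trim_fields(src_path, dst_path):
    """Rewrite a CSV with leading/trailing whitespace stripped from every field."""
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            writer.writerow([field.strip() for field in row])

trim_fields("company_sales_20190209.csv", "company_sales_20190209_trimmed.csv")
```

Note that a field containing only spaces becomes an empty string, which also addresses the null-value issue described next.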
In the example below, there is an empty value containing spaces for the field `sales_amt` in the second row of the file named `company_sales_20190209.csv`, which is being loaded to a warehouse table named `company_sales`. That record will fail because the string field type associated with the spaces does not match the decimal field type in the table definition.
File: `company_sales_20190209.csv`
Table: `company_sales`

```
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge Inc., 101, ,
```
comp_name (string) | comp_no (integer) | sales_amt (decimal) |
---|---|---|
Acme Corp. | 100 | 500.00 |
Openbridge Inc. | 101 | ' ' |
There are a couple of options to resolve this issue, depending on whether the null value included in the file is valid or not:

- If the null value is valid, remove the spaces from the field value to indicate a true null. A field containing spaces is not null.
- If it is not, replace the value with a valid field value that matches the field data type in the table (in our example, a decimal).
It is possible that your data types will change over time. This means your old and new schemas may no longer have compatible field types defined in your destination table.
Let's say you attempt to process a file where all values for a column called `sales_amt` are an INT data type, yet historical data for `sales_amt` used a decimal data type. This will cause downstream processing failures: a system will have a schema setting `sales_amt` as an INT while you are trying to process and load older data with a decimal data type.
In the example below, there is a file named `company_product_today.csv` that includes the field `sales_amt` and is being loaded to a warehouse table named `company_product`. Here is an example of the current file that is sent every day:
- File: `company_product_today.csv`
- Table: `company_product`

Notice that `sales_amt` is an INT:
```
comp_name,comp_no,sales_amt
Acme Corp.,100,500
Openbridge Inc.,101,300
```
comp_name (string) | comp_no (integer) | sales_amt (integer) |
---|---|---|
Acme Corp. | 100 | 500 |
Openbridge Inc. | 101 | 300 |
Now, let's say you have historical data from a few years ago that you also want to load into the same destination. However, `sales_amt` was not an INT back then; it was a decimal:
File: `company_product_history.csv`
Table: `company_product`

```
comp_name,comp_no,sales_amt
Acme Corp.,100,500.00
Openbridge Inc.,101,300.00
Wiley,103,1200.00
```
comp_name (string) | comp_no (integer) | sales_amt (decimal) |
---|---|---|
Acme Corp. | 100 | 500.00 |
Openbridge Inc. | 101 | 300.00 |
Wiley | 103 | 1200.00 |
When you attempt to load `company_product_history.csv`, it will fail as a result of this data type mismatch: the decimal values in `sales_amt` do not match the INT field type in the `company_product` table definition.
There are a couple of options to resolve this issue:

- Rather than sending historical data to the same table, send it to an alternate table and then use a view to fuse them together.
- Use the API to check all of your current and historical data. This allows you to model the data so you can determine a common data type that works across all variations; see the sketch after this list.
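As an illustration of that second option (a hedged sketch with the example file names from above, not part of the official clients), you could scan a column across all of your files and report which basic types its values parse as:

```python
import csv

def column_types(paths, column):
    """Report which basic types a column's values parse as across files."""
    types = set()
    for path in paths:
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                value = row[column].strip()
                try:
                    int(value)
                    types.add("integer")
                except ValueError:
                    try:
                        float(value)
                        types.add("decimal")
                    except ValueError:
                        types.add("string")
    return types

print(column_types(["company_product_today.csv", "company_product_history.csv"], "sales_amt"))
```

If the scan reports more than one type, choose the widest (in this example, decimal over INT) as the common type for the destination table.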
When posting a file:

- Status Code: `HTTP/2 302`; Success - processing has begun
- Status Code: `HTTP/2 404`; Failure - no sample file was provided
- Status Code: `HTTP/2 400`; Failure - the rules API was unable to initialize the request for some reason

When polling for results:

- Status Code: `HTTP/2 302`; Pending - still processing
- Status Code: `HTTP/2 200`; Success - processing completed, file validated successfully
- Status Code: `HTTP/2 502`; Failure - file determined to be invalid by rules API
- Status Code: `HTTP/2 404`; Failure - invalid request ID in polling URL (expired, etc.)
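Here is a minimal sketch of how a client might branch on these polling codes (again assuming the third-party `requests` library; the interval and attempt count are arbitrary):

```python
import time

import requests  # third-party HTTP client (pip install requests)

def poll_result(poll_url, interval=5, attempts=60):
    """Poll the validation endpoint, branching on the documented status codes."""
    for _ in range(attempts):
        resp = requests.get(poll_url, allow_redirects=False)
        if resp.status_code == 302:      # pending: still processing
            time.sleep(interval)
        elif resp.status_code == 200:    # success: body contains the inferred schema
            return resp.json()
        elif resp.status_code == 502:    # failure: file determined to be invalid
            raise ValueError(resp.text)
        elif resp.status_code == 404:    # failure: invalid/expired request ID
            raise LookupError("invalid request ID in polling URL")
        else:
            resp.raise_for_status()
    raise TimeoutError("validation did not complete in time")
```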
If you have any problems with or questions about this API, please contact us through a GitHub issue.
You are invited to contribute new features, fixes, or updates, large or small; we are always thrilled to receive pull requests, and do our best to process them as fast as we can.
Before you start to code, we recommend discussing your plans through a GitHub issue, especially for more ambitious contributions. This gives other contributors a chance to point you in the right direction, give you feedback on your design, and help you find out if someone else is working on the same thing.