As noted elsewhere, this is relatively trivial to implement engineering-wise but non-trivial config-wise/spec-wise, specifically because there isn't, and never has been, an exact "CSV spec".
"CSV" is really not a single spec
Rather, CSV should be considered a family of related formats, which may or may not be comma delimited. The Wikipedia article (unintentionally) does a pretty good job of highlighting the challenges here.
From https://en.wikipedia.org/wiki/Comma-separated_values (emphasis mine):
> The CSV file format is not fully standardized. Separating fields with commas is the foundation, but commas in the data or embedded line breaks have to be handled specially. Some implementations disallow such content while others surround the field with quotation marks, which yet again creates the need for escaping if quotation marks are present in the data.
>
> The term "CSV" also denotes several closely-related delimiter-separated formats that use other field delimiters such as semicolons.[2] These include tab-separated values and space-separated values. A delimiter guaranteed not to be part of the data greatly simplifies parsing.
>
> Alternative delimiter-separated files are often given a ".csv" extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons are often used instead of commas in many European locales in order to use the comma as the decimal separator and, possibly, the period as a decimal grouping character.
## Commonly unsupported options
Examples of classic challenges when using CSV files as an interchange format:
- Ambivalence towards column headers.
  - Notoriously, Snowflake doesn't read file headers - it only specifies a number of header rows to skip, and it expects all fields to be provided in ordinal order matching the filespec.
- Newlines triggering "next record" regardless of escaping/quoting.
  - Many CSV parsers do not correctly parse newlines when they appear as content within field data. A config-level option for taps may be required to replace newline characters with another string, such as `';'` or `''` (empty string). Sometimes it is preferable to replace the newline character with `'\n'`, and then the dbt layer or another post-processing step may replace the string `'\n'` with an actual newline character. (See the sketch after this list.)
- Unsupported character encodings.
  - This occurs most often when dealing with cross-OS or cross-region interchange. To my knowledge, this is best mitigated by explicitly requiring UTF-8 encoding.
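
As a rough illustration of the newline-replacement idea above, here is a minimal sketch of what a tap-side writer might do. The `REPLACE_NEWLINES_WITH` value and the function name are hypothetical, not existing SDK settings; only the `csv` standard-library calls are real.

```python
import csv
import io

# Hypothetical config value (not an existing SDK setting): the string used to
# stand in for embedded newlines, reversible by dbt or another post-processing step.
REPLACE_NEWLINES_WITH = "\\n"  # literal backslash-n


def write_records_to_csv(records: list[dict], fieldnames: list[str]) -> str:
    """Write records to a CSV string, replacing embedded newlines in string values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, dialect="excel")
    writer.writeheader()
    for record in records:
        sanitized = {
            key: value.replace("\r\n", REPLACE_NEWLINES_WITH).replace("\n", REPLACE_NEWLINES_WITH)
            if isinstance(value, str)
            else value
            for key, value in record.items()
        }
        writer.writerow(sanitized)
    return buffer.getvalue()


print(write_records_to_csv([{"id": 1, "notes": "line one\nline two"}], ["id", "notes"]))
```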
## Not re-inventing the wheel
As has been helpful in other contexts, I'll propose we start with an already existing spec for config.
## Details
According to ChatGPT, these are the most popular ways to process CSV files with Python:
- `csv.reader()`: This is part of Python's built-in `csv` module and provides a fast and efficient way to read CSV files. It returns an iterator that can be used to iterate over the rows in the CSV file. This is the most popular method of reading CSV files in Python.
- `pandas.read_csv()`: This is a popular method of reading CSV files in Python. It uses the `pandas` library, which provides high-performance, easy-to-use data structures and data analysis tools. It returns a `DataFrame` object that can be used to manipulate the data.
- `numpy.loadtxt()`: This is another popular method of reading CSV files in Python. It uses the `numpy` library, which provides a powerful array computing and linear algebra toolset. It returns an array object that can be used to manipulate the data.
- `csv.DictReader()`: This is similar to `csv.reader()`, but it returns a dictionary for each row, with the keys being the column headers and the values being the row data. This can be useful if you need to access specific columns of the data.
Essentially, this would point us towards one of three libraries: `csv`, `pandas`, or `numpy`. The `csv` module is the only one shipped in the standard library. The `pandas` and `numpy` libraries would each require an additional dependency (and each brings its own transitive dependencies).
If we prioritize serial and incremental reading/writing, I think `csv.reader()` or `csv.DictReader()` makes good sense, with `csv.reader()` likely having better performance, and `csv.DictReader()` having the advantage of pre-processing the dataset into `dict` objects, which we need to do anyway.
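
For reference, a minimal sketch of the serial, record-at-a-time pattern with `csv.DictReader` (the file name, delimiter, and quote character here are illustrative assumptions):

```python
import csv

# Serial, record-at-a-time reading: nothing is held in memory beyond the current row.
with open("customers.csv", newline="", encoding="utf-8") as handle:
    reader = csv.DictReader(handle, delimiter=",", quotechar='"')
    for row in reader:
        # Each row arrives as a dict keyed by the header row, which maps
        # naturally onto a record message payload.
        print(row)
```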
If we prioritize flexibility or configuration, the `pandas` config options may be more robust and expressive overall. The `pandas` library also has pre-built dialects for `Excel` and other common formats. What I don't know (personally) is whether `pandas`'s prioritization of in-memory analytics would introduce any penalty versus other serial, one-record-in-one-record-out methods, which may have a lower RAM footprint.
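
One way to bound the RAM concern, if we went the `pandas` route, would be chunked reading via the `chunksize` parameter of `pandas.read_csv`. A sketch (file name and chunk size are illustrative assumptions):

```python
import pandas as pd

# Chunked reading keeps memory bounded even for large files; each chunk is a DataFrame.
for chunk in pd.read_csv(
    "customers.csv",   # hypothetical input file
    sep=",",
    quotechar='"',
    encoding="utf-8",
    chunksize=10_000,  # rows per chunk
):
    for record in chunk.to_dict(orient="records"):
        print(record)
```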
Another disadvantage of using `pandas` is that we would then need to ship potentially all SDK taps and targets with the `pandas` library in order to have true universal interop.
UPDATE: Proposal using polars as config base here: #1584 (comment)
### Distinction of config dialect vs. actual implementation details
We can optionally decouple the configuration dialect from how the files actually get processed - although pragmatically speaking, it is easier and more stable if we keep these aligned, at least for the default SDK implementations.
It's also worth calling out that, whatever configuration dialect we choose, native processors will need a translation step. So, if we chose the pandas config dialect, for instance, which accepts an `"excel"` dialect, any source or target receiving this instruction would have to translate this config detail into some set of config parameters that the connected system can understand.
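
To make the translation step concrete, here is a hypothetical helper that maps a small, generic dialect config onto Snowflake CSV file-format options. The Snowflake option names (`TYPE`, `FIELD_DELIMITER`, `FIELD_OPTIONALLY_ENCLOSED_BY`, `SKIP_HEADER`, `ENCODING`) are real, but the input keys and the function itself are illustrative assumptions, not part of the SDK.

```python
# Hypothetical translation from a generic CSV dialect config to a Snowflake
# FILE_FORMAT clause. Input keys ("delimiter", "quote_char", ...) are illustrative.
def dialect_to_snowflake_file_format(dialect: dict) -> str:
    quote_char = dialect.get("quote_char", '"')
    options = {
        "TYPE": "CSV",
        "FIELD_DELIMITER": f"'{dialect.get('delimiter', ',')}'",
        "FIELD_OPTIONALLY_ENCLOSED_BY": f"'{quote_char}'",
        "SKIP_HEADER": "1" if dialect.get("header", True) else "0",
        "ENCODING": f"'{dialect.get('encoding', 'UTF-8')}'",
    }
    return "FILE_FORMAT = (" + " ".join(f"{k} = {v}" for k, v in options.items()) + ")"


print(dialect_to_snowflake_file_format({"delimiter": ";", "header": True}))
```

A dialect option with no Snowflake equivalent is exactly the case where the exception and failover behavior described below would kick in.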
### Raising an exception for unsupported dialect options
We would likely need to introduce a new exception class, used specifically for raising and catching cases where config options are unexpected or unhandled. Raising this exception might not fail the entire sync operation; rather, it could trigger the base-level SDK-driven processors...
### Graceful failover to SDK processing in case of unsupported dialects
Of course, if a connected system like Snowflake or Redshift cannot natively understand the dialect options that the user provides, we may have to send these processing directives to the SDK-backed processors. In this way, we can guarantee that any CSV file sent within our range of expected config options will be successfully processed, even if it cannot be understood natively by the source or target system.
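
A rough sketch of how the new exception class and the failover could fit together. The class and function names are hypothetical placeholders, and the native/SDK loaders are stubs standing in for a real bulk load and the SDK's row-by-row processing.

```python
# Hypothetical exception type; the name and usage are illustrative only.
class UnsupportedDialectOptionError(Exception):
    """Raised when a config option cannot be handled by the native processor."""


def load_natively(path: str, dialect: dict) -> None:
    # Stand-in for a native bulk load (e.g. COPY INTO); rejects unknown options.
    if dialect.get("delimiter", ",") not in (",", "\t", ";", "|"):
        raise UnsupportedDialectOptionError(f"Unsupported delimiter: {dialect['delimiter']!r}")
    print(f"Bulk-loaded {path} natively")


def load_with_sdk_processor(path: str, dialect: dict) -> None:
    # Stand-in for the slower, SDK-driven row-by-row fallback.
    print(f"Processed {path} with the SDK fallback")


def load_batch_file(path: str, dialect: dict) -> None:
    try:
        load_natively(path, dialect)
    except UnsupportedDialectOptionError:
        load_with_sdk_processor(path, dialect)


load_batch_file("orders.csv", {"delimiter": "\x1f"})  # falls back to SDK processing
```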
While this graceful failover is nice from a usability perspective, it may produce unexpected effects performance-wise. If the user expects a CSV to be natively processed by `target-snowflake`, for instance, but the specific dialect is not natively supported by Snowflake (or by the dev's Snowflake implementation), then records will be loaded much more slowly than the end-user might otherwise expect. This leads us to the next point, regarding an end-user option to fail when native processing is expected but not actually available...
### Config option to restrict to native batch-processing only
To mitigate the above, it might be worthwhile to introduce something like a `"batch_config": {"require_native_processor": true}` config that gives users the ability to fail a pipeline if the config dialect cannot be natively processed by the upstream or downstream system. While this is especially needed for CSV, which has an almost infinite cross-product of dialect options, it could also apply to `JSONL` and `parquet` parsing - wherein the tap or target could abort rather than go through the SDK-backed conversion of records to files, which will always be slower than a native bulk export or bulk import operation.