As noted elsewhere, this is relatively trivial to implement engineering-wise but non-trivial config-wise/spec-wise, specifically because there isn't, and never has been, an exact "CSV spec".
"CSV" is really not a single spec
Rather, CSV should be considered a family of related formats, which may or may not be comma delimited. The Wikipedia article (unintentionally) does a pretty good job of highlighting the challenges here.
From https://en.wikipedia.org/wiki/Comma-separated_values (emphasis mine):
> The CSV file format is not fully standardized. Separating fields with commas is the foundation, but commas in the data or embedded line breaks have to be handled specially. Some implementations disallow such content while others surround the field with quotation marks, which yet again creates the need for escaping if quotation marks are present in the data.
>
> The term "CSV" also denotes several closely-related delimiter-separated formats that use other field delimiters such as semicolons.[2] These include tab-separated values and space-separated values. A delimiter guaranteed not to be part of the data greatly simplifies parsing.
>
> Alternative delimiter-separated files are often given a ".csv" extension despite the use of a non-comma field separator. This loose terminology can cause problems in data exchange. Many applications that accept CSV files have options to select the delimiter character and the quotation character. Semicolons are often used instead of commas in many European locales in order to use the comma as the decimal separator and, possibly, the period as a decimal grouping character.
## Commonly unsupported options
Examples of classic challenges when using CSV files as an interchange format:
- Ambivalence towards column headers.
  - Notoriously, Snowflake doesn't read file headers - it only specifies a number of header rows to skip, and it expects all fields to be provided in ordinal order matching the filespec.
- Newlines triggering "next record" regardless of escaping/quoting.
  - Many CSV parsers do not correctly parse newlines when they appear as content within field data. A config-level option for taps may be required to replace newline characters with another string, such as `';'` or `''` (empty string). Sometimes it is preferable to replace the newline character with `'\n'`, and then the dbt layer or another post-processing step may replace the string `'\n'` with an actual newline character. (See the sketch after this list.)
- Unsupported character encodings.
  - This occurs most often when dealing with cross-OS or cross-region interchange. To my knowledge, this is best mitigated by explicitly requiring UTF-8 encoding.
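
As a rough illustration of the newline-replacement idea above, here is a minimal sketch of what a tap-side writer might do. The `REPLACE_NEWLINES_WITH` value and the function name are hypothetical, not existing SDK settings; only the `csv` standard-library calls are real.

```python
import csv
import io

# Hypothetical config value (not an existing SDK setting): the string used to
# stand in for embedded newlines, reversible by dbt or another post-processing step.
REPLACE_NEWLINES_WITH = "\\n"  # literal backslash-n


def write_records_to_csv(records: list[dict], fieldnames: list[str]) -> str:
    """Write records to a CSV string, replacing embedded newlines in string values."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, dialect="excel")
    writer.writeheader()
    for record in records:
        sanitized = {
            key: value.replace("\r\n", REPLACE_NEWLINES_WITH).replace("\n", REPLACE_NEWLINES_WITH)
            if isinstance(value, str)
            else value
            for key, value in record.items()
        }
        writer.writerow(sanitized)
    return buffer.getvalue()


print(write_records_to_csv([{"id": 1, "notes": "line one\nline two"}], ["id", "notes"]))
```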
## Not re-inventing the wheel
As has been helpful in other contexts, I'll propose we start with an already existing spec for config.
## Details
According to ChatGPT, these are the most popular ways to process CSV files with Python:
- `csv.reader()`: This is part of Python's built-in `csv` module and provides a fast and efficient way to read CSV files. It returns an iterator that can be used to iterate over the rows in the CSV file. This is the most popular method of reading CSV files in Python.
- `pandas.read_csv()`: This is a popular method of reading CSV files in Python. It uses the `pandas` library, which provides high-performance, easy-to-use data structures and data analysis tools. It returns a `DataFrame` object that can be used to manipulate the data.
- `numpy.loadtxt()`: This is another popular method of reading CSV files in Python. It uses the `numpy` library, which provides a powerful array computing and linear algebra toolset. It returns an array object that can be used to manipulate the data.
- `csv.DictReader()`: This is similar to `csv.reader()`, but it returns a dictionary for each row, with the keys being the column headers and the values being the row data. This can be useful if you need to access specific columns of the data.
Essentially, this would point us towards one of three libraries: `csv`, `pandas`, or `numpy`. The `csv` module is the only one shipped in the standard library. The `pandas` and `numpy` libraries would each require an additional dependency (and each brings its own transitive dependencies).
If we prioritize serial and incremental reading/writing, I think `csv.reader()` or `csv.DictReader()` makes good sense, with `csv.reader()` likely having better performance, and `csv.DictReader()` having the advantage of pre-processing the dataset into `dict` objects, which we need to do anyway.
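
For reference, a minimal sketch of the serial, record-at-a-time pattern with `csv.DictReader` (the file name, delimiter, and quote character here are illustrative assumptions):

```python
import csv

# Serial, record-at-a-time reading: nothing is held in memory beyond the current row.
with open("customers.csv", newline="", encoding="utf-8") as handle:
    reader = csv.DictReader(handle, delimiter=",", quotechar='"')
    for row in reader:
        # Each row arrives as a dict keyed by the header row, which maps
        # naturally onto a record message payload.
        print(row)
```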
If we prioritize flexibility or configuration, the `pandas` config options may be more robust and expressive overall. The `pandas` library also has pre-built dialects for `Excel` and other common formats. What I don't know (personally) is whether `pandas`'s prioritization of in-memory analytics would introduce any penalty versus other serial, one-record-in-one-record-out methods, which may have a lower RAM footprint.
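
One way to bound the RAM concern, if we went the `pandas` route, would be chunked reading via the `chunksize` parameter of `pandas.read_csv`. A sketch (file name and chunk size are illustrative assumptions):

```python
import pandas as pd

# Chunked reading keeps memory bounded even for large files; each chunk is a DataFrame.
for chunk in pd.read_csv(
    "customers.csv",   # hypothetical input file
    sep=",",
    quotechar='"',
    encoding="utf-8",
    chunksize=10_000,  # rows per chunk
):
    for record in chunk.to_dict(orient="records"):
        print(record)
```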
Another disadvantage of using `pandas` is that we would then need to ship potentially all SDK taps and targets with the `pandas` library in order to have true universal interop.
UPDATE: Proposal using polars as config base here: #1584 (comment)
### Distinction of config dialect vs. actual implementation details
We can optionally decouple the configuration dialect from how the files actually get processed - although pragmatically speaking, it is easier and more stable if we keep these aligned, at least for the default SDK implementations.
It's also worth calling out that, whatever configuration dialect we choose, native processors will need a translation step. So, if we chose the pandas config dialect, for instance, which accepts an `"excel"` dialect, any source or target receiving this instruction would have to translate this config detail into some set of config parameters that the connected system can understand.
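
To make the translation step concrete, here is a hypothetical helper that maps a small, generic dialect config onto Snowflake CSV file-format options. The Snowflake option names (`TYPE`, `FIELD_DELIMITER`, `FIELD_OPTIONALLY_ENCLOSED_BY`, `SKIP_HEADER`, `ENCODING`) are real, but the input keys and the function itself are illustrative assumptions, not part of the SDK.

```python
# Hypothetical translation from a generic CSV dialect config to a Snowflake
# FILE_FORMAT clause. Input keys ("delimiter", "quote_char", ...) are illustrative.
def dialect_to_snowflake_file_format(dialect: dict) -> str:
    quote_char = dialect.get("quote_char", '"')
    options = {
        "TYPE": "CSV",
        "FIELD_DELIMITER": f"'{dialect.get('delimiter', ',')}'",
        "FIELD_OPTIONALLY_ENCLOSED_BY": f"'{quote_char}'",
        "SKIP_HEADER": "1" if dialect.get("header", True) else "0",
        "ENCODING": f"'{dialect.get('encoding', 'UTF-8')}'",
    }
    return "FILE_FORMAT = (" + " ".join(f"{k} = {v}" for k, v in options.items()) + ")"


print(dialect_to_snowflake_file_format({"delimiter": ";", "header": True}))
```

A dialect option with no Snowflake equivalent is exactly the case where the exception and failover behavior described below would kick in.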
### Raising an exception for unsupported dialect options
We would likely need to introduce a new exception class, used specifically for raising and catching cases where config options are unexpected or unhandled. Raising this exception might not fail the entire sync operation; rather, it could trigger the base-level SDK-driven processors...
### Graceful failover to SDK processing in case of unsupported dialects
Of course, if a connected system like Snowflake or Redshift cannot natively understand the dialect options that the user provides, we may have to send these processing directives to the SDK-backed processors. In this way, we can guarantee that any CSV file sent within our range of expected config options will be successfully processed, even if it cannot be understood natively by the source or target system.
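
A rough sketch of how the new exception class and the failover could fit together. The class and function names are hypothetical placeholders, and the native/SDK loaders are stubs standing in for a real bulk load and the SDK's row-by-row processing.

```python
# Hypothetical exception type; the name and usage are illustrative only.
class UnsupportedDialectOptionError(Exception):
    """Raised when a config option cannot be handled by the native processor."""


def load_natively(path: str, dialect: dict) -> None:
    # Stand-in for a native bulk load (e.g. COPY INTO); rejects unknown options.
    if dialect.get("delimiter", ",") not in (",", "\t", ";", "|"):
        raise UnsupportedDialectOptionError(f"Unsupported delimiter: {dialect['delimiter']!r}")
    print(f"Bulk-loaded {path} natively")


def load_with_sdk_processor(path: str, dialect: dict) -> None:
    # Stand-in for the slower, SDK-driven row-by-row fallback.
    print(f"Processed {path} with the SDK fallback")


def load_batch_file(path: str, dialect: dict) -> None:
    try:
        load_natively(path, dialect)
    except UnsupportedDialectOptionError:
        load_with_sdk_processor(path, dialect)


load_batch_file("orders.csv", {"delimiter": "\x1f"})  # falls back to SDK processing
```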
While this graceful failover is nice from a usability perspective, it may produce unexpected effects performance-wise. If the user expects a CSV to be natively processed by `target-snowflake`, for instance, but the specific dialect is not natively supported by Snowflake (or by the dev's Snowflake implementation), then records will be loaded much more slowly than the end-user might otherwise expect. This leads us to the next point, regarding an end-user option to fail when native processing is expected but not actually available...
### Config option to restrict to native batch-processing only
To mitigate the above, it might be worthwhile to introduce something like a `"batch_config": {"require_native_processor": true}` config that gives users the ability to fail a pipeline if the config dialect cannot be natively processed by the upstream or downstream system. While this is especially needed for CSV, which has an almost infinite cross-product of dialect options, it could also apply to `JSONL` and `parquet` parsing - wherein the tap or target could abort rather than go through the SDK-backed conversion of records to files, which will always be slower than a native bulk export or bulk import operation.