Conversation

@bkietz
Member

@bkietz bkietz commented Apr 24, 2020

No description provided.

@bkietz bkietz requested a review from fsaintjacques April 24, 2020 19:11
@github-actions

@nealrichardson
Member

@github-actions rebase

@nealrichardson nealrichardson force-pushed the 7759-Add-CsvFileFormat-for-CSV branch from 7001e17 to d8efa2d Compare April 24, 2020 20:01
Contributor

@fsaintjacques fsaintjacques left a comment

This is looking great! It can parse parts of the nyc-tlc dataset; the remaining failures are schema-mismatch related, nothing due to the implementation.

I noted 2 missing points:

  • Some of the ReadOptions are important and can't be passed: skip_rows and block_size
  • There is no column projection done. This shouldn't be hard to implement since we have most of the machinery.
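Column projection here just means keeping only the requested fields once records are parsed. A minimal stdlib sketch of the idea (the helper name is hypothetical, not the actual Arrow machinery):

```python
import csv
import io

def project_columns(csv_text, include_columns):
    """Parse CSV text and keep only the requested columns, in the requested order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row[name] for name in include_columns} for row in reader]

# Each output row retains only the projected columns.
rows = project_columns("a,b,c\n1,2,3\n4,5,6\n", ["c", "a"])
```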

@bkietz bkietz force-pushed the 7759-Add-CsvFileFormat-for-CSV branch 2 times, most recently from e63856a to 698a613 Compare April 28, 2020 19:59
@nealrichardson
Member

@github-actions rebase

@github-actions github-actions bot force-pushed the 7759-Add-CsvFileFormat-for-CSV branch from 1a1ec4f to 24e93d3 Compare April 28, 2020 20:17
Member

@jorisvandenbossche jorisvandenbossche left a comment

Cool! Some general remarks:

Similar remark as when parquet options were added: do we plan to use the CsvFileFormat also for writing? (in the long term, since we don't have csv writing functionality right now). For Parquet there is a ReaderOptions that grouped all options related to reading in the ParquetFileFormat.

I also find it a bit confusing / inconsistent that some options are exposed directly while others use the same struct as in the csv module (ParseOptions).
But I assume this is because ParseOptions is used in its entirety, while for csv.ConvertOptions and csv.ReadOptions only part of the fields are valid to set here? (Another option would be to check that those fields are not set, and raise an error to the user if they are.)

Do we want to allow reading csv files without a header line with column names? With the csv module, you can do e.g. csv.read_csv("test.csv", read_options=csv.ReadOptions(column_names=["a", "b"])), but this is currently not possible with the dataset interface (you can specify a schema, but that will only work if matching names are in the file).
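The distinction can be illustrated with Python's built-in csv module (a stdlib sketch, not the pyarrow API): when the file has no header line, the caller must supply the column names.

```python
import csv
import io

# Hypothetical headerless file content: no "a,b" line, just data rows.
data = "1,2\n3,4\n"

# The caller supplies the column names instead of reading them from the file.
column_names = ["a", "b"]
rows = [dict(zip(column_names, record)) for record in csv.reader(io.StringIO(data))]
```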

Member

There might be good reasons for it, but this makes it different from e.g. Parquet, where you also get multithreaded reading when only reading a single file (at least with ToTable).

In addition, it could maybe still be exposed to the user?

Member Author

Multithreaded reading of a single CSV file without excessive contention in the many file case would require refactoring the CSV readers to expose their parallelism as tasks which could be productively scheduled alongside the ScanTasks. That's a significant change and should certainly wait for a follow up.

Even after we have such an API, the decision to use threading or not would be derived from scan parameters (see ScannerBuilder::UseThreads) and so it doesn't belong in the format.
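The division of responsibility described here can be sketched in a few lines (hypothetical Python classes, not the actual Arrow API): threading is a property of how the scan is executed, so the format object never carries it.

```python
class CsvFormat:
    """Format options: how bytes become records (delimiter, etc.)."""
    def __init__(self, delimiter=","):
        self.delimiter = delimiter

class ScannerBuilder:
    """Scan options: how the scan executes (threading, projection, ...)."""
    def __init__(self, fmt):
        self.fmt = fmt
        self.use_threads = False

    def UseThreads(self, flag):
        # The threading decision lives on the scan, not on the format.
        self.use_threads = flag
        return self

builder = ScannerBuilder(CsvFormat()).UseThreads(True)
```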

Member

For my education: the Parquet reader has such APIs, and thus this is not a problem there?

use threading or not would be derived from scan parameters (see ScannerBuilder::UseThreads) and so it doesn't belong in the format.

Yep, that's true

Member Author

Parquet as a format does not require chunking; it is already subdivided into rowgroups. We simply wrap each of those in scan tasks and in this way scan individual files in parallel.

There are several ways we could augment CSV to support something like this: exposing a task interface as I mentioned, including a sidecar file with cached chunk start offsets, using csv::Chunker directly when generating ScanTasks (so each ScanTask wraps a single independently parseable chunk), ...
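The last option above can be sketched as a stdlib toy (not the actual csv::Chunker): split the buffer so each chunk ends on a newline boundary, making it independently parseable and able to back its own ScanTask.

```python
def chunk_csv(buf, target_size):
    """Split a CSV byte buffer into chunks that end on newline boundaries.

    A toy version of the idea: each chunk is independently parseable
    (ignoring quoted newlines, which a real chunker must handle).
    """
    chunks, start = [], 0
    while start < len(buf):
        end = start + target_size
        if end >= len(buf):
            end = len(buf)
        else:
            # Extend to the next newline so no record straddles two chunks.
            nl = buf.find(b"\n", end)
            end = len(buf) if nl == -1 else nl + 1
        chunks.append(buf[start:end])
        start = end
    return chunks

chunks = chunk_csv(b"a,b\n1,2\n3,4\n5,6\n", 5)
```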

@bkietz
Member Author

bkietz commented Apr 29, 2020

@jorisvandenbossche

do we plan to use the CsvFileFormat also for writing? ... For Parquet there is a ReaderOptions that grouped all options related to reading in the ParquetFileFormat

We added ReaderOptions to avoid collisions with extant Parquet write options. Since we don't have CSV write options, avoiding collisions seems like speculative generality; this can be handled when/if we add CSV writing. Also note that (for example) ParseOptions::delimiter would also be used when writing.

I assume this is because the ParseOptions are used in its entirety, and for csv.ConvertOptions and csv.ReadOptions, only part of them is valid to set here? (another option could also be to check that those options are not set and if set raise an error to the user about that)

That is indeed why I used only ParseOptions whole. It seemed better to me than documenting which fields are ignored or emitting errors when things aren't defaulted.

Do we want to allow reading csv files without a header line with column names?

I can add column_names and autogenerate* too if required, but maybe that could wait for a follow-up? Since the way I handled {Convert,Read}Options is evidently not transparent, maybe everything except ParseOptions should wait as well?

@nealrichardson
Member

I'll fix the R failures in the next couple of hours (docs need revision).

@bkietz
Member Author

bkietz commented Apr 29, 2020

Reverted everything except parse_options. Follow up: https://issues.apache.org/jira/browse/ARROW-8631

@bkietz bkietz force-pushed the 7759-Add-CsvFileFormat-for-CSV branch from c717b1c to 58a1059 Compare April 30, 2020 00:41
@jorisvandenbossche
Member

Fine to leave the other options as a follow-up. We will want to add them at some point, of course. Another option could be to expose all keywords as direct options (also the ParseOptions keywords), to avoid the discrepancy between ParseOptions as a struct vs single options for the others.

@bkietz
Member Author

bkietz commented Apr 30, 2020

I'm happy to implement whatever configuration is agreeable. I'll add a list of the approaches which have been discussed here to the follow-up so we can discuss them there.

Contributor

@fsaintjacques fsaintjacques left a comment

LGTM. We'll have to review how to pass ConvertOptions in CsvFileFormat, because this may be the only way for some users to parse broken CSV; e.g. ignoring columns is probably the most important one.

if (scan_options != nullptr) {
  // This is set to true to match behavior with other formats; a missing column
  // will be materialized as null.
  options.include_missing_columns = true;
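The behavior this flag enables can be illustrated with a stdlib sketch (the helper is hypothetical, not the Arrow conversion path): a projected column absent from the file is materialized as all-null instead of raising an error.

```python
def materialize(parsed_columns, projected_names, num_rows):
    """Return the projected columns; a name absent from the file becomes all-null."""
    return {
        name: parsed_columns.get(name, [None] * num_rows)
        for name in projected_names
    }

# "missing" is not in the file's columns, so it comes back as nulls.
cols = materialize({"a": [1, 2]}, ["a", "missing"], num_rows=2)
```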
Contributor

I would expect the FilterAndProjectScanTask to fix this in a more efficient way.

Member Author

This is just to stop conversion from erroring if a column is projected but absent from the file. Another way to handle this would be: restrict include_columns to columns present in the file, then let the projector handle the rest as you say, but that would require knowledge of the file's columns, which we don't have at this stage.


static inline Result<std::shared_ptr<csv::StreamingReader>> OpenReader(
const FileSource& source, const CsvFileFormat& format,
const std::shared_ptr<ScanOptions>& options = nullptr,
Contributor

I'd take a ConvertOption instead of a ScanTask.

Member Author

Not sure what you mean

};

Result<bool> CsvFileFormat::IsSupported(const FileSource& source) const {
RETURN_NOT_OK(source.Open().status());
Contributor

Can't you let OpenReader fail?

Member Author

If source fails to open (for example if it points to a file which doesn't exist) that should raise an error rather than returning false, which is what this line detects.
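The distinction can be sketched like this (hypothetical Python, not the C++ API): an I/O failure while opening propagates as an error, while content that cannot be parsed yields False.

```python
import csv
import io

def is_supported(path):
    # Opening the file can itself fail (e.g. the path does not exist);
    # that error propagates rather than being reported as "not supported".
    with open(path, "rb") as f:
        head = f.read(1024)
    try:
        # Content that cannot be decoded or parsed means "not a CSV", i.e. False.
        next(csv.reader(io.StringIO(head.decode("utf-8"))))
        return True
    except (StopIteration, UnicodeDecodeError, csv.Error):
        return False
```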

Member

@nealrichardson nealrichardson left a comment

Thanks @bkietz !
