Conversation

@bkietz
Member

@bkietz bkietz commented Apr 24, 2020

No description provided.

@bkietz bkietz requested a review from fsaintjacques April 24, 2020 19:11
@github-actions

@nealrichardson
Member

@github-actions rebase

@nealrichardson nealrichardson force-pushed the 7759-Add-CsvFileFormat-for-CSV branch from 7001e17 to d8efa2d Compare April 24, 2020 20:01
Contributor

@fsaintjacques fsaintjacques left a comment

This is looking great! It can parse parts of the nyc-tlc dataset; the remaining failures are schema-mismatch related, nothing due to the implementation.

I noted 2 missing points:

  • Some of the ReadOptions are important and can't be passed: skip_rows and block_size
  • There is no column projection done. This shouldn't be hard to implement since we have most of the machinery.
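Column projection here just means keeping only the requested fields once records are parsed. A minimal stdlib sketch of the idea (the helper name is hypothetical, not the actual Arrow machinery):

```python
import csv
import io

def project_columns(csv_text, include_columns):
    """Parse CSV text and keep only the requested columns, in the requested order."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: row[name] for name in include_columns} for row in reader]

# Each output row retains only the projected columns.
rows = project_columns("a,b,c\n1,2,3\n4,5,6\n", ["c", "a"])
```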

@bkietz bkietz force-pushed the 7759-Add-CsvFileFormat-for-CSV branch 2 times, most recently from e63856a to 698a613 Compare April 28, 2020 19:59
@nealrichardson
Member

@github-actions rebase

@github-actions github-actions bot force-pushed the 7759-Add-CsvFileFormat-for-CSV branch from 1a1ec4f to 24e93d3 Compare April 28, 2020 20:17
Member

@jorisvandenbossche jorisvandenbossche left a comment

Cool! Some general remarks:

Similar remark as when parquet options were added: do we plan to use the CsvFileFormat also for writing? (in the long term, since we don't have csv writing functionality right now). For Parquet there is a ReaderOptions that grouped all options related to reading in the ParquetFileFormat.

I also find it a bit confusing / inconsistent that some options are exposed directly while others use the same struct as in the csv module (ParseOptions).
But I assume this is because ParseOptions is used in its entirety, while for csv.ConvertOptions and csv.ReadOptions only part of the fields are valid to set here? (Another option would be to check that those fields are not set, and raise an error to the user if they are.)

Do we want to allow reading csv files without a header line with column names? With the csv module, you can do e.g. csv.read_csv("test.csv", read_options=csv.ReadOptions(column_names=["a", "b"])), but this is currently not possible with the dataset interface (you can specify a schema, but that will only work if matching names are in the file).
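The distinction can be illustrated with Python's built-in csv module (a stdlib sketch, not the pyarrow API): when the file has no header line, the caller must supply the column names.

```python
import csv
import io

# Hypothetical headerless file content: no "a,b" line, just data rows.
data = "1,2\n3,4\n"

# The caller supplies the column names instead of reading them from the file.
column_names = ["a", "b"]
rows = [dict(zip(column_names, record)) for record in csv.reader(io.StringIO(data))]
```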

Member

There might be good reasons for it, but this makes it different from e.g. Parquet, where you also get multithreaded reading when only reading a single file (at least with ToTable).

In addition, it could maybe still be exposed to the user?

Member Author

Multithreaded reading of a single CSV file without excessive contention in the many file case would require refactoring the CSV readers to expose their parallelism as tasks which could be productively scheduled alongside the ScanTasks. That's a significant change and should certainly wait for a follow up.

Even after we have such an API, the decision to use threading or not would be derived from scan parameters (see ScannerBuilder::UseThreads) and so it doesn't belong in the format.
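The division of responsibility described here can be sketched in a few lines (hypothetical Python classes, not the actual Arrow API): threading is a property of how the scan is executed, so the format object never carries it.

```python
class CsvFormat:
    """Format options: how bytes become records (delimiter, etc.)."""
    def __init__(self, delimiter=","):
        self.delimiter = delimiter

class ScannerBuilder:
    """Scan options: how the scan executes (threading, projection, ...)."""
    def __init__(self, fmt):
        self.fmt = fmt
        self.use_threads = False

    def UseThreads(self, flag):
        # The threading decision lives on the scan, not on the format.
        self.use_threads = flag
        return self

builder = ScannerBuilder(CsvFormat()).UseThreads(True)
```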

Member

For my education: the Parquet reader has such APIs, and thus this is not a problem there?

use threading or not would be derived from scan parameters (see ScannerBuilder::UseThreads) and so it doesn't belong in the format.

Yep, that's true

Member Author

Parquet as a format does not require chunking; it is already subdivided into rowgroups. We simply wrap each of those in scan tasks and in this way scan individual files in parallel.

There are several ways we could augment CSV to support something like this: exposing a task interface as I mentioned, including a sidecar file with cached chunk start offsets, using csv::Chunker directly when generating ScanTasks (so each ScanTask wraps a single independently parseable chunk), ...
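The last option above can be sketched as a stdlib toy (not the actual csv::Chunker): split the buffer so each chunk ends on a newline boundary, making it independently parseable and able to back its own ScanTask.

```python
def chunk_csv(buf, target_size):
    """Split a CSV byte buffer into chunks that end on newline boundaries.

    A toy version of the idea: each chunk is independently parseable
    (ignoring quoted newlines, which a real chunker must handle).
    """
    chunks, start = [], 0
    while start < len(buf):
        end = start + target_size
        if end >= len(buf):
            end = len(buf)
        else:
            # Extend to the next newline so no record straddles two chunks.
            nl = buf.find(b"\n", end)
            end = len(buf) if nl == -1 else nl + 1
        chunks.append(buf[start:end])
        start = end
    return chunks

chunks = chunk_csv(b"a,b\n1,2\n3,4\n5,6\n", 5)
```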

@bkietz
Member Author

bkietz commented Apr 29, 2020

@jorisvandenbossche

do we plan to use the CsvFileFormat also for writing? ... For Parquet there is a ReaderOptions that grouped all options related to reading in the ParquetFileFormat

We added ReaderOptions to avoid collisions with extant Parquet write options. Since we don't have CSV write options, avoiding collisions seems like speculative generality; this can be handled when/if we add CSV writing. Also note that (for example) ParseOptions::delimiter would also be used when writing.

I assume this is because the ParseOptions are used in its entirety, and for csv.ConvertOptions and csv.ReadOptions, only part of them is valid to set here? (another option could also be to check that those options are not set and if set raise an error to the user about that)

That is indeed why I used only ParseOptions whole. It seemed better to me than documenting which fields are ignored or emitting errors when things aren't defaulted.

Do we want to allow reading csv files without a header line with column names?

I can add column_names and autogenerate* too if required, but maybe that could wait for a follow-up? Since the way I handled {Convert,Read}Options is evidently not transparent, maybe everything except ParseOptions should wait as well?

@nealrichardson
Member

I'll fix the R failures in the next couple of hours (docs need revision).

@bkietz
Member Author

bkietz commented Apr 29, 2020

Reverted everything except parse_options. Follow up: https://issues.apache.org/jira/browse/ARROW-8631

@bkietz bkietz force-pushed the 7759-Add-CsvFileFormat-for-CSV branch from c717b1c to 58a1059 Compare April 30, 2020 00:41
@jorisvandenbossche
Member

Fine to leave the other options as a follow-up. We will want to add them at some point, of course. Another option could be to expose all keywords as direct options (also the ParseOptions keywords), to avoid the discrepancy between ParseOptions as a struct vs single options for the others.

@bkietz
Member Author

bkietz commented Apr 30, 2020

I'm happy to implement whatever configuration is agreeable. I'll add a list of the approaches which have been discussed here to the follow-up so we can discuss them there.

Contributor

@fsaintjacques fsaintjacques left a comment

LGTM. We'll have to review how to pass ConvertOptions in CsvFileFormat, because this may be the only way for some users to parse broken CSV; e.g. ignoring columns is probably the most important one.

if (scan_options != nullptr) {
  // This is set to true to match behavior with other formats; a missing column
  // will be materialized as null.
  options.include_missing_columns = true;
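The behavior this flag enables can be illustrated with a stdlib sketch (the helper is hypothetical, not the Arrow conversion path): a projected column absent from the file is materialized as all-null instead of raising an error.

```python
def materialize(parsed_columns, projected_names, num_rows):
    """Return the projected columns; a name absent from the file becomes all-null."""
    return {
        name: parsed_columns.get(name, [None] * num_rows)
        for name in projected_names
    }

# "missing" is not in the file's columns, so it comes back as nulls.
cols = materialize({"a": [1, 2]}, ["a", "missing"], num_rows=2)
```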
Contributor

I would expect the FilterAndProjectScanTask to fix this in a more efficient way.

Member Author

This is just to stop conversion from erroring if a column is projected but absent from the file. Another way to handle this would be: restrict include_columns to columns present in the file, then let the projector handle the rest as you say, but that would require knowledge of the file's columns, which we don't have at this stage.


static inline Result<std::shared_ptr<csv::StreamingReader>> OpenReader(
const FileSource& source, const CsvFileFormat& format,
const std::shared_ptr<ScanOptions>& options = nullptr,
Contributor

I'd take a ConvertOption instead of a ScanTask.

Member Author

Not sure what you mean

};

Result<bool> CsvFileFormat::IsSupported(const FileSource& source) const {
RETURN_NOT_OK(source.Open().status());
Contributor

Can't you let OpenReader fail?

Member Author

If source fails to open (for example if it points to a file which doesn't exist) that should raise an error rather than returning false, which is what this line detects.
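The distinction can be sketched like this (hypothetical Python, not the C++ API): an I/O failure while opening propagates as an error, while content that cannot be parsed yields False.

```python
import csv
import io

def is_supported(path):
    # Opening the file can itself fail (e.g. the path does not exist);
    # that error propagates rather than being reported as "not supported".
    with open(path, "rb") as f:
        head = f.read(1024)
    try:
        # Content that cannot be decoded or parsed means "not a CSV", i.e. False.
        next(csv.reader(io.StringIO(head.decode("utf-8"))))
        return True
    except (StopIteration, UnicodeDecodeError, csv.Error):
        return False
```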

Member

@nealrichardson nealrichardson left a comment

Thanks @bkietz !
