ARROW-7759: [C++][Dataset] Add CsvFileFormat #7033
Conversation
@github-actions rebase
Force-pushed from 7001e17 to d8efa2d
This is looking great! It can parse parts of the nyc-tlc dataset; the remaining failures are schema-mismatch related, nothing due to the implementation.
I noted two missing points:
- Some of the ReadOptions are important and can't currently be passed: skip rows and block size.
- There is no column projection done. This shouldn't be hard to implement since we have most of the machinery.
Force-pushed from e63856a to 698a613
@github-actions rebase
Force-pushed from 1a1ec4f to 24e93d3
Cool! Some general remarks:
Similar remark as when the Parquet options were added: do we plan to use CsvFileFormat for writing as well? (In the long term, since we don't have CSV writing functionality right now.) For Parquet there is a ReaderOptions struct that groups all reading-related options in ParquetFileFormat.
I also find it a bit confusing / inconsistent that some options are exposed directly while others use the same struct as the csv module (ParseOptions). But I assume this is because ParseOptions is used in its entirety, whereas for csv.ConvertOptions and csv.ReadOptions only part of the fields would be valid to set here? (Another option would be to check that those options are not set, and raise an error to the user if they are.)
Do we want to allow reading CSV files without a header line by supplying column names? With the csv module you can do e.g. csv.read_csv("test.csv", read_options=csv.ReadOptions(column_names=["a", "b"])), but this is currently not possible with the dataset interface (you can specify a schema, but that will only work if matching names are in the file).
cpp/src/arrow/dataset/file_csv.cc
Outdated
There might be good reasons for it, but this makes it different from e.g. Parquet, where you also get multithreaded reading when reading only a single file (at least with ToTable).
In addition, it could maybe still be exposed to the user?
Multithreaded reading of a single CSV file without excessive contention in the many file case would require refactoring the CSV readers to expose their parallelism as tasks which could be productively scheduled alongside the ScanTasks. That's a significant change and should certainly wait for a follow up.
Even after we have such an API, the decision to use threading or not would be derived from scan parameters (see ScannerBuilder::UseThreads) and so it doesn't belong in the format.
For my education: the parquet reader has such APIs, and thus there this is not a problem?
> use threading or not would be derived from scan parameters (see ScannerBuilder::UseThreads) and so it doesn't belong in the format.
Yep, that's true
Parquet as a format does not require chunking; it is already subdivided into rowgroups. We simply wrap each of those in scan tasks and in this way scan individual files in parallel.
There are several ways we could augment CSV to support something like this: exposing a task interface as I mentioned, including a sidecar file with cached chunk start offsets, using csv::Chunker directly when generating ScanTasks (so each ScanTask wraps a single independently parseable chunk), ...
We added ReaderOptions to avoid collisions with extant parquet write options. Since we don't have CSV write options, avoiding collisions seems like speculative generality; this can be handled when/if we add CSV writing. Also note that (for example)
That is indeed why I used only ParseOptions whole. It seemed better to me than documenting which fields are ignored or emitting errors when things aren't defaulted.
I can add
I'll fix the R failures in the next couple of hours (docs need revision).
Reverted everything except parse_options. Follow up: https://issues.apache.org/jira/browse/ARROW-8631
Co-Authored-By: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Force-pushed from c717b1c to 58a1059
Fine to leave the other options as a follow-up. We will want to add them at some point, of course. Another option could also be to expose all keywords as direct options (also the ParseOptions keywords), to avoid the discrepancy between ParseOptions as a struct vs single options for the others.
I'm happy to implement whatever configuration is agreeable. I'll add a list of the approaches which have been discussed here to the follow-up so we can discuss them there.
LGTM. We'll have to review how to pass ConvertOptions in CsvFileFormat, because this may be the only option for some users to parse broken CSV; e.g. ignoring columns is probably the most important one.
    if (scan_options != nullptr) {
      // This is set to true to match behavior with other formats; a missing column
      // will be materialized as null.
      options.include_missing_columns = true;
I would expect the FilterAndProjectScanTask to fix this in a more efficient way.
This is just to stop conversion from erroring if a column is projected but absent from the file. Another way to handle this would be to restrict include_columns to columns present in the file, then let the projector handle the rest as you say, but that would require knowledge of the file's columns, which we don't have at this stage.
    static inline Result<std::shared_ptr<csv::StreamingReader>> OpenReader(
        const FileSource& source, const CsvFileFormat& format,
        const std::shared_ptr<ScanOptions>& options = nullptr,
I'd take a ConvertOption instead of a ScanTask.
Not sure what you mean
    Result<bool> CsvFileFormat::IsSupported(const FileSource& source) const {
      RETURN_NOT_OK(source.Open().status());
Can't you let OpenReader fail?
If source fails to open (for example if it points to a file which doesn't exist) that should raise an error rather than returning false, which is what this line detects.
Thanks @bkietz !