-
Couldn't load subscription status.
- Fork 3.9k
ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python #9725
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This also needs to support R.
cpp/src/arrow/dataset/scanner.h
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just ScanOptions instead of FragmentScanOptions might be more descriptive? (I find the "fragment" in it a bit confusing) Because it's not that this can be set for each fragment. It's the same for all fragments for one scan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, but we already have a ScanOptions of course ;)
Then maybe FormatScanOptions? Since it are format-specific options?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fair, but that does collide with ScanOptions itself, unless you mean just the naming of the builder method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I see…hmm, but if we had a hypothetical FlightFragment, we'd still want to have scan options specific to that fragment, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That would be "FlightScanOptios" then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think the slight niggle I have there is that there wouldn't be a corresponding 'Flight(File)Format'. Maybe 'PerScanOptions'? But it's not a big deal and FormatScanOptions is OK with me too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, OK, I see now that the "Fragment" in the name was meant to mean the general "fragment type", while I interpreted it as the "single fragment".
Anyway, this is more a nitpicky remark, not too important ;)
|
This will be a really nice improvement for reading CSV files with the datasets API. I just answered (maybe a bit prematurely) an SO question that needs this (about CSV files without header rows, https://stackoverflow.com/questions/66585648/reading-partitioned-datasets-stored-as-csv-with-pyarrow-dataset/66655815) |
|
Looks like the JNI/datasets bindings build once it's kicked again. |
|
The R user experience here is not good; I'm happy to improve it in a followup, but I'm not sure how feasible that will be. I'm not sure I understand why |
|
The motivation was to support more advanced users who might want to scan the same files repeatedly with different options. But that is a niche use case and the common case is a bit confusing. Logically, the separation is roughly between 'things that would change the schema or format', e.g. the separator, or rows to skip, and 'everything else', e.g. the set of null values - but this isn't obvious to a user who probably just wants to specify all their options together. Maybe the respective scan options could be inlined or embedded into the file format to provide defaults? Which could then be overridden if a user wants to do something more complex. That would be some boilerplate, but would make things easier. |
Yeah I think that would be nice. I don't understand well the use case of scanning the same files with different parsing options unless I'm trying to figure out what the "right" options are. To me, things like Is there a reason one would need to scan the same dataset with different parsing options, rather than create a new dataset with the options specified up front? I wonder whether the extra complexity in accepting them also at scan time is worth it if there's a simple solution like that. |
|
@bkietz Is it okay to share your doc, or quote from it? I think it has the necessary context. |
|
@nealrichardson let me know if the API in the latest commit is closer to what you'd expect; if there aren't objections I can clean it up and implement the missing bits (and maybe see if I can make (test-dataset.R to save you the trouble of scrolling) |
|
Ultimately what I'm looking for is |
d0bd5f0 to
c595a40
Compare
|
To share the context for this refactor, Ben has this doc: https://docs.google.com/document/d/1LzlDnnmKGCkD9RWGXyMQDHwf14Ad9K4ojn9AafkGFSg/edit?usp=sharing
Depending on what everyone thinks, I may revisit the implementation, but yes, let's try to present a convenient API for R/Python users, and have a nice feature to announce for 4.0. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this! Just a couple of final notes.
r/tests/testthat/test-dataset.R
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we'll want to accept scan_options in collect() since that's the way that users typically do a scan, but that can be done in a followup.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed ARROW-12059.
Co-authored-by: Neal Richardson <neal.p.richardson@gmail.com>
|
This should be ready again (minus what seems to be a Travis queue buildup) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
This adds ReadOptions to CsvFileFormat and exposes ReadOptions, ConvertOptions, and CsvFragmentScanOptions to Python.
ReadOptions was added to CsvFileFormat as its options can affect the discovered schema. For the block size, which does not need to be global, a field was added to CsvFragmentScanOptions.