
Conversation

@lidavidm (Member)

This adds ReadOptions to CsvFileFormat and exposes ReadOptions, ConvertOptions, and CsvFragmentScanOptions to Python.

ReadOptions was added to CsvFileFormat as its options can affect the discovered schema. For the block size, which does not need to be global, a field was added to CsvFragmentScanOptions.
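
For illustration, a rough sketch of how the Python side might be used once this lands (the path, column names, and exact keyword spellings here are placeholders/assumptions, not the final API):

```python
import pyarrow.dataset as ds
from pyarrow import csv

# Options that can affect the discovered schema (e.g. explicit column names)
# are attached to the format itself.
fmt = ds.CsvFileFormat(
    read_options=csv.ReadOptions(column_names=["a", "b", "c"]),
)
dataset = ds.dataset("data/", format=fmt)  # "data/" is a placeholder path

# Options that only affect a particular scan (e.g. value conversion, block size)
# would go through CsvFragmentScanOptions instead.
scan_options = ds.CsvFragmentScanOptions(
    convert_options=csv.ConvertOptions(null_values=["NA"], strings_can_be_null=True),
)
table = dataset.to_table(fragment_scan_options=scan_options)
```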

@lidavidm (Member Author) left a comment

This also needs to support R.

@jorisvandenbossche (Member) commented Mar 18, 2021

Wouldn't just ScanOptions instead of FragmentScanOptions be more descriptive? (I find the "fragment" in it a bit confusing.) It's not that this can be set per fragment; it's the same for all fragments in one scan.

@jorisvandenbossche (Member)

Ah, but we already have a ScanOptions of course ;)
Then maybe FormatScanOptions, since they are format-specific options?

@lidavidm (Member Author)

That's fair, but that does collide with ScanOptions itself, unless you mean just the naming of the builder method?

@lidavidm (Member Author)

Ah, I see…hmm, but if we had a hypothetical FlightFragment, we'd still want to have scan options specific to that fragment, right?

@jorisvandenbossche (Member)

That would be "FlightScanOptions" then?

@lidavidm (Member Author)

Yeah, I think the slight niggle I have there is that there wouldn't be a corresponding 'Flight(File)Format'. Maybe 'PerScanOptions'? But it's not a big deal and FormatScanOptions is OK with me too.

@jorisvandenbossche (Member)

Ah, OK, I see now that the "Fragment" in the name was meant to mean the general "fragment type", while I interpreted it as the "single fragment".
Anyway, this is more a nitpicky remark, not too important ;)

@jorisvandenbossche (Member)

This will be a really nice improvement for reading CSV files with the datasets API. I just answered (maybe a bit prematurely) a Stack Overflow question that needs this, about CSV files without header rows: https://stackoverflow.com/questions/66585648/reading-partitioned-datasets-stored-as-csv-with-pyarrow-dataset/66655815
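
For reference, the headerless-CSV case from that question would look roughly like this with the options added here (the file layout and column names are made up for illustration):

```python
import pyarrow.dataset as ds
from pyarrow import csv

# The files have no header row, so supply the column names up front;
# since this affects the discovered schema, it belongs on the format.
fmt = ds.CsvFileFormat(
    read_options=csv.ReadOptions(column_names=["year", "month", "value"])
)
dataset = ds.dataset("partitioned_csvs/", format=fmt, partitioning="hive")
table = dataset.to_table()
```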

@lidavidm (Member Author)

Looks like the JNI/datasets bindings do build once CI is kicked again.

@nealrichardson (Member)

The R user experience here is not good; I'm happy to improve it in a followup, but I'm not sure how feasible that will be. I'm not sure I understand why FragmentScanOptions is separate from CsvFileFormat--it seems that all of the options I provide there are csv-specific. The issue is that I want to declare those (null_values, etc.) up front, along with all of the other parsing instructions for the files (column_names, etc.). If we decide that those are two objects, ok, but that means that open_dataset() needs to assemble the Dataset (via DatasetFactory, creating a CsvFileFormat along the way) and FragmentScanOptions, and then somehow attach the FragmentScanOptions to the R Dataset object and carry it around until a scan is initialized on the dataset. (R users never make a ScannerBuilder themselves; it's all wrapped in higher-level functions.) Maybe that's not a problem, but it's not something we've had to do elsewhere.

@lidavidm (Member Author)

The motivation was to support more advanced users who might want to scan the same files repeatedly with different options. But that is a niche use case, and the common case is a bit confusing. Logically, the separation is roughly between 'things that would change the schema or format' (e.g. the separator, or rows to skip) and 'everything else' (e.g. the set of null values), but this isn't obvious to a user who probably just wants to specify all their options together.

Maybe the respective scan options could be inlined or embedded into the file format to provide defaults, which could then be overridden if a user wants to do something more complex? That would be some boilerplate, but it would make things easier.
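
A rough sketch of what that could look like from Python, assuming format-level options act as defaults that a per-scan CsvFragmentScanOptions can override (keyword names are assumptions):

```python
import pyarrow.dataset as ds
from pyarrow import csv

# Declare the "usual" conversion options once, as defaults on the format.
fmt = ds.CsvFileFormat(
    convert_options=csv.ConvertOptions(null_values=["NA"], strings_can_be_null=True)
)
dataset = ds.dataset("data/", format=fmt)

# The common case: scans just use the defaults baked into the format.
table = dataset.to_table()

# The advanced case: override the defaults for one particular scan.
override = ds.CsvFragmentScanOptions(
    convert_options=csv.ConvertOptions(null_values=["NA", "null", "."])
)
table2 = dataset.to_table(fragment_scan_options=override)
```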

@nealrichardson (Member)

Maybe the respective scan options could be inlined or embedded into the file format to provide defaults?

Yeah, I think that would be nice. I don't really understand the use case of scanning the same files with different parsing options, unless I'm trying to figure out what the "right" options are. To me, things like null_values are not scan-time preferences; they're properties that describe what's in the files, so I want to declare them up front and shouldn't need to adjust them later.

Is there a reason one would need to scan the same dataset with different parsing options, rather than create a new dataset with the options specified up front? I wonder whether the extra complexity in accepting them also at scan time is worth it if there's a simple solution like that.

@lidavidm (Member Author)

@bkietz Is it okay to share your doc, or quote from it? I think it has the necessary context.

@lidavidm (Member Author) commented Mar 19, 2021

@nealrichardson let me know if the API in the latest commit is closer to what you'd expect; if there aren't objections I can clean it up and implement the missing bits (and maybe see if I can make csv_file_format_parse_options handle keywords for convert and parse options).

(test-dataset.R to save you the trouble of scrolling)

@nealrichardson (Member)

Ultimately what I'm looking for is open_dataset(path, format = "csv", column_names = c(..), null_values = c("NA", "null", "."), strings_can_be_null = TRUE) and have all of those args go through to the right place. The average R user shouldn't have to call FileFormat$create() directly. I think the extra args are already passed to FileFormat$create() inside the dataset factory constructor, so you're right, csv_file_format_parse_options() probably just needs to be reworked. I'm happy to take that on (or rope @jonkeane or @ianmcook into it) if you want to make a followup JIRA, or feel free to push it ahead yourself, you've done a great job getting it to here!

@lidavidm force-pushed the arrow-8631 branch 2 times, most recently from d0bd5f0 to c595a40 on March 19, 2021 at 22:52
@lidavidm (Member Author)

To share the context for this refactor, Ben has this doc: https://docs.google.com/document/d/1LzlDnnmKGCkD9RWGXyMQDHwf14Ad9K4ojn9AafkGFSg/edit?usp=sharing

This is something we need to do; dask and other advanced parquet consumers need ridiculously sophisticated hooks for scanning (let alone writing). For example:

- whether to populate statistics (for reading into a single table with no filter there is no point in converting statistics to expressions)
- whether they should be accumulated or cached (the cudf folks wanted to copy the unparsed metadata buffers to the GPU)
- conversion details (dict_columns might be interactively decided when a string column is discovered to have few distinct values)
- I/O minutiae (stream block size/buffering/chunking/... might be decided after a scan starts taking too long)
- ...

Depending on what everyone thinks, I may revisit the implementation, but yes, let's try to present a convenient API for R/Python users, and have a nice feature to announce for 4.0.

@nealrichardson (Member) left a comment

Thanks for working on this! Just a couple of final notes.

Comment on lines +314 to +317
@nealrichardson (Member)

I think we'll want to accept scan_options in collect() since that's the way that users typically do a scan, but that can be done in a followup.

@lidavidm (Member Author)

Filed ARROW-12059.

@lidavidm (Member Author)

This should be ready again (minus what seems to be a Travis queue buildup)

@bkietz (Member) left a comment

LGTM
