-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Add CsvExecBuilder for creating CsvExec
#11633
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This adds the `CsvExecBuilder` struct for building a `CsvExec` instance, and deprecates the `CsvExec::new` method which has grown too large. There are some `TODO`s related to the duplication of formatting options and their defaults coming from multiple places. Uses of the deprecated `new` method have not been updated yet.
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so nice -- thank you @connec
I left a response to the TODO in the code. We could also address it as a follow on PR as well
| use tokio::task::JoinSet; | ||
|
|
||
| /// Execution plan for scanning a CSV file | ||
| /// Execution plan for scanning a CSV file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
|
|
||
| impl CsvExec { | ||
| /// Create a new CSV reader execution plan provided base and specific configurations | ||
| #[deprecated(since = "41.0.0", note = "use `CsvExec::builder` or `CsvExecBuilder`")] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💯 for keeping (and deprecating the old API) to make migration / upgrade easier.
| false, | ||
| file_compression_type.to_owned(), | ||
| ); | ||
| let csv = CsvExec::builder(config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😍 that is so much nicer 👍
| pub fn new(file_scan_config: FileScanConfig) -> Self { | ||
| Self { | ||
| file_scan_config, | ||
| // TODO: these defaults are duplicated from `CsvOptions` - should they be computed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other thing we could do is to add a test that ensures the options are the same rather than having to compute them each time
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, a test that makes sure that the default settings in a CsvExec matches the default CsvOptions? I'm happy to add that. I likely won't get to it in the next ~24h so happy to either hold this PR or follow-up after.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, a test that makes sure that the default settings in a CsvExec matches the default CsvOptions? I'm happy to add that. I likely won't get to it in the next ~24h so happy to either hold this PR or follow-up after.
Yes, that is what I was thinking -- that way if the default CsvOptions are ever changed the test will fail until we also update the deafult values in CsvExecBuilder
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As in, a test that makes sure that the default settings in a CsvExec matches the default CsvOptions? I'm happy to add that. I likely won't get to it in the next ~24h so happy to either hold this PR or follow-up after.
Thanks @connec -- no worries! I actually had this PR open locally so I took the liberty of writing the test in 99d158d and pushing that to the branch
|
Thank you @alamb! There's one other datafusion/datafusion/core/src/datasource/physical_plan/csv.rs Lines 97 to 103 in debfd19
I imagine this may be at least somewhat desirable/intentional as part of ensuring crate boundaries? An alternative could be to, e.g., use |
| pub struct CsvExecBuilder { | ||
| file_scan_config: FileScanConfig, | ||
| file_compression_type: FileCompressionType, | ||
| // TODO: it seems like these format options could be reused across all the various CSV config |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could possibly use https://docs.rs/datafusion/latest/datafusion/config/struct.CsvOptions.html directly
That seems to be what the parquet writer does https://docs.rs/datafusion/latest/datafusion/datasource/physical_plan/parquet/struct.ParquetExec.html#method.table_parquet_options (aka the ParquetExec directly uses the options struct)
Perhaps as a follow on PR
|
Thanks again @connec |

Which issue does this PR close?
Closes #11565.
Rationale for this change
As part of #11533, the number of arguments required for
CsvExec::newgrew long enough (8) to trigger a clippy lint. In the interests of getting the PR merged we just suppressed the lint, with the intention of replacing it with a builder in a follow-up PR (👋).What changes are included in this PR?
This PR introduces a
CsvExecBuilder, along with plumbing similar to theParquetExecBuilder.CsvExec::newhas been marked as deprecated, and all usages of it in the repo have been replaced withCsvExec::builder.Are these changes tested?
A document test has been added to
CsvExecwhich exercises (some of) the builder public API. This is analogous to the doc test forParquetExec.Additionally,
CsvExec::newwas used itself in several tests and these have all been migrated to the builder.Are there any user-facing changes?
datafusion::datasource::physical_plan::CsvExecBuilder.datafusion::datasource::physical_plan::CsvExec::builder.datafusion::datasource::physical_plan::CsvExec::new.