-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove ListingTable single_file option #8604
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @devinjdangelo -- it is great to remove the old options that are no longer necessary.
@@ -797,7 +787,7 @@ impl TableProvider for ListingTable { | |||
// all of the data at the input is sink after execution finishes. See discussion for rationale: | |||
// https://github.com/apache/arrow-datafusion/pull/7610#issuecomment-1728979918 | |||
unbounded_input: is_plan_streaming(&input)?, | |||
single_file_output: self.options.single_file, | |||
single_file_output: false, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this got me thinking we can remove the option from the FileSInkConfig itself.
However I see how that this PR is removing the single_file
option from the read side (the ListingTable) but it can also potentially be important for the write side (aka a COPY statement)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, COPY still uses single_file_output option of the FileSinkConfig struct. We could potentially update COPY to instead rely on inference based on the path rather than an explicit option. E.g. copy table to file.parquet vs copy table to folder/. Then single_file_output could be removed entirely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 that certainly would be a better UX in my opinion (figuring out to use SINGLE_FILE in https://github.com/apache/arrow-datafusion/blob/98a5a4eb1ea1277f5fe001e1c7602b37592452f1/datafusion/sqllogictest/test_files/repartition_scan.slt#L35-L38 was actually quite painful for me)
Shall I propose a ticket to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #8621 to track removing SINGLE_FILE_OUTPUT completely
.unwrap_or(false); | ||
|
||
// Backwards compatibility (#8547) | ||
// Backwards compatibility (#8547), discard deprecated options |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
I merged up from main to resolve a conflict in this PR |
Is there any way to specify the number of output files? |
Yes, datafusion.execution.minimum_parallel_output_files and datafusion.execution.soft_max_rows_per_output_file control the number of files output (when not writing hive partitions, otherwise 1 file is written per partition). These are session level configs and are currently not configurable per table or per COPY statement, but they could be added as statement level options. |
Which issue does this PR close?
closes #8548
Rationale for this change
See issue and linked discussion
What changes are included in this PR?
Removes ListingTable single_file option and removes references in the documentation (also cleaned out some references in the docs to other removed options).
The single_file option is discarded if present to maintain backwards compatibility.
Are these changes tested?
Yes by existing tests
Are there any user-facing changes?
No (support for inserting to single_file listing table was already removed, this PR just cleans up docs and removes unneeded code).