Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove ListingTable single_file option #8604

Merged
merged 3 commits into from
Dec 22, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
12 changes: 1 addition & 11 deletions datafusion/core/src/datasource/listing/table.rs
Original file line number Diff line number Diff line change
Expand Up @@ -246,9 +246,6 @@ pub struct ListingOptions {
/// multiple equivalent orderings, the outer `Vec` will have a
/// single element.
pub file_sort_order: Vec<Vec<Expr>>,
/// This setting when true indicates that the table is backed by a single file.
/// Any inserts to the table may only append to this existing file.
pub single_file: bool,
/// This setting holds file format specific options which should be used
/// when inserting into this table.
pub file_type_write_options: Option<FileTypeWriterOptions>,
Expand All @@ -269,7 +266,6 @@ impl ListingOptions {
collect_stat: true,
target_partitions: 1,
file_sort_order: vec![],
single_file: false,
file_type_write_options: None,
}
}
Expand Down Expand Up @@ -421,12 +417,6 @@ impl ListingOptions {
self
}

/// Configure if this table is backed by a sigle file
pub fn with_single_file(mut self, single_file: bool) -> Self {
self.single_file = single_file;
self
}

/// Configure file format specific writing options.
pub fn with_write_options(
mut self,
Expand Down Expand Up @@ -790,7 +780,7 @@ impl TableProvider for ListingTable {
file_groups,
output_schema: self.schema(),
table_partition_cols: self.options.table_partition_cols.clone(),
single_file_output: self.options.single_file,
single_file_output: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this got me thinking we can remove the option from the FileSInkConfig itself.

However I see how that this PR is removing the single_file option from the read side (the ListingTable) but it can also potentially be important for the write side (aka a COPY statement)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, COPY still uses single_file_output option of the FileSinkConfig struct. We could potentially update COPY to instead rely on inference based on the path rather than an explicit option. E.g. copy table to file.parquet vs copy table to folder/. Then single_file_output could be removed entirely.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 that certainly would be a better UX in my opinion (figuring out to use SINGLE_FILE in https://github.com/apache/arrow-datafusion/blob/98a5a4eb1ea1277f5fe001e1c7602b37592452f1/datafusion/sqllogictest/test_files/repartition_scan.slt#L35-L38 was actually quite painful for me)

Shall I propose a ticket to do it?

Copy link
Contributor

@alamb alamb Dec 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #8621 to track removing SINGLE_FILE_OUTPUT completely

overwrite,
file_type_writer_options,
};
Expand Down
9 changes: 2 additions & 7 deletions datafusion/core/src/datasource/listing_table_factory.rs
Original file line number Diff line number Diff line change
Expand Up @@ -135,12 +135,8 @@ impl TableProviderFactory for ListingTableFactory {

let mut statement_options = StatementOptions::from(&cmd.options);

// Extract ListingTable specific options if present or set default
let single_file = statement_options
.take_bool_option("single_file")?
.unwrap_or(false);

// Backwards compatibility (#8547)
// Backwards compatibility (#8547), discard deprecated options
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

statement_options.take_bool_option("single_file")?;
if let Some(s) = statement_options.take_str_option("insert_mode") {
if !s.eq_ignore_ascii_case("append_new_files") {
return plan_err!("Unknown or unsupported insert mode {s}. Only append_new_files supported");
Expand Down Expand Up @@ -195,7 +191,6 @@ impl TableProviderFactory for ListingTableFactory {
.with_target_partitions(state.config().target_partitions())
.with_table_partition_cols(table_partition_cols)
.with_file_sort_order(cmd.order_exprs.clone())
.with_single_file(single_file)
.with_write_options(file_type_writer_options);

let resolved_schema = match provided_schema {
Expand Down
15 changes: 3 additions & 12 deletions docs/source/user-guide/sql/write_options.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,12 +42,11 @@ WITH HEADER ROW
DELIMITER ';'
LOCATION '/test/location/my_csv_table/'
OPTIONS(
CREATE_LOCAL_PATH 'true',
NULL_VALUE 'NAN'
);
```

When running `INSERT INTO my_table ...`, the options from the `CREATE TABLE` will be respected (gzip compression, special delimiter, and header row included). Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the `OPTIONS` setting will be ignored. CREATE_LOCAL_PATH is a special option that indicates if DataFusion should create local file paths when writing new files if they do not already exist. This option is useful if you wish to create an external table from scratch, using only DataFusion SQL statements. Finally, NULL_VALUE is a CSV format specific option that determines how null values should be encoded within the CSV file.
When running `INSERT INTO my_table ...`, the options from the `CREATE TABLE` will be respected (gzip compression, special delimiter, and header row included). Note that compression, header, and delimiter settings can also be specified within the `OPTIONS` tuple list. Dedicated syntax within the SQL statement always takes precedence over arbitrary option tuples, so if both are specified the `OPTIONS` setting will be ignored. NULL_VALUE is a CSV format specific option that determines how null values should be encoded within the CSV file.

Finally, options can be passed when running a `COPY` command.

Expand All @@ -70,17 +69,9 @@ In this example, we write the entirety of `source_table` out to a folder of parq
The following special options are specific to the `COPY` command.

| Option | Description | Default Value |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- |
| ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------- | --- |
| SINGLE_FILE_OUTPUT | If true, COPY query will write output to a single file. Otherwise, multiple files will be written to a directory in parallel. | true |
| FORMAT | Specifies the file format COPY query will write out. If single_file_output is false or the format cannot be inferred from the file extension, then FORMAT must be specified. | N/A |

### CREATE EXTERNAL TABLE Specific Options

The following special options are specific to creating an external table.

| Option | Description | Default Value |
| ----------- | --------------------------------------------------------------------------------------------------------------------- | ------------- |
| SINGLE_FILE | If true, indicates that this external table is backed by a single file. INSERT INTO queries will append to this file. | false |
| FORMAT | Specifies the file format COPY query will write out. If single_file_output is false or the format cannot be inferred from the file extension, then FORMAT must be specified. | N/A | |

### JSON Format Specific Options

Expand Down