feat: 2061 create external table ddl table partition cols #2099
Conversation
```rust
};

/// CSV file read option
#[derive(Copy, Clone)]
```
Avro and Json read options do not use the Copy trait. Removing it allows the struct to hold a Vec<String>.
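(For context, a tiny illustration of the constraint: `Copy` requires every field to be `Copy`, and `Vec<String>` is not, so an options struct that carries partition columns can only derive `Clone`. Field names here are illustrative, not the exact struct.)

```rust
#[derive(Clone)] // deriving `Copy` here would fail to compile: Vec<String> is not Copy
pub struct CsvReadOptionsSketch {
    pub has_header: bool,
    pub delimiter: u8,
    pub table_partition_cols: Vec<String>, // owning field: forbids `Copy`
}
```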
```diff
  schema_infer_max_records: 1000,
  delimiter: b',',
- file_extension: ".csv",
+ file_extension: DEFAULT_CSV_EXTENSION,
```
Replaced the string literal with the same constant.
```rust
ListingOptions {
    format: Arc::new(file_format),
    collect_stat: true,
```
This `true` aligns with the old code in datafusion/src/logical_plan/builder.rs.
```rust
    }
}

fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
```
Similar to parse_columns. I didn't reuse parse_columns because partitions only have an identifier/name; there is no type or other options as columns have.
Sorry, I missed this comment yesterday 👍
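For readers following along, a sketch of what such a function might look like, modeled on parse_columns as described above. The surrounding DFParser type, its sqlparser-based `parser` field, and the `expected` error helper are assumed from context:

```rust
use sqlparser::parser::ParserError;
use sqlparser::tokenizer::Token;

impl<'a> DFParser<'a> {
    fn parse_partitions(&mut self) -> Result<Vec<String>, ParserError> {
        let mut partitions: Vec<String> = vec![];
        // An absent or empty list parses to no partition columns.
        if !self.parser.consume_token(&Token::LParen)
            || self.parser.consume_token(&Token::RParen)
        {
            return Ok(partitions);
        }
        loop {
            // Partitions only carry a name: no type, no column options.
            if let Token::Word(_) = self.parser.peek_token() {
                let identifier = self.parser.parse_identifier()?;
                partitions.push(identifier.value);
            } else {
                return self.expected("partition name", self.parser.peek_token());
            }
            let comma = self.parser.consume_token(&Token::Comma);
            if self.parser.consume_token(&Token::RParen) {
                break;
            } else if !comma {
                return self.expected(
                    "',' or ')' after partition definition",
                    self.parser.peek_token(),
                );
            }
        }
        Ok(partitions)
    }
}
```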
Force-pushed from 28f6a8b to 9492d17.
```rust
    table_partition_cols: vec![],
};
let listing_options = options
    .parquet_pruning(parquet_pruning)
```
This is for the following test's use: disabling parquet pruning has one use case, from #723.
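(For reference, the builder call visible in this hunk is what a test could use to disable pruning; a sketch assuming a Default impl exists:)

```rust
// Covers the use case from #723: turn off parquet pruning at read time.
let options = ParquetReadOptions::default().parquet_pruning(false);
```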
…Table-DDL-table_partition_cols
# Conflicts:
#	ballista/rust/client/src/context.rs
#	datafusion/src/execution/context.rs
Will merge master or rebase, because of the whole folder re-org in #2081.
…Table-DDL-table_partition_cols
```diff
  pub schema_infer_max_records: usize,
  /// File extension; only files with this extension are selected for data input.
- /// Defaults to ".csv".
+ /// Defaults to DEFAULT_CSV_EXTENSION.
```
👍
```rust
}

/// Helper to convert these user facing options to `ListingTable` options
pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {
```
👍
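A sketch of how such a helper might be assembled for CSV, pieced together from the hunks in this PR; the CsvFormat builder method names and the exact ListingOptions fields are assumptions:

```rust
pub fn to_listing_options(&self, target_partitions: usize) -> ListingOptions {
    let file_format = CsvFormat::default()
        .with_has_header(self.has_header)
        .with_delimiter(self.delimiter)
        .with_schema_infer_max_rec(Some(self.schema_infer_max_records));

    ListingOptions {
        format: Arc::new(file_format),
        collect_stat: true, // matches the old builder behaviour noted above
        file_extension: self.file_extension.to_owned(),
        target_partitions,
        table_partition_cols: self.table_partition_cols.clone(),
    }
}
```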
```rust
    target_partitions: usize,
    table_name: impl Into<String>,
) -> Result<Self> {
    // TODO remove hard coded enable_pruning
```
❤️
```rust
        return Ok(partitions);
    }

    loop {
```
Given this code to parse a comma-separated list is duplicated in parse_columns below, perhaps we could refactor it into a common function to reduce the duplication -- not needed for this PR though.
Yes, it may be possible. Or maybe we could expose this as a public method in the sqlparser crate? (Agreed, not needed for this PR.)

This is a copy of the equivalent implementation in sqlparser.
```sql
LOCATION '/path/to/aggregate_test_100.csv';
```

If data sources are already partitioned in Hive style, `PARTITIONED BY` can be used for partition pruning.
❤️
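For example, a hypothetical Hive-partitioned layout (`/data/sales/year=2022/month=01/...`) could be declared like this; table, column, and path names are illustrative:

```sql
CREATE EXTERNAL TABLE sales (
    amount DOUBLE
)
STORED AS CSV
PARTITIONED BY (year, month)
LOCATION '/data/sales';
```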
Force-pushed from b97bf56 to e8993eb.
Thanks again @jychen7
Which issue does this PR close?

Closes #2061.

Rationale for this change

Support partition pruning using the CreateExternalTable DDL.

What changes are included in this PR?

It may be easier to review by commits, since the first commit adds the functionality and the other commits just modify a few dependent files/tests that use `read_parquet` and `register_parquet` with the default `ParquetReadOptions`.

Are there any user-facing changes?

No change if the user is using SQL.

However, the params of `context.read_parquet` and `context.register_parquet` change: they now receive a `ParquetReadOptions`, which carries `table_partition_cols`. This is more aligned with the other file formats, since Csv, Avro, and Json already have their own ReadOptions.
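A minimal sketch of the new call shape; the field names come from the hunks above, while the Default impl and the exact `register_parquet` signature are assumptions:

```rust
let mut options = ParquetReadOptions::default();
options.table_partition_cols = vec!["year".to_string(), "month".to_string()];
// `register_parquet` now takes the options struct instead of bare params.
ctx.register_parquet("sales", "/data/sales", options).await?;
```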
Manual Test in datafusion-cli

Create a tmp sample file:
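(A sketch of laying out a Hive-style partitioned sample; paths and values are illustrative:)

```bash
mkdir -p /tmp/sales/year=2021 /tmp/sales/year=2022
echo "100.0" > /tmp/sales/year=2021/data.csv
echo "200.0" > /tmp/sales/year=2022/data.csv
```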
Run in datafusion-cli:
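(A sketch of the DDL and a query that can prune on the partition column, matching the sample layout above:)

```sql
CREATE EXTERNAL TABLE sales (amount DOUBLE)
STORED AS CSV
PARTITIONED BY (year)
LOCATION '/tmp/sales';

-- only files under year=2022 need to be scanned
SELECT * FROM sales WHERE year = '2022';
```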