datafusion-cli: Use correct S3 region if it is not specified#16502
datafusion-cli: Use correct S3 region if it is not specified#16502alamb merged 2 commits intoapache:mainfrom
Conversation
fa796a8 to
1800fc4
Compare
1800fc4 to
126b9e1
Compare
|
Hi @alamb, this PR is ready for review! One concern here is whether it is acceptable to test against a bucket outside of datafusion, but I couldn’t find a good way to test region retry against local minio. |
|
Thanks @liamzwbao -- I'll put this one on my review queue for tomorrow |
alamb
left a comment
There was a problem hiding this comment.
Thank you @liamzwbao -- I tried this out and it works great! Also very nice tests.
DataFusion CLI v48.0.0
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
0 row(s) fetched.
Elapsed 1.737 seconds.
...I think the code to execute statements is getting a bit complicated however. I wonder if there is any way we can encapsulate it more (I left a comment)
Maybe we can do the refactoring as a folllow on PR
datafusion-cli/src/exec.rs
Outdated
| AdjustedPrintOptions::new(print_options.clone()).with_statement(&statement); | ||
|
|
||
| let plan = create_plan(ctx, statement).await?; | ||
| // Only clone the statement if it's a CreateExternalTable |
There was a problem hiding this comment.
This logic is pretty specific to S3 and create_table -- inlining it here obscures the rest of the control flow that is happening.
Is there any way we can encapsulate these special cases somewhere?
Perhaps something like
struct StatementExecutor {
statement: DFStatement
...
// should we automatically look up the region of a failed s3 request?
retry_region: bool,
}
impl StatementExecutor {
async fn execute(&mut self) {}
}Then this loop could look something like
StatementExecutor::new(ctx, print_options).await?;
}
?
There was a problem hiding this comment.
I think then we would have a place to put retry-region
There was a problem hiding this comment.
Sure, I also feel like putting the logic here hurts the readability.
I searched in the code and looks like it's probably ok to always clone the statement here, so I simplify the logic in the exec_and_print to make it a bit clearer.
I can revert to the previous version if you think that one is better.
Also created an issue #16559 to encapsulate the logic in a followup PR
| use url::Url; | ||
|
|
||
| #[cfg(not(test))] | ||
| use object_store::aws::resolve_bucket_region; |
34b37f4 to
4c29697
Compare
|
Thank you @liamzwbao |
…16502) * datafusion-cli: Use correct S3 region if it is not specified * Simplify the retry logic
Which issue does this PR close?
datafusion-cli: Use correct S3 region if it is not specified #16306.Rationale for this change
Improve UX by resolving s3 region automatically.
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
Yes, the CLI can automatically resolve the region for bucket now