Description
Is your feature request related to a problem or challenge?
I would like to make it as easy as possible to use datafusion-cli to query files on S3.
For example, after #16299 is merged, I would like to be able to read from the ClickBench example datasets:
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
However, when I run this I get the following error:
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
Object Store error: Generic S3 error: Error performing HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet in 499.73175ms - Received redirect without LOCATION, this normally indicates an incorrectly configured region
This does give me the hint that the region is incorrectly configured, which is good. However, it doesn't tell me which region I need.
If I provide the correct region (eu-central-1), it works great:
> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet' OPTIONS ('aws.region' 'eu-central-1');
0 row(s) fetched.
Elapsed 1.182 seconds.
> select count(*) from hits;
+----------+
| count(*) |
+----------+
| 1000000 |
+----------+
1 row(s) fetched.
Elapsed 0.780 seconds.
I noticed that DuckDB and ClickHouse do not require the region to be set:
v1.2.2 7c039464e4
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from read_parquet('s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet');
┌────────────────┐
│ count_star() │
│ int64 │
├────────────────┤
│ 1000000 │
│ (1.00 million) │
└────────────────┘
Describe the solution you'd like
I would like datafusion-cli to automatically find the region as well.
I did some investigation, and the correct region is returned via a response header, which you can see with:
curl -v -X HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
...
...
> HEAD /clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet HTTP/1.1
> Host: s3.us-east-1.amazonaws.com
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 301 Moved Permanently
< x-amz-bucket-region: eu-central-1
< x-amz-request-id: Q44G0APVQH5JHHC4
< x-amz-id-2: cubLiiba/Q138g5SbNNlSoGtARMxobuq7GhA+3t39il+Wj50HNPBUh4bOGVS2Bwlc6k4f0lp6r0=
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Fri, 06 Jun 2025 14:19:57 GMT
< Server: AmazonS3
Note the x-amz-bucket-region header in the response:
< x-amz-bucket-region: eu-central-1
I suspect this will need some change upstream in the object_store crate, and I will work on filing an upstream ticket now.
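To make the idea concrete, here is a rough sketch of the region-discovery step, assuming the client can see the status code and headers of the failed HEAD request. The helper names `bucket_region_from_headers` and `region_to_retry` are hypothetical and are not part of object_store or datafusion-cli. The only facts taken from the output above are the 301 status and the x-amz-bucket-region header.

```rust
/// Header S3 uses to report the bucket's region (seen in the curl output above).
const BUCKET_REGION_HEADER: &str = "x-amz-bucket-region";

/// Hypothetical helper: find the bucket region in a list of response headers.
/// Header names are compared case-insensitively, since HTTP header names are
/// not case-sensitive.
fn bucket_region_from_headers(headers: &[(&str, &str)]) -> Option<String> {
    headers
        .iter()
        .find(|(name, _)| name.eq_ignore_ascii_case(BUCKET_REGION_HEADER))
        .map(|(_, value)| value.trim().to_string())
        .filter(|region| !region.is_empty())
}

/// Hypothetical helper: decide which region to retry with after a redirect.
/// Returns None when the status is not 301, when the header is missing, or
/// when the reported region matches the one already configured.
fn region_to_retry(status: u16, headers: &[(&str, &str)], configured: &str) -> Option<String> {
    if status != 301 {
        return None;
    }
    bucket_region_from_headers(headers).filter(|region| region.as_str() != configured)
}
```

With the response shown above, `region_to_retry(301, &headers, "us-east-1")` would yield eu-central-1, and the caller could then rebuild the S3 client with that region and repeat the request. How this would be wired into object_store is left open here.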
Describe alternatives you've considered
No response
Additional context
Upstream ticket