
datafusion-cli: Use correct S3 region if it is not specified #16306

Open
@alamb

Description


Is your feature request related to a problem or challenge?

I would like to make it as easy as possible to use datafusion-cli to query files on S3.

For example, after #16299 is merged, I would like to be able to read from the ClickBench example datasets:

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';

However, when I run this I get the following error:

> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet';
Object Store error: Generic S3 error: Error performing HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet in 499.73175ms - Received redirect without LOCATION, this normally indicates an incorrectly configured region

This does give me a hint that the region is incorrectly configured, which is good; however, it doesn't tell me which region I need.

If I provide the correct region (eu-central-1) it works great:

> CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 's3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet' OPTIONS ('aws.region' 'eu-central-1');
0 row(s) fetched.
Elapsed 1.182 seconds.

> select count(*) from hits;
+----------+
| count(*) |
+----------+
| 1000000  |
+----------+
1 row(s) fetched.
Elapsed 0.780 seconds.

I noticed that DuckDB and ClickHouse do not require the region to be set:

v1.2.2 7c039464e4
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D select count(*) from read_parquet('s3://clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet');
┌────────────────┐
│  count_star()  │
│     int64      │
├────────────────┤
│    1000000     │
│ (1.00 million) │
└────────────────┘

Describe the solution you'd like

I would like datafusion-cli to find the correct region automatically as well.

I did some investigation and found that the correct region is returned in a response header, which you can see via:

curl -v -X HEAD https://s3.us-east-1.amazonaws.com/clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet
...
...
> HEAD /clickhouse-public-datasets/hits_compatible/athena_partitioned/hits_1.parquet HTTP/1.1
> Host: s3.us-east-1.amazonaws.com
> User-Agent: curl/8.7.1
> Accept: */*
>
* Request completely sent off
< HTTP/1.1 301 Moved Permanently
< x-amz-bucket-region: eu-central-1
< x-amz-request-id: Q44G0APVQH5JHHC4
< x-amz-id-2: cubLiiba/Q138g5SbNNlSoGtARMxobuq7GhA+3t39il+Wj50HNPBUh4bOGVS2Bwlc6k4f0lp6r0=
< Content-Type: application/xml
< Transfer-Encoding: chunked
< Date: Fri, 06 Jun 2025 14:19:57 GMT
< Server: AmazonS3

Note the x-amz-bucket-region in the response:

< x-amz-bucket-region: eu-central-1

I suspect this will need some changes upstream in the object_store crate; I will work on filing an upstream ticket now.

Describe alternatives you've considered

No response

Additional context

Upstream ticket

Labels

enhancement (New feature or request)