
Local Files / Directories

Files can be queried directly by enclosing a file name, directory name, or remote location in single quotes ('), as shown in the examples below.

Create a CSV file to query.

$ echo "a,b" > data.csv
$ echo "1,2" >> data.csv

Query that single file (the CLI also supports Parquet, compressed CSV, Avro, JSON, and more).

$ datafusion-cli
DataFusion CLI v17.0.0
> select * from 'data.csv';
+---+---+
| a | b |
+---+---+
| 1 | 2 |
+---+---+
1 row in set. Query took 0.007 seconds.
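
The same quoting syntax works for the other formats. For example, assuming a file named data.parquet exists in the current directory (the file name here is illustrative), it can be queried directly:

> select * from 'data.parquet';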

You can also query directories of files with compatible schemas:
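
To reproduce the example below, create a directory holding two compatible CSV files (a minimal setup sketch; the contents match the output shown):

$ mkdir data_dir
$ cp data.csv data_dir/
$ echo "a,b" > data_dir/data2.csv
$ echo "3,4" >> data_dir/data2.csv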

$ ls data_dir/
data.csv   data2.csv
$ datafusion-cli
DataFusion CLI v16.0.0
> select * from 'data_dir';
+---+---+
| a | b |
+---+---+
| 3 | 4 |
| 1 | 2 |
+---+---+
2 rows in set. Query took 0.007 seconds.

Remote Files / Directories

You can also directly query any remote location supported by DataFusion without first registering it as a table. For example, to read from a remote Parquet file via HTTP(S), you can use the following:

select count(*) from 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';
+----------+
| COUNT(*) |
+----------+
| 1000000  |
+----------+
1 row in set. Query took 0.595 seconds.

To read from AWS S3 or GCS, use s3 or gs as the protocol prefix. For example, to read a file in an S3 bucket named my-data-bucket, use the URL s3://my-data-bucket and set the relevant access credentials as environment variables (e.g. for AWS S3 you need at least AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY).
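
For example (placeholder credentials):

$ export AWS_ACCESS_KEY_ID=******
$ export AWS_SECRET_ACCESS_KEY=******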

select count(*) from 's3://my-data-bucket/athena_partitioned/hits.parquet';

See the CREATE EXTERNAL TABLE section for additional configuration options.

CREATE EXTERNAL TABLE

It is also possible to create a table backed by files or remote locations via CREATE EXTERNAL TABLE, as shown below. Note that wildcards (e.g. *) are also supported.

For example, to create a table hits backed by a local Parquet file, use:

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits.parquet';

To create a table hits backed by a remote Parquet file via HTTP(S), use:

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';

In both cases, hits can now be queried as a regular table:

select count(*) from hits;
+----------+
| COUNT(*) |
+----------+
| 1000000  |
+----------+
1 row in set. Query took 0.344 seconds.

Formats

Parquet

The schema information for Parquet files will be derived automatically.

Register a single-file Parquet data source.

CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/tripdata.parquet';

Register a single folder Parquet data source. Note: all files inside must be valid Parquet files and have compatible schemas.

CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/';

Register a single folder Parquet data source by specifying a wildcard for the files to read.

CREATE EXTERNAL TABLE taxi
STORED AS PARQUET
LOCATION '/mnt/nyctaxi/*.parquet';
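
Once registered, taxi can be queried like any other table (a quick sketch; the underlying NYC taxi data is an assumption from the paths above):

SELECT count(*) FROM taxi;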

CSV

DataFusion will infer the CSV schema automatically, or you can provide it explicitly.

Register a single-file CSV data source with a header row.

CREATE EXTERNAL TABLE test
STORED AS CSV
LOCATION '/path/to/aggregate_test_100.csv'
OPTIONS ('has_header' 'true');

Register a single-file CSV data source with an explicitly defined schema.

CREATE EXTERNAL TABLE test (
    c1  VARCHAR NOT NULL,
    c2  INT NOT NULL,
    c3  SMALLINT NOT NULL,
    c4  SMALLINT NOT NULL,
    c5  INT NOT NULL,
    c6  BIGINT NOT NULL,
    c7  SMALLINT NOT NULL,
    c8  INT NOT NULL,
    c9  BIGINT NOT NULL,
    c10 VARCHAR NOT NULL,
    c11 FLOAT NOT NULL,
    c12 DOUBLE NOT NULL,
    c13 VARCHAR NOT NULL
)
STORED AS CSV
LOCATION '/path/to/aggregate_test_100.csv';
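
To check which schema was registered, whether inferred or declared, the table can be inspected from the CLI (a quick verification sketch):

DESCRIBE test;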

Locations

HTTP(S)

To read from a remote Parquet file via HTTP(S), you can use the following:

CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'https://datasets.clickhouse.com/hits_compatible/athena_partitioned/hits_1.parquet';

S3

AWS S3 data sources must have connection credentials configured.

To create an external table from a file in an S3 bucket:

CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'aws.access_key_id' '******',
    'aws.secret_access_key' '******',
    'aws.region' 'us-east-2'
)
LOCATION 's3://bucket/path/file.parquet';

It is also possible to specify the access information using environment variables:

$ export AWS_DEFAULT_REGION=us-east-2
$ export AWS_SECRET_ACCESS_KEY=******
$ export AWS_ACCESS_KEY_ID=******

$ datafusion-cli
DataFusion CLI v21.0.0
> create external table test stored as parquet location 's3://bucket/path/file.parquet';
0 rows in set. Query took 0.374 seconds.
> select * from test;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1        | 2        |
+----------+----------+
1 row in set. Query took 0.171 seconds.

Supported configuration options are:

| Environment Variable                    | Configuration Option  | Description                                            |
| --------------------------------------- | --------------------- | ------------------------------------------------------ |
| AWS_ACCESS_KEY_ID                       | aws.access_key_id     |                                                        |
| AWS_SECRET_ACCESS_KEY                   | aws.secret_access_key |                                                        |
| AWS_DEFAULT_REGION                      | aws.region            |                                                        |
| AWS_ENDPOINT                            | aws.endpoint          |                                                        |
| AWS_SESSION_TOKEN                       | aws.token             |                                                        |
| AWS_CONTAINER_CREDENTIALS_RELATIVE_URI  |                       | See IAM Roles                                          |
| AWS_ALLOW_HTTP                          |                       | set to "true" to permit HTTP connections without TLS   |
| AWS_PROFILE                             |                       | support for using a named profile to supply credentials |
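
These settings also allow pointing at an S3-compatible object store. A minimal sketch, assuming a service at localhost:9000 (the endpoint, bucket, path, and credentials are placeholders; AWS_ALLOW_HTTP permits the plain-HTTP connection):

$ export AWS_ALLOW_HTTP=true

CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'aws.access_key_id' '******',
    'aws.secret_access_key' '******',
    'aws.endpoint' 'http://localhost:9000'
)
LOCATION 's3://bucket/path/file.parquet';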

OSS

Alibaba Cloud OSS data sources must have connection credentials configured.

CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'aws.access_key_id' '******',
    'aws.secret_access_key' '******',
    'aws.oss.endpoint' 'https://bucket.oss-cn-hangzhou.aliyuncs.com'
)
LOCATION 'oss://bucket/path/file.parquet';

The supported OPTIONS are:

  • access_key_id
  • secret_access_key
  • endpoint

Note that the OSS endpoint must use the format https://{bucket}.{oss-region-endpoint}.

COS

Tencent Cloud COS data sources must have connection credentials configured.

CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'aws.access_key_id' '******',
    'aws.secret_access_key' '******',
    'aws.cos.endpoint' 'https://cos.ap-singapore.myqcloud.com'
)
LOCATION 'cos://bucket/path/file.parquet';

The supported OPTIONS are:

  • access_key_id
  • secret_access_key
  • endpoint

Note that the COS endpoint URL must use the format https://cos.{cos-region-endpoint}.

GCS

Google Cloud Storage data sources must have connection credentials configured.

For example, to create an external table from a file in a GCS bucket:

CREATE EXTERNAL TABLE test
STORED AS PARQUET
OPTIONS(
    'gcp.service_account_path' '/tmp/gcs.json'
)
LOCATION 'gs://bucket/path/file.parquet';

It is also possible to specify the access information using environment variables:

$ export GOOGLE_SERVICE_ACCOUNT=/tmp/gcs.json

$ datafusion-cli
DataFusion CLI v21.0.0
> create external table test stored as parquet location 'gs://bucket/path/file.parquet';
0 rows in set. Query took 0.374 seconds.
> select * from test;
+----------+----------+
| column_1 | column_2 |
+----------+----------+
| 1        | 2        |
+----------+----------+
1 row in set. Query took 0.171 seconds.

Supported configuration options are:

| Environment Variable           | Configuration Option             | Description                              |
| ------------------------------ | -------------------------------- | ---------------------------------------- |
| GOOGLE_SERVICE_ACCOUNT         | gcp.service_account_path         | location of service account file         |
| GOOGLE_SERVICE_ACCOUNT_PATH    | gcp.service_account_path (alias) | location of service account file         |
| SERVICE_ACCOUNT                | gcp.service_account_path (alias) | location of service account file         |
| GOOGLE_SERVICE_ACCOUNT_KEY     | gcp.service_account_key          | JSON serialized service account key      |
| GOOGLE_APPLICATION_CREDENTIALS | gcp.application_credentials_path | location of application credentials file |
| GOOGLE_BUCKET                  |                                  | bucket name                              |
| GOOGLE_BUCKET_NAME             | (alias)                          | bucket name                              |