From 660810a39851378d3dc345f55a329679d098047d Mon Sep 17 00:00:00 2001 From: Michael Hackett Date: Wed, 31 Aug 2022 09:40:27 -0600 Subject: [PATCH] Mass port READMEs to point to Amazon Athena Documentation --- athena-aws-cmdb/README.md | 66 +------- athena-cloudera-hive/README.md | 177 +------------------- athena-cloudera-impala/README.md | 177 +------------------- athena-cloudwatch-metrics/README.md | 85 +--------- athena-cloudwatch/README.md | 74 +-------- athena-docdb/README.md | 107 +----------- athena-dynamodb/README.md | 109 +------------ athena-elasticsearch/README.md | 245 +--------------------------- athena-google-bigquery/README.md | 62 +------ athena-hbase/README.md | 99 +---------- athena-hortonworks-hive/README.md | 176 +------------------- athena-mysql/README.md | 175 +------------------- athena-neptune/README.md | 30 +--- athena-oracle/README.md | 176 +------------------- athena-postgresql/README.md | 179 +------------------- athena-redis/README.md | 89 +--------- athena-redshift/README.md | 168 +------------------ athena-saphana/README.md | 194 +--------------------- athena-snowflake/README.md | 233 +------------------------- athena-sqlserver/README.md | 179 +------------------- athena-synapse/README.md | 176 +------------------- athena-teradata/README.md | 193 +--------------------- athena-timestream/README.md | 74 +-------- athena-tpcds/README.md | 135 +-------------- athena-vertica/README.md | 147 +---------------- 25 files changed, 25 insertions(+), 3500 deletions(-) diff --git a/athena-aws-cmdb/README.md b/athena-aws-cmdb/README.md index fd32b1121a..d1e6a30d3f 100644 --- a/athena-aws-cmdb/README.md +++ b/athena-aws-cmdb/README.md @@ -2,68 +2,4 @@ This connector enables Amazon Athena to communicate with various AWS Services, making your AWS Resource inventory accessible via SQL. -**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.** - -## Usage - -### Parameters - -The Athena AWS CMDB Connector provides several configuration options via Lambda environment variables. More detail on the available parameters can be found below. - -1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda’s limits, this is the bucket that the data will be written to for Athena to read the excess from. (e.g. my_bucket) -2. **spill_prefix** - (Optional) Defaults to sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/Hours. -3. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html -4. **kms_key_id** - (Optional) By default any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation for a stronger source of encryption keys. 
(e.g. a7e63k4b-8loc-40db-a2a1-4d0en2cd8331)
-5. **disable_spill_encryption** - (Optional) Defaults to False so that any data that is spilled to S3 is encrypted using AES-GCM either with a randomly generated key or using KMS to generate keys. Setting this to True will disable spill encryption. You may wish to disable this for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption. (e.g. True or False)
-6. **default_ec2_image_owner** - (Optional) When set, this controls the default EC2 image (aka AMI) owner used to filter AMIs. When this isn't set and your query against the ec2 images table does not include a filter for owner, you will get a large number of results since the response will include all public images.
-
-### Databases & Tables
-
-The Athena AWS CMDB Connector makes the following databases and tables available for querying your AWS Resource Inventory. For more information on the columns available in each table, try running a 'describe database.table' from the Athena Console or API.
-
-1. **ec2** - This database contains EC2 related resources, including:
-   * **ebs_volumes** - Contains details of your EBS volumes.
-   * **ec2_instances** - Contains details of your EC2 Instances.
-   * **ec2_images** - Contains details of your EC2 Instance images.
-   * **routing_tables** - Contains details of your VPC Routing Tables.
-   * **security_groups** - Contains details of your Security Groups.
-   * **subnets** - Contains details of your VPC Subnets.
-   * **vpcs** - Contains details of your VPCs.
-2. **emr** - This database contains EMR related resources, including:
-   * **emr_clusters** - Contains details of your EMR Clusters.
-3. **rds** - This database contains RDS related resources, including:
-   * **rds_instances** - Contains details of your RDS Instances.
-4. **s3** - This database contains S3 related resources, including:
-   * **buckets** - Contains details of your S3 buckets.
-   * **objects** - Contains details of your S3 Objects (excludes their contents).
-
-### Required Permissions
-
-Review the "Policies" section of the athena-aws-cmdb.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.
-
-1. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
-1. EC2 Describe - The connector uses this access to describe your EC2 Instances, Security Groups, VPCs, EBS Volumes, etc...
-1. EMR Describe / List - The connector uses this access to describe your EMR Clusters.
-1. RDS Describe - The connector uses this access to describe your RDS Instances.
-1. S3 List - The connector uses this access to list your buckets and objects.
-1. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.
-
-### Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below or by using the more detailed tutorial in the athena-example module:
-
-1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
-2. From the athena-aws-cmdb dir, run `mvn clean install`.
-3. From the athena-aws-cmdb dir, run `../tools/publish.sh S3_BUCKET_NAME athena-aws-cmdb` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with the appropriate permissions to deploy instances of the connector via the 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-4. Try running a query like the one below in Athena:
-```sql
-select * from "lambda:".ec2.ec2_instances limit 100
-```
-
-## Performance
-
-The Athena AWS CMDB Connector does not currently support parallel scans. Predicate Pushdown is performed within the Lambda function and, where possible, partial predicates are pushed to the services being queried. For example, a query for the details of a specific EC2 Instance will turn into a targeted describe of that specific instance id against the EC2 API.
-
-## License
-
-This project is licensed under the Apache-2.0 License.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-cmdb.html).
diff --git a/athena-cloudera-hive/README.md b/athena-cloudera-hive/README.md
index ed89ce7d12..bc35938a8b 100644
--- a/athena-cloudera-hive/README.md
+++ b/athena-cloudera-hive/README.md
@@ -2,179 +2,4 @@
 
 This connector enables Amazon Athena to access your Cloudera Hive databases.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check the documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/LICENSE.txt).
-
-# Terms
-
-* **Database Instance:** Any instance of a database deployed on premises, on EC2, or using RDS.
-* **Handler:** A Lambda handler accessing your database instance(s). Could be a metadata or a record handler.
-* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s).
-* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s).
-* **Composite Handler:** A Lambda handler that retrieves both metadata and data records from your database instance(s). We recommend setting this as the Lambda function handler.
-* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-# Usage
-
-## Parameters
-
-The Cloudera Hive Connector supports several configuration parameters set via Lambda environment variables.
-
-### Connection String:
-
-A JDBC Connection string is used to connect to a database instance. The following format is supported: `hive://${jdbc_connection_string}`.
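-
-For example, a complete connection string for a hypothetical host `hive1host` listening on port 10000, with credentials pulled from an AWS Secrets Manager secret (the host, port, and secret name here are illustrative placeholders, matching the multiplexer examples below), might look like:
-
-```
-hive://jdbc:hive2://hive1host:10000/default?${Test/RDS/hive1host}
-```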
-
-### Multiplexing handler parameters
-
-The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed depending on catalog name. Use the following classes in Lambda when using the multiplexer.
-
-|Handler|Class|
-|--- |--- |
-|Composite Handler|HiveMuxCompositeHandler|
-|Metadata Handler|HiveMuxMetadataHandler|
-|Record Handler|HiveMuxRecordHandler|
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string    Database instance connection string, in the format specified above. Required.
-                                Example: If the catalog as registered with Athena is myhivecatalog, then the environment variable name should be myhivecatalog_connection_string
-
-default                         Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for a Hive Mux Lambda function that supports two database instances, hive1host (default) and hive2host:
-
-|Property|Value|
-|---|---|
-|default|hive://jdbc:hive2://hive1host:10000/default?${Test/RDS/hive1host}|
-| | |
-|hive2_catalog1_connection_string|hive://jdbc:hive2://hive1host:10000/default?${Test/RDS/hive1host}|
-| | |
-|hive2_catalog2_connection_string|hive://jdbc:hive2://hive2host:10000/default?UID=sample&PWD=sample|
-
-The Cloudera Hive Connector supports substitution of any string enclosed like *${SecretName}* with the *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-hive://jdbc:hive2://hive1host:10000/default?...&${Test/RDS/hive1host}&...
-```
-
-will be modified to:
-
-```
-hive://jdbc:hive2://hive1host:10000/default?...&UID=sample2&PWD=sample2&...
-```
-
-The Secret Name `Test/RDS/hive1host` will be used to retrieve secrets.
-
-Currently Cloudera Hive recognizes the `UID` and `PWD` JDBC properties.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single Cloudera Hive instance.
-```
-Composite Handler    HiveCompositeHandler
-Metadata Handler     HiveMetadataHandler
-Record Handler       HiveRecordHandler
-```
-
-**Parameters:**
-
-```
-default    Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must provide the `default` connection string parameter. All other connection strings are ignored.
-
-**Example property for a single Cloudera Hive instance supported by a Lambda function:**
-
-|Property|Value|
-|---|---|
-|default|hive://jdbc:hive2://hive1host:10000/default?secret=${Test/RDS/hive1host}|
-
-### Spill parameters:
-
-The Lambda SDK may spill data to S3. All database instances accessed using a single Lambda function spill to the same location.
-
-```
-spill_bucket                Spill bucket name. Required.
-spill_prefix                Spill bucket key prefix. Required.
-spill_put_request_headers   JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-```
-
-# Data types support
-
-|JDBC|Cloudera Hive|Arrow|
-| ---|---|---|
-|Boolean|BOOLEAN|Bit|
-|Integer|TINYINT|Tiny|
-|Short|SMALLINT|Smallint|
-|Integer|INT|Int|
-|Long|BIGINT|Bigint|
-|float|float4|Float4|
-|Double|float8|Float8|
-|Date|date|DateDay|
-|Timestamp|timestamp|DateMilli|
-|String|VARCHAR|Varchar|
-|Bytes|bytes|Varbinary|
-|BigDecimal|Decimal|Decimal|
-|**\*ARRAY**|**N/A**|List|
-
-See the Cloudera Hive documentation for conversion between JDBC and database types.
-
-**\*NOTE**: The aggregate types (ARRAY, MAP, STRUCT, and UNIONTYPE) are not yet supported by Cloudera Hive. Columns of aggregate types are treated as VARCHAR columns in SQL and STRING columns in Java.
-
-# Secrets
-
-We support two ways to input the database username and password:
-
-1. **AWS Secrets Manager:** The name of the secret in AWS Secrets Manager can be embedded in the JDBC connection string and is replaced with the `username` and `password` values from the secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format: `{"username": "${username}", "password": "${password}"}`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-2. **Connection String:** Username and password can be specified as properties in the JDBC connection string.
-
-# Partitions and Splits
-A partition is represented by a single partition column of type varchar. Partitions are mapped from the partition columns defined on a Cloudera Hive table, and this column contains the partition information. For a table that does not have partition names, * is returned, which is equivalent to a single partition. A partition is equivalent to a split.
-
-| Name | Type | Description |
-|-----------|---------|-------------|
-| partition | Varchar |Partition information on table columns|
-
-# Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-**Note:** The Cloudera Hive integration test suite will not create any Cloudera Hive service or datasets; instead, it uses existing Cloudera Hive databases.
-
-# Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from
-source by following the steps below or by using the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-jdbc** dir, run `mvn clean install`.
-4. From the **athena-cloudera-hive** dir, run `mvn clean install`.
-5. From the **athena-cloudera-hive** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-hive2` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with the appropriate permissions to deploy instances of the connector via the 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-
-# JDBC Driver Versions
-
-For latest version information see [pom.xml](./pom.xml).
-
-# Limitations
-* Write DDL operations are not supported.
-* In a Mux setup, the spill bucket and prefix are shared across all database instances.
-* Any relevant Lambda limits. See the Lambda documentation.
-
-# Performance tuning
-
-Hive supports static partitions. Athena's Lambda connector can retrieve data from these partitions in parallel. We highly recommend static partitioning when retrieving huge datasets with uniform partition distribution.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-cloudera-hive.html).
diff --git a/athena-cloudera-impala/README.md b/athena-cloudera-impala/README.md
index e245b2483f..d7c7dc0f75 100644
--- a/athena-cloudera-impala/README.md
+++ b/athena-cloudera-impala/README.md
@@ -2,179 +2,4 @@
 
 This connector enables Amazon Athena to access your Cloudera Impala databases.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check the documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-impala/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-impala/LICENSE.txt).
-
-# Terms
-
-* **Database Instance:** Any instance of a database deployed on premises, on EC2, or using RDS.
-* **Handler:** A Lambda handler accessing your database instance(s). Could be a metadata or a record handler.
-* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s).
-* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s).
-* **Composite Handler:** A Lambda handler that retrieves both metadata and data records from your database instance(s). We recommend setting this as the Lambda function handler.
-* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-# Usage
-
-## Parameters
-
-The Cloudera Impala Connector supports several configuration parameters set via Lambda environment variables.
-
-### Connection String:
-
-A JDBC Connection string is used to connect to a database instance. The following format is supported: `impala://${jdbc_connection_string}`.
-
-### Multiplexing handler parameters
-
-The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed depending on catalog name. Use the following classes in Lambda when using the multiplexer.
-
-|Handler|Class|
-|--- |--- |
-|Composite Handler|ImpalaMuxCompositeHandler|
-|Metadata Handler|ImpalaMuxMetadataHandler|
-|Record Handler|ImpalaMuxRecordHandler|
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string    Database instance connection string, in the format specified above. Required.
-                                Example: If the catalog as registered with Athena is myImpalacatalog, then the environment variable name should be myImpalacatalog_connection_string
-
-default                         Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for an Impala Mux Lambda function that supports two database instances, Impala1host (default) and Impala2host:
-
-|Property|Value|
-|---|---|
-|default|impala://jdbc:impala://Impala1host:10000/default?${Test/RDS/Impala1host}|
-| | |
-|Impala2_catalog1_connection_string|impala://jdbc:impala://Impala1host:10000/default?${Test/RDS/Impala1host}|
-| | |
-|Impala2_catalog2_connection_string|impala://jdbc:impala://Impala2host:10000/default?UID=sample&PWD=sample|
-
-The Cloudera Impala Connector supports substitution of any string enclosed like *${SecretName}* with the *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-impala://jdbc:impala://Impala1host:10000/default?...&${Test/RDS/Impala1host}&...
-```
-
-will be modified to:
-
-```
-impala://jdbc:impala://Impala1host:10000/default?...&UID=sample2&PWD=sample2&...
-```
-
-The Secret Name `Test/RDS/Impala1host` will be used to retrieve secrets.
-
-Currently Cloudera Impala recognizes the `UID` and `PWD` JDBC properties.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single Cloudera Impala instance.
-```
-Composite Handler    ImpalaCompositeHandler
-Metadata Handler     ImpalaMetadataHandler
-Record Handler       ImpalaRecordHandler
-```
-
-**Parameters:**
-
-```
-default    Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must provide the `default` connection string parameter. All other connection strings are ignored.
-
-**Example property for a single Cloudera Impala instance supported by a Lambda function:**
-
-|Property|Value|
-|---|---|
-|default|impala://jdbc:impala://Impala1host:10000/default?secret=${Test/RDS/Impala1host}|
-
-### Spill parameters:
-
-The Lambda SDK may spill data to S3. All database instances accessed using a single Lambda function spill to the same location.
-
-```
-spill_bucket                Spill bucket name. Required.
-spill_prefix                Spill bucket key prefix. Required.
-spill_put_request_headers   JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-```
-
-# Data types support
-
-|JDBC|Cloudera Impala|Arrow|
-| ---|---|---|
-|Boolean|BOOLEAN|Bit|
-|Integer|TINYINT|Tiny|
-|Short|SMALLINT|Smallint|
-|Integer|INT|Int|
-|Long|BIGINT|Bigint|
-|float|float4|Float4|
-|Double|float8|Float8|
-|Date|date|DateDay|
-|Timestamp|timestamp|DateMilli|
-|String|VARCHAR|Varchar|
-|Bytes|bytes|Varbinary|
-|BigDecimal|Decimal|Decimal|
-|**\*ARRAY**|**N/A**|List|
-
-See the Cloudera Impala documentation for conversion between JDBC and database types.
-
-**\*NOTE**: The aggregate types (ARRAY, MAP, STRUCT, and UNIONTYPE) are not yet supported by Cloudera Impala. Columns of aggregate types are treated as VARCHAR columns in SQL and STRING columns in Java.
-
-# Secrets
-
-We support two ways to input the database username and password:
-
-1. **AWS Secrets Manager:** The name of the secret in AWS Secrets Manager can be embedded in the JDBC connection string and is replaced with the `username` and `password` values from the secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format: `{"username": "${username}", "password": "${password}"}`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-2. **Connection String:** Username and password can be specified as properties in the JDBC connection string.
-
-# Partitions and Splits
-A partition is represented by a single partition column of type varchar. Partitions are mapped from the partition columns defined on a Cloudera Impala table, and this column contains the partition information. For a table that does not have partition names, * is returned, which is equivalent to a single partition. A partition is equivalent to a split.
-
-| Name | Type | Description |
-|-----------|---------|-------------|
-| partition | Varchar |Partition information on table columns|
-
-# Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-**Note:** The Cloudera Impala integration test suite will not create any Cloudera Impala service or datasets; instead, it uses existing Cloudera Impala databases.
-
-# Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from
-source by following the steps below or by using the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-jdbc** dir, run `mvn clean install`.
-4. From the **athena-cloudera-impala** dir, run `mvn clean install`.
-5. From the **athena-cloudera-impala** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-Impala` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with the appropriate permissions to deploy instances of the connector via the 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-
-# JDBC Driver Versions
-
-For latest version information see [pom.xml](./pom.xml).
-
-# Limitations
-* Write DDL operations are not supported.
-* In a Mux setup, the spill bucket and prefix are shared across all database instances.
-* Any relevant Lambda limits. See the Lambda documentation.
-
-# Performance tuning
-
-Impala supports static partitions. Athena's Lambda connector can retrieve data from these partitions in parallel. We highly recommend static partitioning when retrieving huge datasets with uniform partition distribution.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-cloudera-impala.html).
\ No newline at end of file
diff --git a/athena-cloudwatch-metrics/README.md b/athena-cloudwatch-metrics/README.md
index c8728069be..d5be8b2fe4 100644
--- a/athena-cloudwatch-metrics/README.md
+++ b/athena-cloudwatch-metrics/README.md
@@ -2,87 +2,4 @@
 
 This connector enables Amazon Athena to communicate with Cloudwatch Metrics, making your metrics data accessible via SQL.
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check the documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-## Usage
-
-### Parameters
-
-The Athena Cloudwatch Metrics Connector exposes several configuration options via Lambda environment variables. More detail on the available parameters can be found below.
-
-1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda’s limits, this is the bucket that the data will be written to for Athena to read the excess from. (e.g. my_bucket)
-2. **spill_prefix** - (Optional) Defaults to a sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/hours.
-3. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-4. **kms_key_id** - (Optional) By default any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation for a stronger source of encryption keys. (e.g. a7e63k4b-8loc-40db-a2a1-4d0en2cd8331)
-5. **disable_spill_encryption** - (Optional) Defaults to False so that any data that is spilled to S3 is encrypted using AES-GCM either with a randomly generated key or using KMS to generate keys. Setting this to True will disable spill encryption. You may wish to disable this for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption. (e.g. True or False)
-
-The connector also supports AIMD Congestion Control for handling throttling events from Cloudwatch via the Athena Query Federation SDK's ThrottlingInvoker construct. You can tweak the default throttling behavior by setting any of the below (optional) environment variables:
-
-1. **throttle_initial_delay_ms** - (Default: 10ms) This is the initial call delay applied after the first congestion event.
-1. **throttle_max_delay_ms** - (Default: 1000ms) This is the max delay between calls. You can derive TPS by dividing it into 1000ms.
-1. **throttle_decrease_factor** - (Default: 0.5) This is the factor by which we reduce our call rate.
-1. **throttle_increase_ms** - (Default: 10ms) This is the rate at which we decrease the call delay.
-
-
-### Databases & Tables
-
-The Athena Cloudwatch Metrics Connector maps your Namespaces, Dimensions, Metrics, and Metric Values into two tables in a single schema called "default".
-
-1. **metrics** - This table contains the available metrics as uniquely defined by a triple of namespace, dimension set, and name. More specifically, this table contains the following columns.
-
-   * **namespace** - A VARCHAR containing the namespace.
-   * **metric_name** - A VARCHAR containing the metric name.
-   * **dimensions** - A LIST of STRUCTS comprised of dim_name (VARCHAR) and dim_value (VARCHAR).
-   * **statistic** - A LIST of VARCHAR statistics (e.g. p90, AVERAGE, etc.) available for the metric.
-
-1. **metric_samples** - This table contains the available metric samples for each metric named in the **metrics** table. More specifically, the table contains the following columns:
-   * **namespace** - A VARCHAR containing the namespace.
-   * **metric_name** - A VARCHAR containing the metric name.
-   * **dimensions** - A LIST of STRUCTS comprised of dim_name (VARCHAR) and dim_value (VARCHAR).
-   * **dim_name** - A VARCHAR convenience field used to easily filter on a single dimension name.
-   * **dim_value** - A VARCHAR convenience field used to easily filter on a single dimension value.
-   * **period** - An INT field representing the 'period' of the metric in seconds. (e.g. 60 second metric)
-   * **timestamp** - A BIGINT field representing the epoch time (in seconds) the metric sample is for.
-   * **value** - A FLOAT8 field containing the value of the sample.
-   * **statistic** - A VARCHAR containing the statistic type of the sample. (e.g. AVERAGE, p90, etc.)
-
-### Required Permissions
-
-Review the "Policies" section of the athena-cloudwatch-metrics.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.
-
-1. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
-2. Cloudwatch Metrics ReadOnly - The connector uses this access to query your metrics data.
-3. Cloudwatch Logs Write - The connector uses this access to write its own diagnostic logs.
-4. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.
-
-### Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below or by using the more detailed tutorial in the athena-example module:
-
-1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
-2. From the athena-cloudwatch-metrics dir, run `mvn clean install`.
-3. From the athena-cloudwatch-metrics dir, run `../tools/publish.sh S3_BUCKET_NAME athena-cloudwatch-metrics` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with the appropriate permissions to deploy instances of the connector via the 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-4. Try running a query like the one below in Athena:
-```sql
--- Get the list of available metrics
-select * from "lambda:"."default".metrics limit 100
-
--- Query the last 3 days of AWS/Lambda Invocations metrics
-SELECT *
-FROM   "lambda:"."default".metric_samples
-WHERE  metric_name = 'Invocations'
-       AND namespace = 'AWS/Lambda'
-       AND statistic IN ( 'p90', 'Average' )
-       AND period = 60
-       AND timestamp BETWEEN To_unixtime(Now() - INTERVAL '3' day) AND
-                             To_unixtime(Now())
-LIMIT  100;
-```
-
-## Performance
-
-The Athena Cloudwatch Metrics Connector will attempt to parallelize queries against Cloudwatch Metrics by parallelizing scans of the various metrics needed for your query. Predicate Pushdown is performed within the Lambda function and also within Cloudwatch Metrics for certain time period, metric, namespace, and dimension filters.
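-
-As an illustrative sketch (the namespace, dimension, and instance id below are placeholders, and `<function_name>` stands in for your deployed Lambda function's name), a query that filters on a single dimension via the `dim_name`/`dim_value` convenience fields, so the dimension filter can be pushed into the metrics scan, might look like:
-
-```sql
--- Average CPU for one (hypothetical) EC2 instance, 5-minute samples
-SELECT timestamp, value
-FROM   "lambda:<function_name>"."default".metric_samples
-WHERE  namespace = 'AWS/EC2'
-       AND metric_name = 'CPUUtilization'
-       AND statistic = 'Average'
-       AND period = 300
-       AND dim_name = 'InstanceId'
-       AND dim_value = 'i-0123456789abcdef0'
-LIMIT  100;
-```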
-
-## License
-
-This project is licensed under the Apache-2.0 License.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-cwmetrics.html).
\ No newline at end of file
diff --git a/athena-cloudwatch/README.md b/athena-cloudwatch/README.md
index b844dd127d..768fae0898 100644
--- a/athena-cloudwatch/README.md
+++ b/athena-cloudwatch/README.md
@@ -2,76 +2,4 @@
 
 This connector enables Amazon Athena to communicate with Cloudwatch, making your log data accessible via SQL.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check the documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-## Usage
-
-### Parameters
-
-The Athena Cloudwatch Connector exposes several configuration options via Lambda environment variables. More detail on the available parameters can be found below.
-
-1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda’s limits, this is the bucket that the data will be written to for Athena to read the excess from. (e.g. my_bucket)
-2. **spill_prefix** - (Optional) Defaults to a sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/hours.
-3. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-4. **kms_key_id** - (Optional) By default any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation for a stronger source of encryption keys. (e.g. a7e63k4b-8loc-40db-a2a1-4d0en2cd8331)
-5. **disable_spill_encryption** - (Optional) Defaults to False so that any data that is spilled to S3 is encrypted using AES-GCM either with a randomly generated key or using KMS to generate keys. Setting this to True will disable spill encryption. You may wish to disable this for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption. (e.g. True or False)
-
-The connector also supports AIMD Congestion Control for handling throttling events from Cloudwatch via the Athena Query Federation SDK's ThrottlingInvoker construct. You can tweak the default throttling behavior by setting any of the below (optional) environment variables (see the example after this list):
-
-1. **throttle_initial_delay_ms** - (Default: 10ms) This is the initial call delay applied after the first congestion event.
-1. **throttle_max_delay_ms** - (Default: 1000ms) This is the max delay between calls. You can derive TPS by dividing it into 1000ms.
-1. **throttle_decrease_factor** - (Default: 0.5) This is the factor by which we reduce our call rate.
-1. **throttle_increase_ms** - (Default: 10ms) This is the rate at which we decrease the call delay.
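-
-As a hedged sketch (the function name and variable values are placeholders, not values from this repository), these variables could be set on an already-deployed connector with the AWS CLI. Note that `--environment` replaces the function's entire environment, so include every variable the connector needs (e.g. spill_bucket):
-
-```bash
-# Overwrites the function's environment with the listed variables
-aws lambda update-function-configuration \
-  --function-name my-athena-cloudwatch-connector \
-  --environment "Variables={spill_bucket=my_bucket,throttle_initial_delay_ms=10,throttle_max_delay_ms=1000,throttle_decrease_factor=0.5,throttle_increase_ms=10}"
-```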
-
-
-### Databases & Tables
-
-The Athena Cloudwatch Connector maps your LogGroups as schemas (aka databases) and each LogStream as a table. The connector also maps a special "all_log_streams" View comprised of all LogStreams in the LogGroup. This View allows you to query all the logs in a LogGroup at once instead of searching through each LogStream individually.
-
-Every Table mapped by the Athena Cloudwatch Connector has the following schema, which matches the fields provided by Cloudwatch Logs itself.
-
-1. **log_stream** - A VARCHAR containing the name of the LogStream that the row is from.
-2. **time** - An INT64 containing the epoch time at which the log line was generated.
-3. **message** - A VARCHAR containing the log message itself.
-
-### Required Permissions
-
-Review the "Policies" section of the athena-cloudwatch.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.
-
-1. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
-2. CloudWatch Logs Read/Write - The connector uses this access to read your log data in order to satisfy your queries but also to write its own diagnostic logs.
-3. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.
-
-### Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless,
-the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the
-[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-
-### Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below or by using the more detailed tutorial in the athena-example module:
-
-1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
-2. From the athena-federation-integ-test dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the athena-cloudwatch dir, run `mvn clean install`.
-4. From the athena-cloudwatch dir, run `../tools/publish.sh S3_BUCKET_NAME athena-cloudwatch` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with the appropriate permissions to deploy instances of the connector via the 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-5. Try running a query like the one below in Athena:
-```sql
-select * from "lambda:"."/aws/lambda/".all_log_streams limit 100
-```
-
-## Performance
-
-The Athena Cloudwatch Connector will attempt to parallelize queries against Cloudwatch by parallelizing scans of the various log_streams needed for your query. Predicate Pushdown is performed within the Lambda function and also within Cloudwatch Logs for certain time period filters.
-
-## License
-
-This project is licensed under the Apache-2.0 License.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-cloudwatch.html).
\ No newline at end of file
diff --git a/athena-docdb/README.md b/athena-docdb/README.md
index 203dade4c0..570c0aa925 100644
--- a/athena-docdb/README.md
+++ b/athena-docdb/README.md
@@ -2,109 +2,4 @@
 
 This connector enables Amazon Athena to communicate with your DocumentDB instance(s), making your DocumentDB data accessible via SQL. It also works with any MongoDB compatible endpoint.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check the documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-Unlike traditional relational data stores, DocumentDB collections do not have a set schema. Each entry can have different fields and data types. While we are investigating the best way to support schema-on-read use cases for this connector, it presently supports two mechanisms for generating traditional table schema information. The default mechanism is for the connector to scan a small number of documents in your collection in order to form a union of all fields and coerce fields with non-overlapping data types. This basic schema inference works well for collections that have mostly uniform entries. For more diverse collections, the connector supports retrieving meta-data from the Glue Data Catalog. If the connector sees a database and table which match your DocumentDB database and collection names, it will use the corresponding Glue table for schema. We recommend creating your Glue table such that it is a superset of all fields you may want to access from your DocumentDB Collection.
-
-### Parameters
-
-The Amazon Athena DocumentDB Connector exposes several configuration options via Lambda environment variables. More detail on the available parameters can be found below.
-
-1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda’s limits, this is the bucket that the data will be written to for Athena to read the excess from. (e.g. my_bucket)
-2. **spill_prefix** - (Optional) Defaults to a sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/hours.
-3. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-4. **kms_key_id** - (Optional) By default any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation for a stronger source of encryption keys. (e.g. a7e63k4b-8loc-40db-a2a1-4d0en2cd8331)
-5. **disable_spill_encryption** - (Optional) Defaults to False so that any data that is spilled to S3 is encrypted using AES-GCM either with a randomly generated key or using KMS to generate keys. Setting this to True will disable spill encryption. You may wish to disable this for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption. (e.g. True or False)
-6. **disable_glue** - (Optional) If present, with any value except false, the connector will no longer attempt to retrieve supplemental metadata from Glue.
-7. **glue_catalog** - (Optional) Can be used to target a cross-account Glue catalog. By default the connector will attempt to get metadata from its own Glue account.
-8. **default_docdb** - (Optional) If present, this DocDB connection string is used when there is not a catalog specific environment variable (as explained below). (e.g. mongodb://:@:/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0&readPreference=secondaryPreferred)
-
-You can also provide one or more properties which define the DocumentDB connection details for the DocumentDB instance(s) you'd like this connector to use. You can do this by setting a Lambda environment variable that corresponds to the catalog name you'd like to use in Athena. For example, if I'd like to query two different DocumentDB instances from Athena in the below queries:
-
-```sql
- select * from "docdb_instance_1".database.table
- select * from "docdb_instance_2".database.table
- ```
-
-To support these two SQL statements we'd need to add two environment variables to our Lambda function:
-
-1. **docdb_instance_1** - The value should be the DocumentDB connection details in the format of: mongodb://:@:/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0
-2. **docdb_instance_2** - The value should be the DocumentDB connection details in the format of: mongodb://:@:/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0
-
-You can also optionally use AWS Secrets Manager for part or all of the value for the preceding connection details. For example, if I set a Lambda environment variable for **docdb_instance_1** to be "mongodb://${docdb_instance_1_creds}@myhostname.com:123/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0" the Athena Federation SDK will automatically attempt to retrieve a secret from AWS Secrets Manager named "docdb_instance_1_creds" and inject that value in place of "${docdb_instance_1_creds}". Basically anything between ${...} is attempted as a secret in SecretsManager. If no such secret exists, the text isn't replaced.
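-
-As an illustrative sketch of that substitution (the secret name, credentials, and hostname are placeholders, and the assumed secret value is hypothetical), if the secret "docdb_instance_1_creds" contained the string `myuser:mypassword`, then the environment variable value
-
-```
-mongodb://${docdb_instance_1_creds}@myhostname.com:123/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0
-```
-
-would resolve at runtime to:
-
-```
-mongodb://myuser:mypassword@myhostname.com:123/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0
-```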
-
-To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-
-
-### Setting Up Databases & Tables
-
-To enable a Glue Table for use with DocumentDB, you simply need to have a Glue database and table that matches any DocumentDB Database and Collection that you'd like to supply supplemental metadata for (instead of relying on the DocumentDB Connector's ability to infer schema). The connector's built-in schema inference only supports a subset of data types and scans a limited number of documents. You can enable a Glue table to be used for supplemental metadata by setting the below table properties from the Glue Console when editing the Table and database in question. The only other thing you need to do is ensure you use the appropriate data types listed in a later section.
-
-1. **docdb-metadata-flag** - Flag indicating that the table can be used for supplemental meta-data by the Athena DocDB Connector. The value is unimportant as long as this key is present in the properties of the table.
-
-### Data Types
-
-The schema inference feature of this connector will attempt to infer values as one of the following:
-
-|Apache Arrow DataType|Java/DocDB Type|
-|-------------|-----------------|
-|VARCHAR|String|
-|INT|Integer|
-|BIGINT|Long|
-|BIT|Boolean|
-|FLOAT4|Float|
-|FLOAT8|Double|
-|TIMESTAMPSEC|Date|
-|VARCHAR|ObjectId|
-|LIST|List|
-|STRUCT|Document|
-
-Alternatively, if you are using Glue for supplemental metadata you can configure the following types:
-
-|Glue DataType|Apache Arrow Type|
-|-------------|-----------------|
-|int|INT|
-|bigint|BIGINT|
-|double|FLOAT8|
-|float|FLOAT4|
-|boolean|BIT|
-|binary|VARBINARY|
-|string|VARCHAR|
-|List|LIST|
-|Struct|STRUCT|
-
-### Required Permissions
-
-Review the "Policies" section of the athena-docdb.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.
-
-1. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
-2. SecretsManager Read Access - If you choose to store DocumentDB endpoint details in SecretsManager you will need to grant the connector access to those secrets.
-3. Glue Data Catalog - Since DocumentDB does not have a meta-data store, the connector requires Read-Only access to Glue's DataCatalog for supplemental table schema information.
-4. VPC Access - In order to connect to your VPC for the purposes of communicating with your DocumentDB instance(s), the connector needs the ability to attach/detach an interface to the VPC.
-5. CloudWatch Logs - This is a somewhat implicit permission when deploying a Lambda function, but it needs access to CloudWatch Logs for storing logs.
-6. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.
-
-### Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless,
-the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the -[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module. - -In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file. -For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module. - -Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory. - -### Deploying The Connector - -To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source follow the below steps or use the more detailed tutorial in the athena-example module: - -1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already. -2. From the athena-docdb dir, run `mvn clean install`. -3. From the athena-docdb dir, run `../tools/publish.sh S3_BUCKET_NAME athena-docdb` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This will allow users with permission to do so, the ability to deploy instances of the connector via 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo) - - -## Performance - -The Athena DocumentDB Connector does not current support parallel scans but will attempt to push down predicates as part of its DocumentDB queries. - +Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-docdb.html). diff --git a/athena-dynamodb/README.md b/athena-dynamodb/README.md index 04d34f213c..4c6e72d03a 100644 --- a/athena-dynamodb/README.md +++ b/athena-dynamodb/README.md @@ -2,111 +2,4 @@ This connector enables Amazon Athena to communicate with DynamoDB, making your tables accessible via SQL. -**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.** - -## Usage - -### Parameters - -The Athena DynamoDB Connector exposes several configuration options via Lambda environment variables. More detail on the available parameters can be found below. - -1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda’s limits, this is the bucket that the data will be written to for Athena to read the excess from. (e.g. my_bucket) -2. **spill_prefix** - (Optional) Defaults to sub-folder in your bucket called 'athena-federation-spill'. 
-2. **spill_prefix** - (Optional) Defaults to a sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/Hours.
-3. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-4. **kms_key_id** - (Optional) By default any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation for a stronger source of encryption keys. (e.g. a7e63k4b-8loc-40db-a2a1-4d0en2cd8331)
-5. **disable_spill_encryption** - (Optional) Defaults to False so that any data that is spilled to S3 is encrypted using AES-GCM either with a randomly generated key or using KMS to generate keys. Setting this to true will disable spill encryption. You may wish to disable encryption for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption. (e.g. True or False)
-6. **disable_glue** - (Optional) If present, with any value except false, the connector will no longer attempt to retrieve supplemental metadata from Glue.
-7. **glue_catalog** - (Optional) Can be used to target a cross-account Glue catalog. By default the connector will attempt to get metadata from its own Glue account.
-8. **disable_projection_and_casing** - (Optional) Defaults to: auto. Can be used to disable projection and casing in order to be able to query DynamoDB tables with casing in their column names without having to specify a "columnMapping" property on their Glue table. See [disable_projection_and_casing](#disable_projection_and_casing) below for more info.
-
-
-### Setting Up Databases & Tables in Glue
-
-To enable a Glue Table for use with DynamoDB, you simply need to have a table that matches any DynamoDB Table that you'd like to supply supplemental metadata for (instead of relying on the DynamoDB Connector's limited ability to infer schema). You can enable a Glue table to be used for supplemental metadata by setting one of the below table properties from the Glue Console when editing the Table in question. These properties are automatically set if you use Glue's DynamoDB Crawler. The only other thing you need to do is ensure you use the appropriate data types when defining manually or validate the columns and types that the Crawler discovered.
-
-1. **dynamodb** - String indicating that the table can be used for supplemental meta-data by the Athena DynamoDB Connector. This string can be in any one of the following places:
-    1. in the table properties/parameters under a field called "classification" (exact match).
-    2. in the table's storage descriptor's location field (substring match).
-    3. in the table's storage descriptor's parameters under a field called "classification" (exact match).
-2. **dynamo-db-flag** - String indicating that the *database* contains tables used for supplemental meta-data by the Athena DynamoDB Connector. This is required for any Glue databases other than "default" and is useful for filtering out irrelevant databases in accounts that have lots of them. This string should be in the Location URI of the Glue Database (substring match).
-3. **sourceTable** - Optional table property/parameter that defines the source table name in DynamoDB. Use this if Glue table naming rules prevent you from creating a Glue table with the same name as your DynamoDB table (e.g. capital letters are not permitted in Glue table names but are permitted in DynamoDB table names).
-4. **columnMapping** - Optional table property/parameter that defines column name mappings. Use this if Glue column naming rules prevent you from creating a Glue table with the same column names as your DynamoDB table (e.g. capital letters are not permitted in Glue column names but are permitted in DynamoDB column names). This is expected to be in the format `col1=Col1,col2=Col2`.
-5. **defaultTimeZone** - Optional table property/parameter for the timezone that will be applied to date/datetime values without an explicit timezone. To avoid any discrepancy between the data source default timezone and Athena's session timezone, it is good practice to set this value.
-6. **datetimeFormatMapping** - Optional table property/parameter that defines the date/datetime format to be used to parse the raw DynamoDB string in a particular column that is of Glue type `date` or `timestamp`. If not provided, the format will be inferred using [various ISO-8601 formats](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/DateFormatUtils.html). If the date/datetime format cannot be inferred, or if the raw string fails to parse, then the value will be omitted from the result. The mapping is expected to be in the format `col1=someformat1,col2=someformat2`. Some examples of the date/datetime formats are `yyyyMMdd'T'HHmmss` and `ddMMyyyy'T'HH:mm:ss`. If your column holds date/datetime values without a timezone and you wish to use the column in the `WHERE` clause, you need to set this optional property for that column. A combined example of these properties is shown after this list.
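As a combined, hypothetical illustration of the properties above, the Glue table parameters below simply follow the documented formats; the table name, column names, timezone, and format values are all invented.

```
sourceTable=MyDynamoDBTable
columnMapping=col1=Col1,col2=Col2
defaultTimeZone=America/New_York
datetimeFormatMapping=col1=yyyyMMdd'T'HHmmss
```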
-### Required Permissions
-
-Review the "Policies" section of the athena-dynamodb.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.
-
-1. DynamoDB Read Access - The connector uses the DescribeTable, ListSchemas, ListTables, Query, and Scan APIs.
-2. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
-3. Glue Data Catalog - Since DynamoDB does not have a meta-data store, the connector requires Read-Only access to Glue's DataCatalog for supplemental table schema information.
-4. CloudWatch Logs - This is a somewhat implicit permission when deploying a Lambda function but it needs access to cloudwatch logs for storing logs.
-5. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.
-
-### Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless,
-the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the
-[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-
-### Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below or using the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-dynamodb** dir, run `mvn clean install`.
-4. From the **athena-dynamodb** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-dynamodb` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to do so to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-
-## Performance
-
-The Athena DynamoDB Connector does support parallel scans and will attempt to push down predicates as part of its DynamoDB queries. A hash key predicate with X distinct values will result in X Query calls to DynamoDB. All other predicate scenarios will result in Y Scan calls, where Y is heuristically determined based on the size of your table and its provisioned throughput.
-
-## Costs
-
-The costs for use of this solution depend on the underlying AWS resources being used. Pay special attention to [DynamoDB pricing](https://aws.amazon.com/dynamodb/pricing/) since queries using scans can consume a large number of RCUs.
-
-
-## Notes
-
-If Glue is disabled, we perform schema inference. Under schema inference, all numeric types [int, float, double, etc.] are evaluated as Decimal. If you need the exact types, use Glue to declare the schema.
-
-## disable_projection_and_casing
-- auto
-
-  This disables projection and casing when we see a previously unsupported type
-  and we see that the user does not have column name mapping on their table.
-  This is the default setting.
-- always
-
-  This disables projection and casing unconditionally.
-  This is useful when users have casing in their ddb column names but do not want to
-  specify a column name mapping at all.
-
-Caveats with this new feature:
-
-- May incur higher bandwidth usage depending on your query.
-This is not a problem if your lambda is in the same region as your ddb table.
-
-- Overall latency may increase because a larger number of bytes are being transferred,
-and deserialization time grows with the larger amount of bytes.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-dynamodb.html).
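To make the Performance note above concrete, here is a hypothetical query pattern; the catalog, table, and key names are invented. Assuming `id` is the table's hash key, a predicate with two distinct values would typically be served by two DynamoDB Query calls, while a filter on a non-key attribute would fall back to parallel Scan calls.

```sql
-- Hypothetical names; assumes the connector is deployed as "dynamo"
-- and that "id" is the DynamoDB table's hash key.
SELECT *
FROM "lambda:dynamo".default.orders
WHERE id IN ('order-1', 'order-2');
```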
diff --git a/athena-elasticsearch/README.md b/athena-elasticsearch/README.md index 2b8c845be2..19818ffbed 100644 --- a/athena-elasticsearch/README.md +++ b/athena-elasticsearch/README.md @@ -5,247 +5,4 @@ making your Elasticsearch data accessible via SQL. This connector will work with Elasticsearch Service as well as any Elasticsearch compatible endpoint configured with `Elasticsearch version 7.0` or higher. -**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.** - -## Nomenclature - -This document includes descriptions and explanations using Elasticsearch concepts and -terminology: - -* **Domain** - A name this connector uses to associate with the endpoint of your Elasticsearch - instance and is also used as the database name. For Elasticsearch instances - defined within the Amazon Elasticsearch Service, the domain is auto-discoverable. For all - other instances, a mapping between the domain name and endpoint will need to be provided. - -* **Index** - A database table defined in your Elasticsearch instance. - -* **Mapping** - If an index is a database table, then a mapping is its schema (i.e. definitions -of fields/attributes). This connector supports metadata retrieval directly from the -Elasticsearch instance, as well as from the Glue Data Catalog. If the connector finds a Glue -database and table matching your Elasticsearch domain and index names it will attempt to use it -for schema definition. We recommend creating your Glue table such that it is a superset of all -fields defined in your Elasticsearch index. - -* **Document** - A record within a database table. - -## Parameters - -The Amazon Athena Elasticsearch Connector exposes several configuration options via Lambda -environment variables: - -1. **disable_glue** - (Optional) If present, with any value except false, the connector will no longer -attempt to retrieve supplemental metadata from Glue. - -2. **auto_discover_endpoint** - true/false (true is the default value). If you are using Amazon -Elasticsearch Service, having this set to true, allows the connector to auto-discover your -domains and endpoints by calling the appropriate describe/list APIs on Amazon Elasticsearch. -For any other type of Elasticsearch instance (e.g. self-hosted), the associated domain-endpoints -must be specified in the **domain_mapping** variable. This also determines which credentials will -be used to access the endpoint. If **auto_discover_endpoint**=**true**, then AWS credentials will -be used to authenticate to Elasticsearch. Otherwise, username/password credentials retrieved from -Amazon Secrets Manager via the **domain_mapping** variable will be used.* - -3. **domain_mapping** - Used only when **auto_discover_endpoint**=**false**, -this is the mapping between the domain names and their associated endpoints. 
The variable can
-accommodate multiple Elasticsearch endpoints using the following format:
-`domain1=endpoint1,domain2=endpoint2,domain3=endpoint3,...` For the purpose of authenticating to
-an Elasticsearch endpoint, this connector supports substitution strings injected with the format
-`${SecretName}:` with username and password retrieved from AWS Secrets Manager (see example
-below).* The colon `:` at the end of the expression serves as a separator from the rest of the
-endpoint.
-   ```
-   Example (using secret elasticsearch-creds):
-
-   movies=https://${elasticsearch-creds}:search-movies-ne...qu.us-east-1.es.amazonaws.com
-
-   Will be modified to:
-
-   movies=https://myusername@mypassword:search-movies-ne...qu.us-east-1.es.amazonaws.com
-   ```
-   Each domain-endpoint pair can utilize a different secret. The secret itself must be specified
-   in the format `username@password`. Although the password may contain embedded `@` signs, the
-   first one serves as the separator from the username. It is also important to note that `,` and
-   `=` are used by this connector as separators for the domain-endpoint pairs. Therefore, they
-   should **NOT** be used anywhere inside the stored secret.
-
-4. **query_timeout_cluster** - timeout period (in seconds) for Cluster-Health queries used in the
-generation of parallel scans.
-
-5. **query_timeout_search** - timeout period (in seconds) for Search queries used in the retrieval
-of documents from an index.
-
-6. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda’s limits,
-this is the bucket that the data will be written to for Athena to read the excess from (e.g.
-my_bucket).
-
-7. **spill_prefix** - (Optional) Defaults to a sub-folder in your bucket called
-'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the
-above bucket where large responses spill. You should configure an S3 lifecycle on this
-location to delete old spills after X days/hours.
-
-8. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-
-*To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-
-## Setting Up Databases & Tables
-
-A Glue table can be set up as a supplemental metadata definition source. To enable
-this feature, define a Glue database and table that match the domain and index of the source
-you are supplementing.
-
-Alternatively, this connector will take advantage of metadata definitions stored in the
-Elasticsearch instance by retrieving the mapping for the specified index. It is worth noting that
-Elasticsearch does not have a dedicated array data-type. Any field can contain zero or more
-values so long as they are of the same data-type. If you intend on using Elasticsearch as your
-metadata definition source, you will have to define a **_meta** property in all indices used with
-Athena to indicate which field(s) should be considered a list (array). Failure to do so will
-result in the extraction of only the first element in a list field. 
When specifying the _meta -property, field names should be fully qualified for nested JSON structures (e.g. `address.street`, -where street is a nested field inside an address structure). - -``` - Example1: - - PUT movies/_mapping - { - "_meta": { - "actor": "list", - "genre": "list" - } - } - - Example2: - Data: - { - "objlistouter": [{ - "objlistinner": [{ - "title": "somebook", - "author": "author" - }], - "field": "field" - }] - } - - PUT movies/_mapping - { - "_meta": { - "objlistouter": "list", - "objlistouter.objlistinner" : "list" - } - } -``` - -### Data Types - -As discussed above, this connector is capable of extracting metadata definitions from either -Glue, or the Elasticsearch instance. Those definitions will be converted to Apache Arrow -data-types using the following table (see NOTES below): - -|**Elasticsearch**|**Apache Arrow**|**Glue** -|-----------------|----------------|------------------| -|text, keyword, binary|VARCHAR|string| -|long|BIGINT|bigint -|scaled_float|BIGINT|SCALED_FLOAT(...) -|integer|INT|int -|short|SMALLINT|smallint -|byte|TINYINT|tinyint -|double|FLOAT8|double| -|float, half_float|FLOAT4|float| -|boolean|BIT|boolean| -|date, date_nanos|DATEMILLI|timestamp -|JSON structure|STRUCT|STRUCT| -|_meta (see above)|LIST|ARRAY| - -NOTES: - -* Only the Elasticsearch/Glue data-types listed above are supported for this connector at -the present time. - -* A **scaled_float** is a floating-point number scaled by a fixed double scaling factor and -represented as a **BIGINT** in Arrow (e.g. 0.756 with a scaling factor of 100 is rounded to 76). - -* To define a scaled_float in Glue you must select the **array** column type and declare the -field using the format `SCALED_FLOAT()`. - - Examples of valid values: - ``` - SCALED_FLOAT(10.51) - SCALED_FLOAT(100) - SCALED_FLOAT(100.0) - ``` - - Examples of invalid values: - ``` - SCALED_FLOAT(10.) - SCALED_FLOAT(.5) - ``` -* When converting from **date_nanos** to **DATEMILLI**, nanoseconds will be rounded to the -nearest millisecond. Valid values for date and date_nanos include but are not limited to: - * "2020-05-18T10:15:30.123456789" - * "2020-05-15T06:50:01.123Z" - * "2020-05-15T06:49:30.123-05:00" - * 1589525370001 (epoch milliseconds) - -* An Elasticsearch **binary** is a string representation of a binary value encoded using Base64, -and will be converted to a **VARCHAR**. - -## Running Integration Tests - -The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, -the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly. -For build commands and step-by-step instructions on building and running the integration tests see the -[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module. - -In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file. -For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module. 
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-
-## Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and
-deploy a pre-built version of this connector. Alternatively, you can build and deploy this
-connector from source. To do so, follow the steps below, or use the more detailed tutorial in the
-athena-example module:
-
-1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
-2. From the athena-federation-integ-test dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the athena-elasticsearch dir, run `mvn clean install`.
-4. From the athena-elasticsearch dir, run `../tools/publish.sh S3_BUCKET_NAME athena-elasticsearch` to publish the connector to your
-private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of
-the connector's code will be stored and retrieved by the Serverless Application Repository. This
-allows users with permission to deploy instances of the connector via a
-1-Click form.
-5. Navigate to the [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-
-## Performance
-
-The Athena Elasticsearch Connector supports shard-based parallel scans. Using cluster health information
-retrieved from the Elasticsearch instance, the connector generates multiple requests (for a document
-search query) that are split per shard and run concurrently.
-
-Additionally, the connector will push down predicates as part of its document search queries. The following
-example demonstrates this connector's ability to utilize predicate push-down.
-
-**Query:**
-```sql
-select * from "lambda:elasticsearch".movies.movies
-where year >= 1955 and year <= 1962 or year = 1996;
-```
-**Predicate:**
-```
-(_exists_:year) AND year:([1955 TO 1962] OR 1996)
-```
-
-## Executing SQL Queries
-
-The following are examples of DDL queries you can send with this connector. Note that
-**<function_name>** corresponds to the name of your Lambda function, **domain** is the name of
-the domain you wish to query, and **index** is the name of your index:
-
-```sql
-show databases in `lambda:<function_name>`;
-show tables in `lambda:<function_name>`.domain;
-describe `lambda:<function_name>`.domain.index;
-```
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-opensearch.html).
diff --git a/athena-google-bigquery/README.md b/athena-google-bigquery/README.md
index 8e8b071226..aa564d94bf 100644
--- a/athena-google-bigquery/README.md
+++ b/athena-google-bigquery/README.md
@@ -2,64 +2,4 @@
 This connector enables Amazon Athena to communicate with BigQuery, making your BigQuery data accessible.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. 
Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-google-bigquery/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-google-bigquery/LICENSE.txt).
-
-
-### Parameters
-
-The Athena Google BigQuery Connector exposes several configuration options via Lambda environment variables. More detail on the available parameters can be found below.
-
-|Parameter Name|Example Value|Description|
-|--------------|--------------------|------------------|
-|spill_bucket|my_bucket|When the data returned by your Lambda function exceeds Lambda’s limits, this is the bucket that the data will be written to for Athena to read the excess from.|
-|spill_prefix|temporary/split|(Optional) Defaults to a sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/Hours.|
-|kms_key_id|a7e63k4b-8loc-40db-a2a1-4d0en2cd8331|(Optional) By default any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation for a stronger source of encryption keys.|
-|disable_spill_encryption|True or False|(Optional) Defaults to False so that any data that is spilled to S3 is encrypted using AES-GCM either with a randomly generated key or using KMS to generate keys. Setting this to true will disable spill encryption. You may wish to disable encryption for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption.|
-|spill_put_request_headers|""|(Optional) JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html|
-|gcp_project_id|semiotic-primer-1234567|The project id (not project name) that contains the datasets that this connector should read from.|
-|secret_manager_gcp_creds_name|GoogleCloudPlatformCredentials|The name of the secret within AWS Secrets Manager that contains your BigQuery credentials JSON.|
-
-# Partitions and Splits
-Currently, splits are not based on partitions; they are based on the configured `concurrencyLimit` environment variable, which determines the page count per split. The connector uses LIMIT and OFFSET while executing the query instead of using a partition value in the query. For predicate push-down queries no splits are used, since those return fewer results; the split strategy applies only to non-predicate-push-down queries (i.e. `select *` queries). You can further increase `concurrencyLimit` in line with the Google BigQuery quota limits configured within your Google project.
-
-# Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless,
-the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the
-[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-**Note:** The Google BigQuery integration test suite will not create any Google BigQuery service or datasets; instead, it will use an existing Google BigQuery project and datasets.
-
-### Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below or using the more detailed tutorial in the athena-example module:
-
-1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
-2. From the athena-google-bigquery dir, run `mvn clean install`.
-3. From the athena-google-bigquery dir, run `../tools/publish.sh S3_BUCKET_NAME athena-google-bigquery` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to do so to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-
-## Limitations and Other Notes
-
-* Lambda has a maximum timeout value of 15 mins. Each split executes a query on BigQuery and must finish with enough time to store the results for Athena to read. If the Lambda times out, the query will fail.
-* Google BigQuery is case sensitive. We attempt to correct the case of dataset names and table names, but we do not do any case correction for project ids. This is necessary because Presto lower cases all metadata. These corrections will make many extra calls to Google BigQuery.
-* Binary and complex data types such as Maps, Lists, and Structs are currently not supported.
-* The connector may report Google quota limit issues due to Google BigQuery concurrency and quota limits. See [BigQuery Quotas and limits](https://cloud.google.com/bigquery/quotas). As a best practice, and in order to avoid quota limit issues, push down as many constraints to Google BigQuery as possible.
-
-## Performance
-
-This connector will attempt to push down as many constraints as possible to Google BigQuery to decrease the number of results returned.
-
-## License
-
-This project is licensed under the Apache-2.0 License.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-bigquery.html).
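As a hypothetical illustration of the constraint push-down described above, the query below filters on a single column so that BigQuery returns only the matching rows; the catalog, dataset, table, and column names are all invented.

```sql
-- Hypothetical names; assumes the connector is deployed as "bigquery".
SELECT name, total
FROM "lambda:bigquery".sales_dataset.orders
WHERE order_date = DATE '2022-01-01';
```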
diff --git a/athena-hbase/README.md b/athena-hbase/README.md
index 9aa32ba022..a38821f6a7 100644
--- a/athena-hbase/README.md
+++ b/athena-hbase/README.md
@@ -2,101 +2,4 @@
 This connector enables Amazon Athena to communicate with your HBase instance(s), making your HBase data accessible via SQL.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-Unlike traditional relational data stores, HBase tables do not have a set schema. Each entry can have different fields and data types. While we are investigating the best way to support schema-on-read use cases for this connector, it presently supports two mechanisms for generating traditional table schema information. The default mechanism is for the connector to scan a small number of documents in your collection in order to form a union of all fields and coerce fields with non-overlapping data types. This basic schema inference works well for collections that have mostly uniform entries. For more diverse collections, the connector supports retrieving meta-data from the Glue Data Catalog. If the connector sees a Glue database and table which match your HBase namespace and collection names it will use the corresponding Glue table for schema. We recommend creating your Glue table such that it is a superset of all fields you may want to access from your HBase table.
-
-## Usage
-
-### Parameters
-
-The Athena HBase Connector supports several configuration options via Lambda environment variables. More detail on the available parameters can be found below.
-
-1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda’s limits, this is the bucket that the data will be written to for Athena to read the excess from. (e.g. my_bucket)
-2. **spill_prefix** - (Optional) Defaults to a sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/Hours.
-3. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-4. **kms_key_id** - (Optional) By default any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation for a stronger source of encryption keys. (e.g. a7e63k4b-8loc-40db-a2a1-4d0en2cd8331)
-5. **disable_spill_encryption** - (Optional) Defaults to False so that any data that is spilled to S3 is encrypted using AES-GCM either with a randomly generated key or using KMS to generate keys. Setting this to true will disable spill encryption. You may wish to disable encryption for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption. (e.g. True or False)
-6. **disable_glue** - (Optional) If present, with any value except false, the connector will no longer attempt to retrieve supplemental metadata from Glue.
-7. **glue_catalog** - (Optional) Can be used to target a cross-account Glue catalog. By default the connector will attempt to get metadata from its own Glue account.
-8. **default_hbase** - If present, this HBase connection string (e.g. master_hostname:hbase_port:zookeeper_port) is used when there is not a catalog specific environment variable (as explained below).
-
-You can also provide one or more properties which define the HBase connection details for the HBase instance(s) you'd like this connector to use. You can do this by setting a Lambda environment variable that corresponds to the catalog name you'd like to use in Athena. For example, if I'd like to query two different HBase instances from Athena in the below queries:
-
-```sql
- select * from "hbase_instance_1".database.table
- select * from "hbase_instance_2".database.table
- ```
-
-To support these two SQL statements we'd need to add two environment variables to our Lambda function:
-
-1. **hbase_instance_1** - The value should be the HBase connection details in the format of: master_hostname:hbase_port:zookeeper_port
-2. **hbase_instance_2** - The value should be the HBase connection details in the format of: master_hostname:hbase_port:zookeeper_port
-
-You can also optionally use SecretsManager for part or all of the value for the preceding connection details. For example, if I set a Lambda environment variable for **hbase_instance_1** to be "${hbase_host_1}:${hbase_master_port_1}:${hbase_zookeeper_port_1}" the Athena Federation SDK will automatically attempt to retrieve a secret from AWS SecretsManager named "hbase_host_1" and inject that value in place of "${hbase_host_1}". It will do the same for the other secrets: hbase_zookeeper_port_1, hbase_master_port_1. Basically anything between ${...} is attempted as a secret in SecretsManager. If no such secret exists, the text isn't replaced.
-To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-
-### Setting Up Databases & Tables
-
-To enable a Glue Table for use with HBase, you simply need to have a Glue database and table that matches any HBase Namespace and Table that you'd like to supply supplemental metadata for (instead of relying on the HBase Connector's ability to infer schema). The connector's built-in schema inference only supports values serialized in HBase as Strings (e.g. String.valueOf(int)). You can enable a Glue table to be used for supplemental metadata by setting the below table properties from the Glue Console when editing the Table in question. The only other thing you need to do is ensure you use the appropriate data types and, optionally, HBase column family naming conventions.
-
-1. **hbase-metadata-flag** - Flag indicating that the table can be used for supplemental meta-data by the Athena HBase Connector. The value is unimportant as long as this key is present in the properties of the table.
-2. **hbase-native-storage-flag** - This flag toggles the two modes of value serialization supported by the connector. By default (when this field is not present) the connector assumes all values are stored in HBase as strings. As such it will attempt to parse INT, BIGINT, DOUBLE, etc. from HBase as Strings. If this field is set (the value of the table property doesn't matter, only its presence) on the table in Glue, the connector will switch to 'native' storage mode and attempt to read INT, BIGINT, BIT, and DOUBLE as bytes by using ByteBuffer.wrap(value).getInt(), ByteBuffer.wrap(value).getLong(), ByteBuffer.wrap(value).get(), and ByteBuffer.wrap(value).getDouble().
-
-When it comes to setting your columns, you have two choices for how you model HBase column families. The Athena HBase connector supports fully qualified (aka flattened) naming like "family:column" as well as using STRUCTs to model your column families. In the STRUCT model the name of the STRUCT field should match the column family and then any children of that STRUCT should match the names of the columns in that family. Since predicate push down and columnar reads are not yet fully supported for complex types like STRUCTs, we recommend against using the STRUCT approach unless your use case specifically requires the use of STRUCTs. The below image shows how we've configured a table in Glue using a combination of these approaches; a query sketch using flattened names follows the Required Permissions section below.
-
- ![Glue Example Image](https://github.com/awslabs/aws-athena-query-federation/blob/master/docs/img/hbase_glue_example.png?raw=true)
-
-### Data Types
-
-All HBase values are retrieved as the basic byte type. From there they are converted to one of the below Apache Arrow data types used by the Athena Query Federation SDK based on how you've defined your table(s) in Glue's DataCatalog. If you are not using Glue to supplement your metadata and instead depending on the connector's schema inference capabilities, only a subset of the below data types will be used, namely: BIGINT, FLOAT8, VARCHAR.
-
-|Glue DataType|Apache Arrow Type|
-|-------------|-----------------|
-|int|INT|
-|bigint|BIGINT|
-|double|FLOAT8|
-|float|FLOAT4|
-|boolean|BIT|
-|binary|VARBINARY|
-|string|VARCHAR|
-
-
-### Required Permissions
-
-Review the "Policies" section of the athena-hbase.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.
-
-1. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
-2. SecretsManager Read Access - If you choose to store HBase endpoint details in SecretsManager you will need to grant the connector access to those secrets.
-3. Glue Data Catalog - Since HBase does not have a meta-data store, the connector requires Read-Only access to Glue's DataCatalog for obtaining HBase key to table/column mappings.
-4. VPC Access - In order to connect to your VPC for the purposes of communicating with your HBase instance(s), the connector needs the ability to attach/detach an interface to the VPC.
-5. CloudWatch Logs - This is a somewhat implicit permission when deploying a Lambda function but it needs access to cloudwatch logs for storing logs.
-6. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.
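The sketch below, referenced from the column modeling discussion above, shows the flattened "family:column" naming in a query. All names are hypothetical, and quoting the fully qualified name to address a flattened column is an assumption based on the naming convention described earlier.

```sql
-- Hypothetical names; assumes the connector is deployed as "hbase" and the
-- Glue table exposes flattened "family1:id" / "family1:name" columns.
SELECT "family1:id", "family1:name"
FROM "lambda:hbase".default.customers
WHERE "family1:id" = '1001';
```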
-### Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless,
-the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the
-[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-
-### Deploying The Connector
-
-To use the Amazon Athena HBase Connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below or using the more detailed tutorial in the athena-example module:
-
-1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
-2. From the athena-hbase dir, run `mvn clean install`.
-3. From the athena-hbase dir, run `../tools/publish.sh S3_BUCKET_NAME athena-hbase` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to do so to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-
-## Performance
-
-The Athena HBase Connector will attempt to parallelize queries against your HBase instance by reading each region server in parallel. Predicate Pushdown is performed within the Lambda function and, where possible, pushed down into HBase using filters.
-
-## License
-
-This project is licensed under the Apache-2.0 License.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-hbase.html).
diff --git a/athena-hortonworks-hive/README.md b/athena-hortonworks-hive/README.md
index b05ecfad69..46230ac8c4 100644
--- a/athena-hortonworks-hive/README.md
+++ b/athena-hortonworks-hive/README.md
@@ -2,178 +2,4 @@
 This connector enables Amazon Athena to access your Hortonworks Hive databases.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. 
Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.** - -By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hortonworks-hive/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hortonworks-hive/LICENSE.txt). - -# Terms - -* **Database Instance:** Any instance of a database deployed on premises, EC2 or using RDS. -* **Handler:** A Lambda handler accessing your database instance(s). Could be metadata or a record handler. -* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s). -* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s). -* **Composite Handler:** A Lambda handler that retrieves metadata and data records from your database instance(s). This is recommended to be set as lambda function handler. -* **Multiplexing Handler:** a Lambda handler that can accept and use multiple different database connections. -* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables. -* **Connection String:** Used to establish connection to a database instance. -* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix `connection_string` property. - -# Usage - -## Parameters - -The Hortonworks Hive Connector supports several configuration parameters using Lambda environment variables. - -### Connection String: - -A JDBC Connection string is used to connect to a database instance. Following format is supported: `hive://${jdbc_connection_string}`. - -### Multiplexing handler parameters - -Multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed depending on catalog name. Use following classes in Lambda for using multiplexer. - -|Handler|Class| -|--- |--- | -|Composite Handler|HiveMuxCompositeHandler| -|Metadata Handler|HiveMuxMetadataHandler| -|Record Handler|HiveMuxRecordHandler| - - -**Parameters:** - -``` -${catalog}_connection_string Database instance connection string. One of two types specified above. Required. - Example: If the catalog as registered with Athena is myhivecatalog then the environment variable name should be myhivecatalog_connection_string - -default Default connection string. Required. This will be used when catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`. -``` - -Example properties for a Hive Mux Lambda function that supports two database instances, hive1host(default) and hive2host: - -|Property|Value| -|---|---| -|default|hive://jdbc:hive2://hive1host:10000/default?${Test/RDS/hive1host}| -| | | -|hive2_catalog1_connection_string|hive://jdbc:hive2://hive1host:10000/default?${Test/RDS/hive1host}| -| | | -|hive2_catalog2_connection_string|hive://jdbc:hive2://hive2host:10000/default?UID=sample&PWD=sample| - -Hortonworks Hive Connector supports substitution of any string enclosed like *${SecretName}* with *username* and *password* retrieved from AWS Secrets Manager. Example: - -``` -hive://jdbc:hive2://hive1host:10000/default?...&${Test/RDS/hive1host}&... -``` - -will be modified to: - -``` -hive://jdbc:hive2://hive1host:10000/default?...&UID=sample2&PWD=sample2&... 
-```
-
-Secret Name `Test/RDS/hive1host` will be used to retrieve secrets.
-
-Currently Hortonworks Hive recognizes the `UID` and `PWD` JDBC properties.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single Hortonworks Hive instance.
-```
-Composite Handler    HiveCompositeHandler
-Metadata Handler     HiveMetadataHandler
-Record Handler       HiveRecordHandler
-```
-
-**Parameters:**
-
-```
-default    Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.
-
-**Example property for a single Hortonworks Hive instance supported by a Lambda function:**
-
-|Property|Value|
-|---|---|
-|default|hive://jdbc:hive2://hive1host:10000/default?secret=${Test/RDS/hive1host}|
-
-### Spill parameters:
-
-The Lambda SDK may spill data to S3. All database instances accessed using a single Lambda spill to the same location.
-
-```
-spill_bucket               Spill bucket name. Required.
-spill_prefix               Spill bucket key prefix. Required.
-spill_put_request_headers  JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-```
-
-# Data types support
-
-|Jdbc|Hortonworks Hive|Arrow|
-|---|---|---|
-|Boolean|Boolean|Bit|
-|Integer|TINYINT|Tiny|
-|Short|SMALLINT|Smallint|
-|Integer|INT|Int|
-|Long|BIGINT|Bigint|
-|float|float4|Float4|
-|Double|float8|Float8|
-|Date|date|DateDay|
-|Timestamp|timestamp|DateMilli|
-|String|VARCHAR|Varchar|
-|Bytes|bytes|Varbinary|
-|BigDecimal|Decimal|Decimal|
-|**\*ARRAY**|**N/A**|List|
-
-See Hortonworks Hive documentation for conversion between JDBC and database types.
-
-**\*NOTE**: The aggregate types (ARRAY, MAP, STRUCT, and UNIONTYPE) are not yet supported by Hortonworks Hive. Columns of aggregate types are treated as VARCHAR columns in SQL and STRING columns in Java.
-
-# Secrets
-
-We support two ways to input database username and password:
-
-1. **AWS Secrets Manager:** The name of the secret in AWS Secrets Manager can be embedded in the JDBC connection string, where it is replaced with the `username` and `password` values from the Secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format: `{"username": "${username}", "password": "${password}"}`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-2. **Connection String:** Username and password can be specified as properties in the JDBC connection string.
-
-# Partitions and Splits
-A partition is represented by one partition column of type varchar. We leverage partitions as columns defined on a Hortonworks Hive table, and this column contains the partition information. 
For a table that does not have partition names, * is returned, which is equivalent to a single partition. A partition is equivalent to a split.
-
-| Name | Type | Description |
-|-----------|---------|-------------|
-| partition | Varchar | Partition information on table columns |
-
-# Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-**Note:** The Hortonworks Hive integration test suite will not create any Hortonworks Hive service or datasets; instead, it will use existing Hortonworks Hive databases.
-
-# Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below or using the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-jdbc** dir, run `mvn clean install`.
-4. From the **athena-hortonworks-hive** dir, run `mvn clean install`.
-5. From the **athena-hortonworks-hive** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-hive2` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to do so to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
-
-# JDBC Driver Versions
-
-For latest version information see [pom.xml](./pom.xml).
-
-# Limitations
-* Write DDL operations are not supported.
-* In Mux setup, spill bucket and prefix are shared across all database instances.
-* Any relevant Lambda limits. See Lambda documentation.
-
-# Performance tuning
-
-Hive supports static partitions. Athena's Lambda connector can retrieve data from these partitions in parallel. We highly recommend static partitioning for retrieving huge datasets with uniform partition distribution.
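To make the partition behavior above concrete, here is a hypothetical query against a statically partitioned Hive table; the catalog, table, and partition column names are invented, and each matching partition would be read as its own split.

```sql
-- Hypothetical names; assumes the connector is deployed as "hive" and
-- web_logs is partitioned by year and month.
SELECT *
FROM "lambda:hive".default.web_logs
WHERE year = '2022' AND month = '08';
```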
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-hortonworks.html).
diff --git a/athena-mysql/README.md b/athena-mysql/README.md
index 4ee81d1438..a3fe9bf313 100644
--- a/athena-mysql/README.md
+++ b/athena-mysql/README.md
@@ -2,177 +2,4 @@
 This connector enables Amazon Athena to access your MySQL databases.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-# Terms
-
-* **Database Instance:** Any instance of a database deployed on premises, EC2 or using RDS.
-* **Handler:** A Lambda handler accessing your database instance(s). Could be a metadata or a record handler.
-* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s).
-* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s).
-* **Composite Handler:** A Lambda handler that retrieves metadata and data records from your database instance(s). This is recommended to be set as the Lambda function handler.
-* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-# Usage
-
-## Parameters
-
-The MySQL Connector supports several configuration parameters using Lambda environment variables.
-
-### Connection String:
-
-A JDBC Connection string is used to connect to a database instance. The following format is supported: `mysql://${jdbc_connection_string}`.
-
-### Multiplexing handler parameters
-
-The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed depending on catalog name. Use the following classes in Lambda to use the multiplexer.
-
-|Handler|Class|
-|--- |--- |
-|Composite Handler|MySqlMuxCompositeHandler|
-|Metadata Handler|MySqlMuxMetadataHandler|
-|Record Handler|MySqlMuxRecordHandler|
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string    Database instance connection string. One of two types specified above. Required.
-                                Example: If the catalog as registered with Athena is mysqlcatalog then the environment variable name should be mysqlcatalog_connection_string
-
-default    Default connection string. Required. This will be used when catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-``` - -Example properties for a Mux Lambda function that supports two database instances mysql1(default) and mysql2: - -|Property|Value| -|---|---| -|default|mysql://jdbc:mysql://mysql2.host:3333/default?user=sample2&password=sample2| -| | | -|mysql_catalog1_connection_string|mysql://jdbc:mysql://mysql1.host:3306/default?${Test/RDS/MySql1}| -| | | -|mysql_catalog2_connection_string|mysql://jdbc:mysql://mysql2.host:3333/default?user=sample2&password=sample2| - -MySQL Connector supports substitution of any string enclosed like *${SecretName}* with *username* and *password* retrieved from AWS Secrets Manager. Example: - -``` -mysql://jdbc:mysql://mysql1.host:3306/default?...&${Test/RDS/MySql1}&... -``` - -will be modified to: - -``` -mysql://jdbc:mysql://mysql1.host:3306/default?...&user=sample2&password=sample2&... -``` - -Secret Name `Test/RDS/MySql1` will be used to retrieve secrets. - -Currently MySQL recognizes `user` and `password` JDBC properties. - -### Single connection handler parameters - -Single connection metadata and record handlers can also be used to connect to a single MySQL instance. -``` -Composite Handler MySqlCompositeHandler -Metadata Handler MySqlMetadataHandler -Record Handler MySqlRecordHandler -``` - -**Parameters:** - -``` -default Default connection string. Required. This will be used when a catalog is not recognized. -``` - -These handlers support one database instance and must provide `default` connection string parameter. All other connection strings are ignored. - -**Example property for a single MySql instance supported by a Lambda function:** - -|Property|Value| -|---|---| -|default|mysql://mysql1.host:3306/default?secret=Test/RDS/MySql1| - -### Spill parameters: - -Lambda SDK may spill data to S3. All database instances accessed using a single Lambda spill to the same location. - -``` -spill_bucket Spill bucket name. Required. -spill_prefix Spill bucket key prefix. Required. -spill_put_request_headers JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html -``` - -# Data types support - -|Jdbc|Arrow| -| ---|---| -|Boolean|Bit -|Integer|Tiny -|Short|Smallint -|Integer|Int -|Long|Bigint -|float|Float4 -|Double|Float8 -|Date|DateDay -|Timestamp|DateMilli -|String|Varchar -|Bytes|Varbinary -|BigDecimal|Decimal -|ARRAY|List| - -See MySQL documentation for conversion between JDBC and database types. - -# Secrets - -We support two ways to input database username and password: - -1. **AWS Secrets Manager:** The name of the secret in AWS Secrets Manager can be embedded in JDBC connection string, which is used to replace with `username` and `password` values from Secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format `{“username”: “${username}”, “password”: “${password}”}.`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager. -2. 
**Connection String:** Username and password can be specified as properties in the JDBC connection string. - -# Partitions and Splits -A partition is represented by a single partition column of type varchar. We leverage partitions defined on a MySql table, and this column contains partition names. For a table that does not have partition names, * is returned which is equivalent to a single partition. A partition is equivalent to a split. - -|Name|Type|Description -|---|---|---| -|partition_name|Varchar|Named partition in MySql. E.g. p0| - -# Running Integration Tests - -The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, -the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly. -For build commands and step-by-step instructions on building and running the integration tests see the -[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module. - -In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file. -For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module. - -Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory. - -# Deploying The Connector - -To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from -source follow the below steps or use the more detailed tutorial in the athena-example module: - -1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already. -2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already - (**Note: failure to follow this step will result in compilation errors**). -3. From the **athena-jdbc** dir, run `mvn clean install`. -4. From the **athena-mysql** dir, run `mvn clean install`. -5. From the **athena-mysql** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-mysql` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This will allow users with permission to do so, the ability to deploy instances of the connector via 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo) - -# JDBC Driver Versions - -For latest version information see [pom.xml](./pom.xml). - -# Limitations -* Write DDL operations are not supported. -* In Mux setup, spill bucket and prefix is shared across all database instances. -* Any relevant Lambda Limits. See Lambda documentation. -* Athena converts queries to lower case. MySQL table names need to be in lower case to match. For example, Athena queries against "myTable" will fail. - -# Performance tuning - -MySql supports native partitions. 
Athena's Lambda connector can retrieve data from these partitions in parallel. We highly recommend native partitioning when retrieving huge datasets with uniform partition distribution.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-mysql.html).
diff --git a/athena-neptune/README.md b/athena-neptune/README.md
index 6aaf524256..f1d9a4fd53 100644
--- a/athena-neptune/README.md
+++ b/athena-neptune/README.md
@@ -2,32 +2,4 @@
 
 This connector enables Amazon Athena to communicate with your Neptune Graph Database instance, making your Neptune graph data accessible via SQL.
 
-**Athena Federated Queries are now enabled as GA in the following regions: us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check the documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-## Steps to set up the connector
-
-### Setup Neptune Cluster (Optional)
-You can skip this step if you have an existing Amazon Neptune cluster with a property graph dataset in it that you would like to use. Ensure that the VPC hosting your Neptune cluster has an Internet Gateway and a NAT Gateway, and that the private subnets in which the Amazon Athena Neptune Connector Lambda function will be running have a route to the internet via this NAT Gateway. The NAT Gateway will later be used by the Amazon Athena Neptune Connector Lambda function to talk to AWS Glue.
-
-For detailed instructions on setting up a new Neptune cluster and loading the sample property graph air routes dataset into it, follow the steps mentioned [here](./docs/neptune-cluster-setup).
-
-### Setup AWS Glue Catalog
-
-Unlike traditional relational data stores, Neptune graph DB nodes and edges do not have a set schema. Each entry can have different fields and data types. While we are investigating the best way to support schema-on-read use cases for this connector, it presently supports retrieving metadata from the Glue Data Catalog. You need to pre-create the Glue Database and the corresponding Glue tables with the required schemas within that database. This allows the connector to populate the list of tables available to query within Athena.
-
-Refer to the sample Glue catalog setup [here](./docs/aws-glue-sample-scripts).
-
-### Deploy the Neptune Athena Connector
-
-Once you have created the Glue catalog, follow the steps [here](./docs/neptune-connector-setup) to set up the Athena Neptune Connector.
-
-Alternatively, you can build and deploy this connector from source by following the steps below, or use the more detailed tutorial in the athena-example module:
-
-1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
-2. From the athena-neptune dir, run `mvn clean install`.
-3. From the athena-neptune dir, run `../tools/publish.sh S3_BUCKET_NAME athena-neptune` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to deploy instances of the connector via a 1-Click form. 
Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo) - -## Current Limitations - -The connector currently supports only Property Graph model and does not support RDF Graphs yet. - - +Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-neptune.html). diff --git a/athena-oracle/README.md b/athena-oracle/README.md index d79622a4f4..121a6ea44f 100644 --- a/athena-oracle/README.md +++ b/athena-oracle/README.md @@ -2,178 +2,4 @@ This connector enables Amazon Athena to access your Oracle databases. -**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.** - -By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-oracle/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-oracle/LICENSE.txt). - -# Terms - -* **Database Instance:** Any instance of a database deployed on premises, EC2 or using RDS. -* **Handler:** A Lambda handler accessing your database instance(s). Could be metadata or a record handler. -* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s). -* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s). -* **Composite Handler:** A Lambda handler that retrieves metadata and data records from your database instance(s). This is recommended to be set as lambda function handler. -* **Multiplexing Handler:** a Lambda handler that can accept and use multiple different database connections. -* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables. -* **Connection String:** Used to establish connection to a database instance. -* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix `connection_string` property. - -# Usage - -## Parameters - -The Oracle Connector supports several configuration parameters using Lambda environment variables. - -### Connection String: - -A JDBC Connection string is used to connect to a database instance. Following format is supported: `oracle://${jdbc_connection_string}`. - -### Multiplexing handler parameters - -Multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed depending on catalog name. Use following classes in Lambda for using multiplexer. - -| Handler | Class | -|-------------------|----------------------------| -| Composite Handler | OracleMuxCompositeHandler | -| Metadata Handler | OracleMuxMetadataHandler | -| Record Handler | OracleMuxRecordHandler | - - -**Parameters:** - -``` -${catalog}_connection_string Database instance connection string. One of two types specified above. Required. 
-        Example: If the catalog as registered with Athena is myoraclecatalog, then the environment variable name should be myoraclecatalog_connection_string
-
-default        Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for an Oracle Mux Lambda function that supports two database instances, oracle1 (default) and oracle2:
-
-| Property                          | Value                                                                              |
-|-----------------------------------|------------------------------------------------------------------------------------|
-| default                           | oracle://jdbc:oracle:thin:${Test/RDS/Oracle1}@//oracle1.hostname:port/servicename |
-| oracle_catalog1_connection_string | oracle://jdbc:oracle:thin:${Test/RDS/Oracle1}@//oracle1.hostname:port/servicename |
-| oracle_catalog2_connection_string | oracle://jdbc:oracle:thin:${Test/RDS/Oracle2}@//oracle2.hostname:port/servicename |
-
-The Oracle Connector supports substitution of any string enclosed like *${Test/RDS/Oracle}* with *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-oracle://jdbc:oracle:thin:${Test/RDS/Oracle}@//hostname:port/servicename
-```
-
-will be modified to:
-
-```
-oracle://jdbc:oracle:thin:username/password@//hostname:port/servicename
-```
-
-Secret Name `Test/RDS/Oracle` will be used to retrieve secrets.
-
-Unlike the other JDBC connectors, Oracle does not use keyed `user` and `password` JDBC properties. Instead, the username and password are supplied inline in the form `username/password`, without the `user` and `password` keys.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single Oracle instance.
-```
-Composite Handler    OracleCompositeHandler
-Metadata Handler     OracleMetadataHandler
-Record Handler       OracleRecordHandler
-```
-
-**Parameters:**
-
-```
-default    Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must provide the `default` connection string parameter. All other connection strings are ignored.
-
-The current version of the Oracle connector supports SSL-based connections for Amazon RDS instances only. Specifically, it supports TLS, and only for authentication of the server by the client (there is no mutual-auth support, since RDS does not support it).
-
-**Example property for a single Oracle instance supported by a Lambda function:**
-
-| Property | Value                                                                                                                                                    |
-|----------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
-| default  | oracle://jdbc:oracle:thin:${Test/RDS/Oracle}@//hostname:port/servicename (or)                                                                            |
-|          | oracle://jdbc:oracle:thin:${Test/RDS/Oracle}@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCPS)(HOST=)(PORT=))(CONNECT_DATA=(SID=))(SECURITY=(SSL_SERVER_CERT_DN=))) |
-
-### Spill parameters:
-
-Lambda SDK may spill data to S3. All database instances accessed using a single Lambda spill to the same location.
-
-```
-spill_bucket                 Spill bucket name. Required.
-spill_prefix                 Spill bucket key prefix. Required.
-spill_put_request_headers    JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. 
For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html -``` - -# Data types support - -| Jdbc | *Oracle[] | Arrow | -|-------------|----------------|-----------| -| Boolean | boolean[] | Bit | -| Integer | **N/A** | Tiny | -| Short | smallint[] | Smallint | -| Integer | integer[] | Int | -| Long | bigint[] | Bigint | -| float | float4[] | Float4 | -| Double | float8[] | Float8 | -| Date | date[] | DateDay | -| Timestamp | timestamp[] | DateMilli | -| String | text[] | Varchar | -| Bytes | bytea[] | Varbinary | -| BigDecimal | numeric(p,s)[] | Decimal | -| **\*ARRAY** | **N/A** | List | - -See Oracle documentation for conversion between JDBC and database types. - -# Secrets - -We support two ways to input database username and password: - -1. **AWS Secrets Manager:** The name of the secret in AWS Secrets Manager can be embedded in JDBC connection string, which is used to replace with `username` and `password` values from Secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format `{“username”: “${username}”, “password”: “${password}”}.`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager. -2. **Connection String:** Username and password can be specified as properties in the JDBC connection string. - -# Partitions and Splits -A partition is represented by a single partition column of type varchar. We leverage partitions defined on a Oracle table, and this column contains partition names. For a table that does not have partition names, * is returned which is equivalent to a single partition. A partition is equivalent to a split. - -| Name | Type | Description | -|----------------|---------|----------------------------------------------------------| -| PARTITION_NAME | Varchar | Named partition in Oracle. E.g. P_2006_DEC,P_2006_AUG .. | - -# Running Integration Tests - -The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly. -For build commands and step-by-step instructions on building and running the integration tests see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module. - -In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file. -For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module. 
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-**Note: The Oracle integration test suite will not create any Oracle RDS instances or database schemas; instead, it will use existing Oracle RDS or EC2 instances.**
-
-# Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below, or use the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-jdbc** dir, run `mvn clean install`.
-4. From the **athena-oracle** dir, run `mvn clean install`.
-5. From the **athena-oracle** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-oracle` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo)
-
-# JDBC Driver Versions
-
-For the latest version information, see [pom.xml](./pom.xml).
-
-# Limitations
-* Write DDL operations are not supported.
-* In a Mux setup, the spill bucket and prefix are shared across all database instances.
-* Any relevant Lambda limits. See the Lambda documentation.
-
-# Performance tuning
-
-Oracle supports native partitions. Athena's Lambda connector can retrieve data from these partitions in parallel. We highly recommend native partitioning when retrieving huge datasets with uniform partition distribution.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-oracle.html).
diff --git a/athena-postgresql/README.md b/athena-postgresql/README.md
index 821cc5d720..2af0dbe504 100644
--- a/athena-postgresql/README.md
+++ b/athena-postgresql/README.md
@@ -2,181 +2,4 @@
 
 This connector enables Amazon Athena to access your PostgreSQL databases.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check the documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-# Terms
-
-* **Database Instance:** Any instance of a database deployed on premises, on EC2, or using RDS.
-* **Handler:** A Lambda handler accessing your database instance(s). Could be a metadata or a record handler.
-* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s).
-* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s).
-* **Composite Handler:** A Lambda handler that retrieves both metadata and data records from your database instance(s). This is the recommended Lambda function handler.
-* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-# Usage
-
-## Parameters
-
-The PostgreSQL Connector supports several configuration parameters using Lambda environment variables.
-
-### Connection String:
-
-A JDBC connection string is used to connect to a database instance. The following format is supported: `postgres://${jdbc_connection_string}`.
-
-### Multiplexing handler parameters
-
-The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed depending on catalog name. Use the following classes in Lambda to use the multiplexer.
-
-|Handler|Class|
-|--- |--- |
-|Composite Handler|PostGreSqlMuxCompositeHandler|
-|Metadata Handler|PostGreSqlMuxMetadataHandler|
-|Record Handler|PostGreSqlMuxRecordHandler|
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string    Database instance connection string. One of two types specified above. Required.
-                                Example: If the catalog as registered with Athena is mypostgrescatalog, then the environment variable name should be mypostgrescatalog_connection_string
-
-default                         Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for a PostgreSQL Mux Lambda function that supports two database instances, postgres1 (default) and postgres2:
-
-|Property|Value|
-|---|---|
-|default|postgres://jdbc:postgresql://postgres1.host:5432/default?${Test/RDS/PostGres1}|
-| | |
-|postgres_catalog1_connection_string|postgres://jdbc:postgresql://postgres1.host:5432/default?${Test/RDS/PostGres1}|
-| | |
-|postgres_catalog2_connection_string|postgres://jdbc:postgresql://postgres2.host:5432/default?user=sample&password=sample|
-
-The PostgreSQL Connector supports substitution of any string enclosed like *${SecretName}* with *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-postgres://jdbc:postgresql://postgres1.host:5432/default?...&${Test/RDS/PostGres1}&...
-```
-
-will be modified to:
-
-```
-postgres://jdbc:postgresql://postgres1.host:5432/default?...&user=sample2&password=sample2&...
-```
-
-Secret Name `Test/RDS/PostGres1` will be used to retrieve secrets.
-
-Currently PostgreSQL recognizes the `user` and `password` JDBC properties.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single PostgreSQL instance.
-```
-Composite Handler    PostGreSqlCompositeHandler
-Metadata Handler     PostGreSqlMetadataHandler
-Record Handler       PostGreSqlRecordHandler
-```
-
-**Parameters:**
-
-```
-default    Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must provide the `default` connection string parameter. All other connection strings are ignored.
-
-**Example property for a single PostgreSQL instance supported by a Lambda function:**
-
-|Property|Value|
-|---|---|
-|default|postgres://jdbc:postgresql://postgres1.host:5432/default?secret=${Test/RDS/PostgreSQL1}|
-
-### Spill parameters:
-
-Lambda SDK may spill data to S3. 
All database instances accessed using a single Lambda spill to the same location. - -``` -spill_bucket Spill bucket name. Required. -spill_prefix Spill bucket key prefix. Required. -spill_put_request_headers JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html -``` - -# Data types support - -|Jdbc|*PostGreSQL[]|Arrow| -| ---|---|---| -|Boolean|boolean[]|Bit -|Integer|**N/A**|Tiny -|Short|smallint[]|Smallint -|Integer|integer[]|Int -|Long|bigint[]|Bigint -|float|float4[]|Float4 -|Double|float8[]|Float8 -|Date|date[]|DateDay -|Timestamp|timestamp[]|DateMilli -|String|text[]|Varchar -|Bytes|bytea[]|Varbinary -|BigDecimal|numeric(p,s)[]|Decimal -|**\*ARRAY**|**N/A**|List| - -See PostgreSQL documentation for conversion between JDBC and database types. - -**\*NOTE**: ARRAY type is supported for the PostGreSQL connector with the following constraints: -* Multi-dimensional arrays (`[][]`, or nested arrays) are **NOT** supported. -* Columns with unsupported ARRAY data-types will be converted to array of string elements (i.e. `array`). - -# Secrets - -We support two ways to input database username and password: - -1. **AWS Secrets Manager:** The name of the secret in AWS Secrets Manager can be embedded in JDBC connection string, which is used to replace with `username` and `password` values from Secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format `{“username”: “${username}”, “password”: “${password}”}.`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager. -2. **Connection String:** Username and password can be specified as properties in the JDBC connection string. - -# Partitions and Splits -A partition is represented by two partition columns of type varchar. We leverage partitions as child tables defined on a PostGres table, and these columns contain child schema and child table information. For a table that does not have partition names, * is returned which is equivalent to a single partition. A partition is equivalent to a split. - -|Name|Type|Description -|---|---|---| -|partition_schema|Varchar|Child table schema name| -|partition_name|Varchar|Child table name| - -# Running Integration Tests - -The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, -the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly. -For build commands and step-by-step instructions on building and running the integration tests see the -[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module. 
-
-In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory.
-
-# Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from
-source by following the steps below, or use the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-jdbc** dir, run `mvn clean install`.
-4. From the **athena-postgresql** dir, run `mvn clean install`.
-5. From the **athena-postgresql** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-postgresql` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo)
-
-# JDBC Driver Versions
-
-For the latest version information, see [pom.xml](./pom.xml).
-
-# Limitations
-* Write DDL operations are not supported.
-* In a Mux setup, the spill bucket and prefix are shared across all database instances.
-* Any relevant Lambda limits. See the Lambda documentation.
-
-# Performance tuning
-
-PostgreSQL supports native partitions. Athena's Lambda connector can retrieve data from these partitions in parallel. We highly recommend native partitioning when retrieving huge datasets with uniform partition distribution.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-postgresql.html).
diff --git a/athena-redis/README.md b/athena-redis/README.md
index bba0d1a6dd..6672bfa0b9 100644
--- a/athena-redis/README.md
+++ b/athena-redis/README.md
@@ -2,91 +2,4 @@
 
 This connector enables Amazon Athena to communicate with your Redis instance(s), making your Redis data accessible via SQL.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check the documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-Unlike traditional relational data stores, Redis does not have the concept of a table or a column. Instead, Redis offers key-value access patterns where the key is essentially a 'string' and the value is one of: string, z-set, or hmap. 
The Athena Redis Connector allows you to configure virtual tables using the Glue Data Catalog for schema and special table properties to tell the Athena Redis Connector how to map your Redis key-values into a table. You can read more on this below in the 'Setting Up Tables Section'. - - -## Usage - -### Parameters - -The Athena Redis Connector exposes several configuration options via Lambda environment variables. More detail on the available parameters can be found below. - -1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda’s limits, this is the bucket that the data will be written to for Athena to read the excess from. (e.g. my_bucket) -2. **spill_prefix** - (Optional) Defaults to sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/Hours. -3. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html -4. **kms_key_id** - (Optional) By default any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation for a stronger source of encryption keys. (e.g. a7e63k4b-8loc-40db-a2a1-4d0en2cd8331) -5. **disable_spill_encryption** - (Optional) Defaults to False so that any data that is spilled to S3 is encrypted using AES-GCM either with a randomly generated key or using KMS to generate keys. Setting this to false will disable spill encryption. You may wish to disable this for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption. (e.g. True or False) -6. **glue_catalog** - (Optional) Can be used to target a cross-account Glue catalog. By default the connector will attempt to get metadata from its own Glue account. - -### Setting Up Databases & Tables - -To enable a Glue Table for use with Redis, you can set the following properties on the Table. redis-endpoint , redis-value-type, and one of redis-keys-zset or redis-key-prefix. Also note that any Glue database which may contain redis tables should have "redis-db-flag" somewhere in the URI property of the Database. You can set this from the Glue Console by editing the database. - -1. **redis-endpoint** - (required) The hostname:port:password of the redis server that data for this table should come from. (e.g. athena-federation-demo.cache.amazonaws.com:6379) Alternatively, you can store the endpoint or part of the endpoint in AWS Secrets Manager by using ${secret_name} as the table property value.* -2. **redis-keys-zset** - (required if not using # 3) A comma separated list of keys whose value is a zset. Each of the values in the zset is then treated as a key that is part of this table. You must set either this or redis-key-prefix. (e.g. active-orders,pending-orders) -3. **redis-key-prefix** - (required if not using # 2) A comma separated list of key prefixes to scan for values that should be part of this table. You must set either this or redis-keys-zset on the table. (e.g. accounts-*,acct-) -4. 
**redis-value-type** - (required) Defines how the value for the keys defined by either redis-key-prefix or redis-keys-zset will be mapped to your table. `literal` maps to a single column. `zset` also maps to a single column, but each key can essentially store N rows. `hash` allows each key to be a row with multiple columns. (e.g. hash, literal, or zset)
-5. **redis-ssl-flag** - (optional) Defaults to False; setting this to True will create a Redis connection with SSL/TLS. (e.g. True or False)
-6. **redis-cluster-flag** - (optional) Defaults to False; setting this to True will enable support for clustered Redis instances. (e.g. True or False)
-7. **redis-db-number** - (optional, applies to standalone instances only, NOT clustered) Defaults to Redis logical database 0; set this to read from a non-default Redis database. (e.g. 1,2,3,...)**
-
-*To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-
-**This does not refer to a database in Athena/Glue, but a Redis logical database. Refer to [SELECT index](https://redis.io/commands/select) for more information.
-
-### Data Types
-
-All Redis values are retrieved as the basic String data type. From there they are converted to one of the below Apache Arrow data types used by the Athena Query Federation SDK, based on how you've defined your table(s) in Glue's DataCatalog.
-
-|Glue DataType|Apache Arrow Type|
-|-------------|-----------------|
-|int|INT|
-|string|VARCHAR|
-|bigint|BIGINT|
-|double|FLOAT8|
-|float|FLOAT4|
-|smallint|SMALLINT|
-|tinyint|TINYINT|
-|boolean|BIT|
-|binary|VARBINARY|
-
-### Required Permissions
-
-Review the "Policies" section of the athena-redis.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.
-
-1. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
-2. SecretsManager Read Access - If you choose to store redis-endpoint details in SecretsManager, you will need to grant the connector access to those secrets.
-3. Glue Data Catalog - Since Redis does not have a metadata store, the connector requires Read-Only access to Glue's DataCatalog for obtaining Redis key to table/column mappings.
-4. VPC Access - In order to connect to your VPC for the purposes of communicating with your Redis instance(s), the connector needs the ability to attach/detach an interface to the VPC.
-5. CloudWatch Logs - This is a somewhat implicit permission when deploying a Lambda function, but the function needs access to CloudWatch Logs for storing logs.
-6. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.
-
-### Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless,
-the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the -[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module. - -In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file. -For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module. - -Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory. - -### Deploying The Connector - -To use the Amazon Athena Redis Connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source follow the below steps or use the more detailed tutorial in the athena-example module: - -1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already. -2. From the athena-redis dir, run `mvn clean install`. -3. From the athena-redis dir, run `../tools/publish.sh S3_BUCKET_NAME athena-redis` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This will allow users with permission to do so, the ability to deploy instances of the connector via 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo) - -## Performance - -The Athena Redis Connector will attempt to parallelize queries against your Redis instance depending on the type of table you've defined (zset keys vs. prefix keys). Predicate Pushdown is performed within the Lambda function. - -## License - -This project is licensed under the Apache-2.0 License. +Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-redis.html). diff --git a/athena-redshift/README.md b/athena-redshift/README.md index c1568dcf0f..95b4bf1361 100644 --- a/athena-redshift/README.md +++ b/athena-redshift/README.md @@ -2,170 +2,4 @@ This connector enables Amazon Athena to access your Redshift database using JDBC driver. -**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.** - -# Terms - -* **Database Instance:** Any instance of a database deployed on premises, EC2 or using RDS. -* **Handler:** A Lambda handler accessing your database instance(s). Could be metadata or a record handler. -* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s). -* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s). 
-* **Composite Handler:** A Lambda handler that retrieves both metadata and data records from your database instance(s). This is the recommended Lambda function handler.
-* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-# Usage
-
-## Parameters
-
-The Redshift Connector supports several configuration parameters using Lambda environment variables.
-
-### Connection String:
-
-A JDBC connection string is used to connect to a database instance. The following format is supported: `redshift://${jdbc_connection_string}`.
-
-### Multiplexing handler parameters
-
-The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed depending on catalog name. Use the following classes in Lambda to use the multiplexer.
-
-|Handler|Class|
-|--- |--- |
-|Composite Handler|RedshiftMuxCompositeHandler|
-|Metadata Handler|RedshiftMuxMetadataHandler|
-|Record Handler|RedshiftMuxRecordHandler|
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string    Database instance connection string. One of two types specified above. Required.
-                                Example: If the catalog as registered with Athena is myredshiftcatalog, then the environment variable name should be myredshiftcatalog_connection_string
-
-default                         Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for a Mux Lambda function that supports two database instances, redshift1 (default) and redshift2:
-
-|Property|Value|
-|---|---|
-|default|redshift://jdbc:redshift://redshift1.host:5439/dev?user=sample2&password=sample2|
-| | |
-|redshift_catalog1_connection_string|redshift://jdbc:redshift://redshift1.host:5439/default?${Test/RDS/Redshift1}|
-| | |
-|redshift_catalog2_connection_string|redshift://jdbc:redshift://redshift2.host:5439/default?user=sample2&password=sample2|
-
-The Redshift Connector supports substitution of any string enclosed like *${SecretName}* with *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-redshift://jdbc:redshift://redshift1.host:5439/default?...&${Test/RDS/Redshift1}&...
-```
-
-will be modified to:
-
-```
-redshift://jdbc:redshift://redshift1.host:5439/default?...&user=sample2&password=sample2&...
-```
-
-Secret Name `Test/RDS/Redshift1` will be used to retrieve secrets.
-
-Currently Redshift recognizes the `user` and `password` JDBC properties.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single Redshift instance.
-```
-Composite Handler    RedshiftCompositeHandler
-Metadata Handler     RedshiftMetadataHandler
-Record Handler       RedshiftRecordHandler
-```
-
-**Parameters:**
-
-```
-default    Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must provide the `default` connection string parameter. All other connection strings are ignored.
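-
-As a rough illustration of the parameter resolution described above, the hedged Java sketch below shows how a handler might pick the `${catalog}_connection_string` environment variable for the incoming catalog, fall back to `default`, and splice credentials into the `${SecretName}` placeholder. It is a simplified sketch, not the connector's actual code; the sample values mirror the examples above.
-
-```java
-import java.util.Map;
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
-public class ConnectionStringResolution {
-    // Matches a ${SecretName} placeholder embedded in a connection string.
-    private static final Pattern SECRET = Pattern.compile("\\$\\{([^}]+)}");
-
-    static String resolve(String catalog, Map<String, String> env) {
-        // Prefer the catalog-specific variable; otherwise fall back to "default".
-        return env.getOrDefault(catalog + "_connection_string", env.get("default"));
-    }
-
-    static String substituteSecret(String connectionString) {
-        Matcher m = SECRET.matcher(connectionString);
-        if (!m.find()) {
-            return connectionString; // credentials already inline
-        }
-        // A real handler would fetch m.group(1) from Secrets Manager;
-        // literal sample credentials are used here for illustration only.
-        return m.replaceFirst("user=sample2&password=sample2");
-    }
-
-    public static void main(String[] args) {
-        Map<String, String> env = Map.of(
-                "default", "redshift://jdbc:redshift://redshift1.host:5439/dev?${Test/RDS/Redshift1}");
-        System.out.println(substituteSecret(resolve("unknowncatalog", env)));
-    }
-}
-```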
- -**Example property for a single Redshift instance supported by a Lambda function:** - -|Property|Value| -|---|---| -|default|redshift://redshift1.host:3306/default?secret=Test/RDS/Redshift1| - -### Spill parameters: - -Lambda SDK may spill data to S3. All database instances accessed using a single Lambda spill to the same location. - -``` -spill_bucket Spill bucket name. Required. -spill_prefix Spill bucket key prefix. Required. -spill_put_request_headers JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html -``` - -# Data types support - -|Jdbc|Arrow| -| ---|---| -|Boolean|Bit -|Integer|Tiny -|Short|Smallint -|Integer|Int -|Long|Bigint -|float|Float4 -|Double|Float8 -|Date|DateDay -|Timestamp|DateMilli -|String|Varchar -|Bytes|Varbinary -|BigDecimal|Decimal -|ARRAY|List| - -See Redshift documentation for conversion between JDBC and database types. - -# Secrets - -We support two ways to input database username and password: - -1. **AWS Secrets Manager:** The name of the secret in AWS Secrets Manager can be embedded in JDBC connection string, which is used to replace with `username` and `password` values from Secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format `{“username”: “${username}”, “password”: “${password}”}.`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager. -2. **Connection String:** Username and password can be specified as properties in the JDBC connection string. - -# Partitions and Splits - -**Note:** Redshift does not support external partitions. Performance with huge datasets is slow. - -### Running Integration Tests - -The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, -the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly. -For build commands and step-by-step instructions on building and running the integration tests see the -[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module. - -In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file. -For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module. - -Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory. 
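-
-Relating to the Secrets section above, the hedged Java sketch below fetches a secret of the documented `{"username": ..., "password": ...}` shape and splices it into a JDBC connection string. It assumes the AWS SDK for Java v2 and Jackson are on the classpath; the secret name and endpoint come from the examples above and are otherwise hypothetical.
-
-```java
-import com.fasterxml.jackson.databind.JsonNode;
-import com.fasterxml.jackson.databind.ObjectMapper;
-import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
-import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;
-
-public class SecretsManagerCredentials {
-    public static void main(String[] args) throws Exception {
-        try (SecretsManagerClient client = SecretsManagerClient.create()) {
-            // Fetch the raw JSON secret, e.g. {"username": "etl_user", "password": "..."}.
-            String json = client.getSecretValue(
-                    GetSecretValueRequest.builder().secretId("Test/RDS/Redshift1").build())
-                    .secretString();
-            JsonNode creds = new ObjectMapper().readTree(json);
-            String connection = "jdbc:redshift://redshift1.host:5439/dev"
-                    + "?user=" + creds.get("username").asText()
-                    + "&password=" + creds.get("password").asText();
-            System.out.println(connection); // a real handler would open the JDBC connection here
-        }
-    }
-}
-```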
- -### Deploying The Connector - -To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from -source follow the below steps or use the more detailed tutorial in the athena-example module: - -1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already. -2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already - (**Note: failure to follow this step will result in compilation errors**). -3. From the **athena-jdbc** dir, run `mvn clean install`. -4. From the **athena-redshift** dir, run `mvn clean install`. -5. From the **athena-redshift** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-redshift` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This will allow users with permission to do so, the ability to deploy instances of the connector via 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo) - -# JDBC Driver Versions - -For latest version information see [pom.xml](./pom.xml). - -# Limitations -* Write DDL operations are not supported. -* In Mux setup, spill bucket and prefix is shared across all database instances. -* Any relevant Lambda Limits. See Lambda documentation. -* Redshift does not support external partitions so all data will be retrieved every time. +Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-redshift.html). diff --git a/athena-saphana/README.md b/athena-saphana/README.md index 574ba7bd08..ed361c75e6 100644 --- a/athena-saphana/README.md +++ b/athena-saphana/README.md @@ -2,196 +2,4 @@ This connector enables Amazon Athena to access your SAP HANA SQL database or RDS instance(s) using JDBC driver. -**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.** - -By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-saphana/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-saphana/LICENSE.txt). - -# Terms - -* **Database Instance:** Any instance of a database deployed on premises, EC2 or using RDS. -* **Handler:** A Lambda handler accessing your database instance(s). Could be metadata or a record handler. -* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s). -* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s). -* **Composite Handler:** A Lambda handler that retrieves metadata and data records from your database instance(s). This is recommended to be set as lambda function handler. 
-* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-# Usage
-
-## Parameters
-
-The SAP HANA Connector supports several configuration parameters using Lambda environment variables.
-
-### Connection String:
-
-A JDBC connection string is used to connect to a database instance. The following format is supported: `saphana://${jdbc_connection_string}`.
-
-### Multiplexing handler parameters
-
-The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed depending on catalog name. Use the following classes in Lambda to use the multiplexer.
-
-|Handler|Class|
-|--- |--- |
-|Composite Handler|SaphanaMuxCompositeHandler|
-|Metadata Handler|SaphanaMuxMetadataHandler|
-|Record Handler|SaphanaMuxRecordHandler|
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string    Database instance connection string. One of two types specified above. Required.
-                                Example: If the catalog as registered with Athena is saphanacatalog, then the environment variable name should be saphanacatalog_connection_string
-
-default                         Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for a SAP HANA Mux Lambda function that supports two database instances, saphana1 (default) and saphana2:
-
-|Property|Value|
-|---|---|
-|default|saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1}|
-| | |
-|saphana_catalog1_connection_string|saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1}|
-| | |
-|saphana_catalog2_connection_string|saphana://jdbc:sap://saphana2.host:port/?user=sample2&password=sample2|
-
-The SAP HANA Connector supports substitution of any string enclosed like *${SecretName}* with *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1}&...
-```
-
-will be modified to:
-
-```
-saphana://jdbc:sap://saphana1.host:port/?user=sample2&password=sample2&...
-```
-
-Secret Name `Test/RDS/Saphana1` will be used to retrieve secrets.
-
-Currently SAP HANA recognizes the `user` and `password` JDBC properties.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single SAP HANA instance.
-```
-Composite Handler    SaphanaCompositeHandler
-Metadata Handler     SaphanaMetadataHandler
-Record Handler       SaphanaRecordHandler
-```
-
-**Parameters:**
-
-```
-default    Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must provide the `default` connection string parameter. All other connection strings are ignored.
-
-**Example property for a single SAP HANA instance supported by a Lambda function:**
-
-|Property|Value|
-|---|---|
-|default|saphana://jdbc:sap://saphana1.host:port/?secret=Test/RDS/Saphana1|
-
-### Spill parameters:
-
-Lambda SDK may spill data to S3. All database instances accessed using a single Lambda spill to the same location.
- -``` -spill_bucket Spill bucket name. Required. -spill_prefix Spill bucket key prefix. Required. -spill_put_request_headers JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html -``` - -# Data types support - -|Jdbc|Arrow| -| ---|---| -|Boolean|Bit| -|Integer|Tiny| -|Short|Smallint| -|Integer|Int| -|Long|Bigint| -|float|Float4| -|Double|Float8| -|Date|DateDay| -|Timestamp|DateMilli| -|String|Varchar| -|Bytes|Varbinary| -|BigDecimal|Decimal| -|ARRAY|List| - -See SAP HANA documentation for conversion between JDBC and database types. - -# Data Types Conversion -In order to make the source (SAP HANA) and Athena data types compatible, there are number of data types conversion which are being included in the HANA connector. This is in addition to JDBC ARROW conversion which are not covered. The purpose of these conversions is to make sure the mismatches on the type of data types have been addressed so that queries do get executed successfully. The details of the data types which have been converted are as below: - -|Source Data Type (SAP HANA)|Converted Data Type (Athena)| -| ---|---| -|DECIMAL|BIGINT| -|INTEGER|INT| -|DATE|DATEDAY| -|TIMESTAMP|DATEMILLI| - -In addition, all the unsupported data types are getting converted to VARCHAR. - -# Secrets - -We support two ways to input database username and password: - -1. **AWS Secrets Manager:** The name of the secret in AWS Secrets Manager can be embedded in JDBC connection string, which is used to replace with `username` and `password` values from Secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format `{“username”: “${username}”, “password”: “${password}”}.`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager. -2. **Connection String:** Username and password can be specified as properties in the JDBC connection string. - -# Partitions and Splits -A partition is represented by a single partition column of type Integer. We leverage partitions defined on a Saphana table, and this column contains partition names. For a table that does not have partition names, is considered as single partition. A partition is equivalent to a split. - - -|Name|Type|Description| -|---|---|---| -|PART_ID|Integer|Named partition in SAP HANA. E.g. 1,2,3| - -# Running Integration Tests - -The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly. 
-For build commands and step-by-step instructions on building and running the integration tests see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module. - -In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file. -For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module. - -Once all prerequisites have been satisfied, the integration tests can be executed by specifying the following command: `mvn failsafe:integration-test` from the connector's root directory. - -_**_ SAP HANA integration Test suite will not create any SAP HANA RDS instances and Database Schemas, Instead it will use existing SAP HANA instances. - - -# Deploying The Connector - -To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source follow the below steps or use the more detailed tutorial in the athena-example module: - -1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already. -2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already - (**Note: failure to follow this step will result in compilation errors**). -3. From the **athena-jdbc** dir, run `mvn clean install`. -4. From the **athena-saphana** dir, run `mvn clean install`. -5. From the **athena-saphana** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-saphana` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This will allow users with permission to do so, the ability to deploy instances of the connector via 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo) - -# JDBC Driver Versions - -For latest version information see [pom.xml](./pom.xml). - -# Limitations -* Write DDL operations are not supported. -* In Mux setup, spill bucket and prefix is shared across all database instances. -* Any relevant Lambda Limits. See Lambda documentation. -* In SAP HANA, object names are converted to uppercase when they are stored in the SAP HANA database. However, if you enclose a name in quotation marks, it is case sensitive and this implicitly states that 2 different tables can have same name in lower and upper case i.e. 1. EMPLOYEE and 2. employee. In AFQ, the Schema/Table names are pushed to the lambda function in lower case. - In order to handle this issue and as a work around for this specific scenario, users are expected to provide query hints to retrieve the data from the tables where the name is case sensitive. Query hints can be avoided for rest of the scenarios. - Here are the sample queries with query hints. 
-  SELECT * FROM "lambda:saphanaconnector".SYSTEM."MY_TABLE@schemaCase=upper&tableCase=upper” -  SELECT * FROM "lambda:saphanaconnector".SYSTEM."MY_TABLE@schemaCase=upper&tableCase=lower” - -# Performance tuning - -Saphana supports native partitions. Athena's lambda connector can retrieve data from these partitions in parallel. We highly recommend native partitioning for retrieving huge datasets with uniform partition distribution. +Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-sap-hana.html). diff --git a/athena-snowflake/README.md b/athena-snowflake/README.md index e3f5369a39..65128b66c7 100644 --- a/athena-snowflake/README.md +++ b/athena-snowflake/README.md @@ -2,235 +2,4 @@ This connector enables Amazon Athena to access your Snowflake SQL database or RDS instance(s) using JDBC driver. -**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.** - -By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-snowflake/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-snowflake/LICENSE.txt). - -# Terms - -* **Database Instance:** Any instance of a database deployed on premises, EC2 or using RDS. -* **Handler:** A Lambda handler accessing your database instance(s). Could be metadata or a record handler. -* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s). -* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s). -* **Composite Handler:** A Lambda handler that retrieves metadata and data records from your database instance(s). This is recommended to be set as lambda function handler. -* **Multiplexing Handler:** a Lambda handler that can accept and use multiple different database connections. -* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables. -* **Connection String:** Used to establish connection to a database instance. -* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix `connection_string` property. - -# Usage - -## Parameters - -The Snowflake Connector supports several configuration parameters using Lambda environment variables. - -### Connection String: - -A JDBC Connection string is used to connect to a database instance. Following format is supported: `snowflake://${jdbc_connection_string}`. - -### Multiplexing handler parameters - -Multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed depending on catalog name. Use following classes in Lambda for using multiplexer. 
-
-|Handler|Class|
-|--- |--- |
-|Composite Handler|SnowflakeMuxCompositeHandler|
-|Metadata Handler|SnowflakeMuxMetadataHandler|
-|Record Handler|SnowflakeMuxRecordHandler|
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string Database instance connection string. One of two types specified above. Required.
-                             Example: If the catalog as registered with Athena is snowflakecatalog, then the environment variable name should be snowflakecatalog_connection_string.
-
-default Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for a Snowflake Mux Lambda function that supports two database instances, snowflake1 (default) and snowflake2:
-
-|Property|Value|
-|---|---|
-|default|```snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&${Test/RDS/Snowflake1}```|
-| | |
-|snowflake_catalog1_connection_string|```snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&${Test/RDS/Snowflake1}```|
-| | |
-|snowflake_catalog2_connection_string|```snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&${Test/RDS/Snowflake1}```|
-| | |
-|snowflake_catalog3_connection_string|```snowflake://jdbc:snowflake://snowflake2.host:port/?warehouse=warehousename&db=db1&schema=schema1&user=sample2&password=sample2```|
-
-The Snowflake Connector supports substitution of any string enclosed like *${SecretName}* with the *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&${Test/RDS/Snowflake1}&...
-```
-
-will be modified to:
-
-```
-snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&user=sample2&password=sample2&...
-```
-
-The secret name `Test/RDS/Snowflake1` will be used to retrieve the secret.
-
-Currently, Snowflake recognizes the `user` and `password` JDBC properties. The username and password retrieved from the secret are supplied as these values, without the explicit `user` and `password` keys.
-
-Please note: if the user provides schema details in the connection string, the data source is restricted to tables in that particular schema. If the user does not provide schema details, the data source
-will display the tables present in the database mentioned in the connection string. Example:
-
-```
-snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&${Test/RDS/Snowflake1}
-```
-The above will load the tables present in schema1, and not those in other schemas in db1.
-
-```
-snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&${Test/RDS/Snowflake1}
-```
-The above will load the tables present in db1.
-
-### References
-
-```
-https://docs.snowflake.com/en/user-guide/jdbc-configure.html#jdbc-driver-connection-string
-```
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single Snowflake instance.
-```
-Composite Handler SnowflakeCompositeHandler
-Metadata Handler SnowflakeMetadataHandler
-Record Handler SnowflakeRecordHandler
-```
-
-**Parameters:**
-
-```
-default Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must be provided the `default` connection string parameter. All other connection strings are ignored.
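
As a minimal sketch of wiring this up (the function name and connection details below are illustrative assumptions, not values from this README), the `default` connection string can be set on an already-deployed Lambda function with the AWS CLI:

```
# Hypothetical function name and endpoint; replace with your own values.
# Note: update-function-configuration replaces the entire environment, so
# include spill_bucket, spill_prefix, and any other variables you rely on.
aws lambda update-function-configuration \
  --function-name athena-snowflake-connector \
  --environment '{"Variables":{"default":"snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&${Test/RDS/Snowflake1}","spill_bucket":"my_bucket","spill_prefix":"athena-federation-spill"}}'
```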
-
-**Example property for a single Snowflake instance supported by a Lambda function:**
-
-|Property|Value|
-|---|---|
-|default|```snowflake://jdbc:snowflake://snowflake1.host:port/?secret=Test/RDS/Snowflake1```|
-
-### Spill parameters:
-
-The Lambda SDK may spill data to S3. All database instances accessed using a single Lambda function spill to the same location.
-
-```
-spill_bucket Spill bucket name. Required.
-spill_prefix Spill bucket key prefix. Required.
-spill_put_request_headers JSON encoded map of request headers and values for the S3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-```
-
-### PageCount parameter
-Limits the number of records per partition.
-
-The default value is 500000.
-
-### PartitionLimit parameter
-Limits the number of partitions. A large number may cause a time-out during query execution; please reset to a lower value if you encounter a time-out error.
-
-The default value is 10.
-
-# Data types support
-
-|Jdbc|Arrow|
-| ---|---|
-|Boolean|Bit|
-|Integer|Tiny|
-|Short|Smallint|
-|Integer|Int|
-|Long|Bigint|
-|Float|Float4|
-|Double|Float8|
-|Date|DateDay|
-|Timestamp|DateMilli|
-|String|Varchar|
-|Bytes|Varbinary|
-|BigDecimal|Decimal|
-|ARRAY|List|
-
-See the Snowflake documentation for conversion between JDBC and database types.
-
-# Data Types Conversion
-To make the source (Snowflake) and Athena data types compatible, the Snowflake connector performs a number of data type conversions in addition to the standard JDBC-to-Arrow conversions. These conversions ensure that data type mismatches are addressed so that queries execute successfully. The converted data types are listed below:
-
-|Source Data Type (Snowflake)| Converted Data Type (Athena) |
-| ---|------------------------------|
-|TIMESTAMP| DATEMILLI |
-|TIMESTAMP_NTZ| DATEMILLI |
-|TIMESTAMP_LTZ| DATEMILLI |
-|TIMESTAMP_TZ| DATEMILLI |
-|DATE| DATEDAY |
-|INTEGER| INT |
-
-In addition, all unsupported data types are converted to VARCHAR.
-
-# Secrets
-
-We support two ways to input the database username and password:
-
-1. **AWS Secrets Manager:** The name of a secret in AWS Secrets Manager can be embedded in the JDBC connection string; it is replaced with the `username` and `password` values from the secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format: `{"username": "${username}", "password": "${password}"}`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-2. **Connection String:** The username and password can be specified as properties in the JDBC connection string.
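
As a quick sketch of option 1 (the secret name matches the `Test/RDS/Snowflake1` example used above; the credential values are placeholders), such a secret can be created with the AWS CLI:

```
# Store database credentials as JSON in AWS Secrets Manager.
# "db_user" and "db_password" are placeholder values.
aws secretsmanager create-secret \
  --name Test/RDS/Snowflake1 \
  --secret-string '{"username": "db_user", "password": "db_password"}'
```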
-
-# Partitions and Splits
-
-A partition is represented by a single partition column of type varchar. The connector implements custom partition logic for Snowflake at the Athena layer to enable parallel processing. A partition is equivalent to a split. Snowflake automatically determines the most efficient compression algorithm for the columns in each micro-partition.
-
-
-|Name|Type|Description|
-|---|---|---|
-|partition|varchar|Custom partition in Athena. E.g. p-limit-3000-offset-0, p-limit-3000-offset-3000, p-limit-3000-offset-6000|
-
-# Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests, see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by running the following command from the connector's root directory: `mvn failsafe:integration-test`.
-
-**Note:** The Snowflake integration test suite will not create any Snowflake RDS instances or database schemas; instead, it will use an existing Snowflake instance.
-
-
-# Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below, or use the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-jdbc** dir, run `mvn clean install`.
-4. From the **athena-snowflake** dir, run `mvn clean install`.
-5. From the **athena-snowflake** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-snowflake` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to do so to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo)
-
-# JDBC Driver Versions
-
-For the latest version information see [pom.xml](./pom.xml).
-
-# Limitations
-* Write DDL operations are not supported.
-* In a Mux setup, the spill bucket and prefix are shared across all database instances.
-* Any relevant Lambda limits. See the Lambda documentation.
-* In Snowflake, object names are case sensitive, which implicitly means that two different tables can have the same name in lower and upper case, e.g. EMPLOYEE and employee. In Athena Federated Query, schema and table names are pushed to the Lambda function in lower case.
-  To handle this issue, and as a workaround for this specific scenario, users are expected to provide query hints to retrieve data from tables whose names are case sensitive. Query hints can be avoided for the rest of the scenarios.
-  Here are the sample queries with query hints.
-  SELECT * FROM "lambda:athenasnowflake".SYSTEM."MY_TABLE@schemaCase=upper&tableCase=upper"
-  SELECT * FROM "lambda:athenasnowflake".SYSTEM."MY_TABLE@schemaCase=upper&tableCase=lower"
-* The current version does not support Snowflake views. Support will be added in a future upgrade.
-
-# Performance tuning
-
-We suggest using filters in queries for optimized performance. In addition, we highly recommend native partitioning when retrieving huge datasets with uniform partition distribution.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-snowflake.html).
diff --git a/athena-sqlserver/README.md b/athena-sqlserver/README.md
index eeb5030cfb..6c4294e02e 100644
--- a/athena-sqlserver/README.md
+++ b/athena-sqlserver/README.md
@@ -2,181 +2,4 @@
 
 This connector enables Amazon Athena to access your SQL Server databases.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-sqlserver/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-sqlserver/LICENSE.txt).
-
-# Terms
-
-* **Database Instance:** Any instance of a database deployed on premises, on EC2, or using RDS.
-* **Handler:** A Lambda handler accessing your database instance(s). Can be a metadata or a record handler.
-* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s).
-* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s).
-* **Composite Handler:** A Lambda handler that retrieves both metadata and data records from your database instance(s). We recommend setting this as the Lambda function handler.
-* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-# Usage
-
-## Parameters
-
-The SQL Server Connector supports several configuration parameters set as Lambda environment variables.
-
-### Connection String:
-
-A JDBC connection string is used to connect to a database instance.
-The following format is supported:
-
-`sqlserver://${jdbc_connection_string}`
-
-### Multiplexing handler parameters
-
-The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed based on catalog name. Use the following classes in Lambda when using the multiplexer.
-
-| Handler | Class |
-|-------------------|--------------------------------|
-| Composite Handler | SqlServerMuxCompositeHandler |
-| Metadata Handler | SqlServerMuxMetadataHandler |
-| Record Handler | SqlServerMuxRecordHandler |
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string Database instance connection string. One of two types specified above. Required.
-                             Example: If the catalog as registered with Athena is sqlservercatalog, then the environment variable name should be sqlservercatalog_connection_string.
-
-default Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for a SQL Server Mux Lambda function that supports two database instances, sqlserver1 (default) and sqlserver2:
-
-| Property | Value |
-|-------------------------------------|----------------------------------------------------------------------------------------------------|
-| default | sqlserver://jdbc:sqlserver://sqlserver1.hostname:port;databaseName=;${secret1_name} |
-| sqlservercatalog1_connection_string | sqlserver://jdbc:sqlserver://sqlserver1.hostname:port;databaseName=;${secret1_name} |
-| sqlservercatalog2_connection_string | sqlserver://jdbc:sqlserver://sqlserver2.hostname:port;databaseName=;${secret2_name} |
-
-The SQL Server Connector supports substitution of any string enclosed like *${secret1_name}* with the *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-sqlserver://jdbc:sqlserver://hostname:port;databaseName=;${secret_name}
-```
-
-will be modified to:
-
-```
-sqlserver://jdbc:sqlserver://hostname:port;databaseName=;user=;password=
-```
-
-The secret name `secret_name` will be used to retrieve the secret.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single SQL Server instance.
-```
-Composite Handler SqlServerCompositeHandler
-Metadata Handler SqlServerMetadataHandler
-Record Handler SqlServerRecordHandler
-```
-
-**Parameters:**
-
-```
-default Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must be provided the `default` connection string parameter. All other connection strings are ignored.
-
-**Example property for a single SQL Server instance supported by a Lambda function:**
-
-| Property | Value |
-|----------|----------------------------------------------------------------------------------------|
-| default | sqlserver://jdbc:sqlserver://hostname:port;databaseName=;${secret_name} |
-
-### Spill parameters:
-
-The Lambda SDK may spill data to S3. All database instances accessed using a single Lambda function spill to the same location.
-
-```
-spill_bucket Spill bucket name. Required.
-spill_prefix Spill bucket key prefix. Required.
-spill_put_request_headers JSON encoded map of request headers and values for the S3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-```
-
-# Data types support
-
-| Sql Server | Arrow |
-|------------------|-------------------|
-| bit | TINYINT |
-| tinyint | SMALLINT |
-| smallint | SMALLINT |
-| int | INT |
-| bigint | BIGINT |
-| decimal | DECIMAL |
-| numeric | FLOAT8 |
-| smallmoney | FLOAT8 |
-| money | DECIMAL |
-| float[24] | FLOAT4 |
-| float[53] | FLOAT8 |
-| real | FLOAT4 |
-| datetime | Date(MILLISECOND) |
-| datetime2 | Date(MILLISECOND) |
-| smalldatetime | Date(MILLISECOND) |
-| date | Date(DAY) |
-| time | VARCHAR |
-| datetimeoffset | Date(MILLISECOND) |
-| char[n] | VARCHAR |
-| varchar[n/max] | VARCHAR |
-| nchar[n] | VARCHAR |
-| nvarchar[n/max] | VARCHAR |
-| text | VARCHAR |
-| ntext | VARCHAR |
-
-# Secrets
-
-We support two ways to input the database username and password:
-
-1. **AWS Secrets Manager:** The name of a secret in AWS Secrets Manager can be embedded in the JDBC connection string; it is replaced with the `user` and `password` values from the secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format: `{"username": "${username}", "password": "${password}"}`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-2. **Connection String:** The username and password can be specified as properties in the JDBC connection string.
-
-# Partitions and Splits
-
-A partition is represented by a single partition column of type varchar. In the case of the SQL Server connector, partitions are created using a partition function, which contains the logic that determines how partitions are applied to the table. The partition function and column name are retrieved from the SQL Server metadata tables, and a custom query is then used to get the partitions. Splits are created based on the number of distinct partitions received.
-
-# Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests, see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by running the following command from the connector's root directory: `mvn failsafe:integration-test`.
-
-**Note:** The SQL Server integration test suite will not create any SQL Server RDS instances or database schemas; instead, it will use the existing SQL Server RDS or EC2 instances specified in the JSON config file.
-
-# Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below, or use the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-jdbc** dir, run `mvn clean install`.
-4. From the **athena-sqlserver** dir, run `mvn clean install`.
-5. From the **athena-sqlserver** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-sqlserver` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to do so to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo)
-
-# JDBC Driver Versions
-
-For the latest version information see [pom.xml](./pom.xml).
-
-# Limitations
-
-* Write DDL operations are not supported.
-* In a Mux setup, the spill bucket and prefix are shared across all database instances.
-* Any relevant Lambda limits. See the Lambda documentation.
-* Casting to the appropriate data type is needed in filter conditions for the Date and Timestamp data types.
-* Use <= or >= when searching for negative values of type Real and Float.
-* The binary, varbinary, image, and rowversion data types are not supported; there is a data discrepancy between SQL Server data and the data rendered in Athena.
-
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-microsoft-sql-server.html).
diff --git a/athena-synapse/README.md b/athena-synapse/README.md
index 131d660707..9c9434c008 100644
--- a/athena-synapse/README.md
+++ b/athena-synapse/README.md
@@ -2,178 +2,4 @@
 
 This connector enables Amazon Athena to access your Azure Synapse database.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-synapse/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-synapse/LICENSE.txt).
-
-# Terms
-
-* **Database Instance:** Any instance of a database deployed on premises, on EC2, or using RDS.
-* **Handler:** A Lambda handler accessing your database instance(s). Can be a metadata or a record handler.
-* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s).
-* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s).
-* **Composite Handler:** A Lambda handler that retrieves both metadata and data records from your database instance(s). We recommend setting this as the Lambda function handler.
-* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-# Usage
-
-## Parameters
-
-The Synapse Connector supports several configuration parameters set as Lambda environment variables.
-
-### Connection String:
-
-A JDBC connection string is used to connect to a database instance. The following format is supported:
-
-`synapse://${jdbc_connection_string}`
-
-### Multiplexing handler parameters
-
-The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed based on catalog name. Use the following classes in Lambda when using the multiplexer.
-
-| Handler | Class |
-|-------------------|-----------------------------|
-| Composite Handler | SynapseMuxCompositeHandler |
-| Metadata Handler | SynapseMuxMetadataHandler |
-| Record Handler | SynapseMuxRecordHandler |
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string Database instance connection string. One of two types specified above. Required.
-                             Example: If the catalog as registered with Athena is synapsecatalog, then the environment variable name should be synapsecatalog_connection_string.
-
-default Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for a Synapse Mux Lambda function that supports two database instances, synapse1 (default) and synapse2:
-
-| Property | Value |
-|-----------------------------------|-------------------------------------------------------------------------------------------------|
-| default | synapse://jdbc:sqlserver://synapse1.hostname:port;databaseName=;${secret1_name} |
-| synapsecatalog1_connection_string | synapse://jdbc:sqlserver://synapse1.hostname:port;databaseName=;${secret1_name} |
-| synapsecatalog2_connection_string | synapse://jdbc:sqlserver://synapse2.hostname:port;databaseName=;${secret2_name} |
-
-The Synapse Connector supports substitution of any string enclosed like *${secret1_name}* with the *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-synapse://jdbc:sqlserver://hostname:port;databaseName=;${secret_name}
-```
-
-will be modified to:
-
-```
-synapse://jdbc:sqlserver://hostname:port;databaseName=;user=;password=
-```
-
-The secret name `secret_name` will be used to retrieve the secret.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single Synapse instance.
-```
-Composite Handler SynapseCompositeHandler
-Metadata Handler SynapseMetadataHandler
-Record Handler SynapseRecordHandler
-```
-
-**Parameters:**
-
-```
-default Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must be provided the `default` connection string parameter. All other connection strings are ignored.
-
-**Example property for a single Synapse instance supported by a Lambda function:**
-
-| Property | Value |
-|----------|-------------------------------------------------------------------------------------|
-| default | synapse://jdbc:sqlserver://hostname:port;databaseName=;${secret_name} |
-
-### Spill parameters:
-
-The Lambda SDK may spill data to S3. All database instances accessed using a single Lambda function spill to the same location.
-
-```
-spill_bucket Spill bucket name. Required.
-spill_prefix Spill bucket key prefix. Required.
-spill_put_request_headers JSON encoded map of request headers and values for the S3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-```
-
-# Data types support
-
-| Synapse | Arrow |
-|-----------------|-------------------|
-| bit | TINYINT |
-| tinyint | SMALLINT |
-| smallint | SMALLINT |
-| int | INT |
-| bigint | BIGINT |
-| decimal | DECIMAL |
-| numeric | FLOAT8 |
-| smallmoney | FLOAT8 |
-| money | DECIMAL |
-| float[24] | FLOAT4 |
-| float[53] | FLOAT8 |
-| real | FLOAT4 |
-| datetime | Date(MILLISECOND) |
-| datetime2 | Date(MILLISECOND) |
-| smalldatetime | Date(MILLISECOND) |
-| date | Date(DAY) |
-| time | VARCHAR |
-| datetimeoffset | Date(MILLISECOND) |
-| char[n] | VARCHAR |
-| varchar[n/max] | VARCHAR |
-| nchar[n] | VARCHAR |
-| nvarchar[n/max] | VARCHAR |
-
-# Secrets
-
-We support two ways to input the database username and password:
-
-1. **AWS Secrets Manager:** The name of a secret in AWS Secrets Manager can be embedded in the JDBC connection string; it is replaced with the `user` and `password` values from the secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format: `{"username": "${username}", "password": "${password}"}`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-2. **Connection String:** The username and password can be specified as properties in the JDBC connection string.
-
-# Partitions and Splits
-
-Synapse supports range partitioning. A partition is represented by a single partition column of type varchar. The partitioning strategy is implemented by extracting the partition column and partition ranges from Synapse metadata tables; splits are then created by custom queries using these range values.
-
-# Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests, see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by running the following command from the connector's root directory: `mvn failsafe:integration-test`.
-
-**Note:** The Synapse integration test suite will not create any Synapse instances or database schemas; instead, it will use the existing Azure Synapse instance specified in the JSON config file.
-
-# Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below, or use the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-jdbc** dir, run `mvn clean install`.
-4. From the **athena-synapse** dir, run `mvn clean install`.
-5. From the **athena-synapse** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-synapse` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to do so to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo)
-
-# JDBC Driver Versions
-
-For the latest version information see [pom.xml](./pom.xml).
-
-# Limitations
-
-* Write DDL operations are not supported.
-* Casting to the appropriate data type is needed in filter conditions for the Date and Timestamp data types.
-* Use <= or >= when searching for negative values of type Real and Float.
-* The binary, varbinary, image, and rowversion data types are not supported; there is a data discrepancy between Synapse data and the data rendered in Athena.
-* In a Mux setup, the spill bucket and prefix are shared across all database instances.
-* Any relevant Lambda limits. See the Lambda documentation.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-azure-synapse.html).
diff --git a/athena-teradata/README.md b/athena-teradata/README.md
index 9ae4a54aab..bc80ddd7bb 100644
--- a/athena-teradata/README.md
+++ b/athena-teradata/README.md
@@ -2,195 +2,4 @@
 
 This connector enables Amazon Athena to access your Teradata databases.
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-teradata/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-teradata/LICENSE.txt).
-
-# Terms
-
-* **Database Instance:** Any instance of a database deployed on premises, on EC2, or using RDS.
-* **Handler:** A Lambda handler accessing your database instance(s). Can be a metadata or a record handler.
-* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s).
-* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s).
-* **Composite Handler:** A Lambda handler that retrieves both metadata and data records from your database instance(s). We recommend setting this as the Lambda function handler.
-* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-# Usage
-
-## Parameters
-
-The Teradata Connector supports several configuration parameters set as Lambda environment variables.
-
-### Connection String:
-
-A JDBC connection string is used to connect to a database instance. The following format is supported: `teradata://${jdbc_connection_string}`.
-
-### Multiplexing handler parameters
-
-The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed based on catalog name. Use the following classes in Lambda when using the multiplexer.
-
-|Handler|Class|
-|--- |--- |
-|Composite Handler|TeradataMuxCompositeHandler|
-|Metadata Handler|TeradataMuxMetadataHandler|
-|Record Handler|TeradataMuxRecordHandler|
-
-
-**Parameters:**
-
-```
-${catalog}_connection_string Database instance connection string. One of two types specified above. Required.
-                             Example: If the catalog as registered with Athena is teradatacatalog, then the environment variable name should be teradatacatalog_connection_string.
-
-default Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
-```
-
-Example properties for a Teradata Mux Lambda function that supports two database instances, teradata1 (default) and teradata2:
-
-|Property|Value|
-|---|---|
-|default|teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,${Test/RDS/Teradata1}|
-| | |
-|teradata_catalog1_connection_string|teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,${Test/RDS/Teradata1}|
-| | |
-|teradata_catalog2_connection_string|teradata://jdbc:teradata://teradata2.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,user=sample2&password=sample2|
-
-The Teradata Connector supports substitution of any string enclosed like *${SecretName}* with the *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,${Test/RDS/Teradata1}&...
-```
-
-will be modified to:
-
-```
-teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,user=sample2&password=sample2&...
-```
-
-The secret name `Test/RDS/Teradata1` will be used to retrieve the secret.
-
-Currently, Teradata recognizes the `user` and `password` JDBC properties. The username and password retrieved from the secret are supplied as these values, without the explicit `user` and `password` keys.
-
-### Single connection handler parameters
-
-Single connection metadata and record handlers can also be used to connect to a single Teradata instance.
-```
-Composite Handler TeradataCompositeHandler
-Metadata Handler TeradataMetadataHandler
-Record Handler TeradataRecordHandler
-```
-
-**Parameters:**
-
-```
-default Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-These handlers support one database instance and must be provided the `default` connection string parameter. All other connection strings are ignored.
-
-**Example property for a single Teradata instance supported by a Lambda function:**
-
-|Property|Value|
-|---|---|
-|default|teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,secret=Test/RDS/Teradata1|
-
-### Spill parameters:
-
-The Lambda SDK may spill data to S3. All database instances accessed using a single Lambda function spill to the same location.
-
-```
-spill_bucket Spill bucket name. Required.
-spill_prefix Spill bucket key prefix. Required.
-spill_put_request_headers JSON encoded map of request headers and values for the S3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-```
-
-# Data types support
-
-
-|Jdbc|Arrow|
-| ---|---|
-|Boolean|Bit|
-|Integer|Tiny|
-|Short|Smallint|
-|Integer|Int|
-|Long|Bigint|
-|Float|Float4|
-|Double|Float8|
-|Date|DateDay|
-|Timestamp|DateMilli|
-|String|Varchar|
-|Bytes|Varbinary|
-|BigDecimal|Decimal|
-|ARRAY|List|
-
-See the Teradata documentation for conversion between JDBC and database types.
-
-# Secrets
-
-We support two ways to input the database username and password:
-
-1. **AWS Secrets Manager:** The name of a secret in AWS Secrets Manager can be embedded in the JDBC connection string; it is replaced with the `username` and `password` values from the secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format: `{"username": "${username}", "password": "${password}"}`. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-2. **Connection String:** The username and password can be specified as properties in the JDBC connection string.
-
-# Partitions and Splits
-A partition is represented by a single partition column of type Integer. We leverage the partitions defined on a Teradata table, and this column contains the partition names. For a table that does not have partition names, * is returned, which is equivalent to a single partition. A partition is equivalent to a split.
-
-|Name|Type|Description|
-|---|---|---|
-|partition|Integer|Named partition in Teradata. E.g. 1, 2, 3|
-
-# Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests, see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by running the following command from the connector's root directory: `mvn failsafe:integration-test`.
-
-**Note:** The Teradata integration test suite will not create any Teradata RDS instances or database schemas; instead, it will use existing Teradata instances.
-
-# Connector deployment prerequisite
-For the Teradata connector, the Teradata JDBC driver must be attached as a Lambda layer before deploying the connector.
-Here are the steps that need to be followed for attaching a Lambda layer.
-1. Download the JDBC driver from the location below (you will need to create an account on the Teradata site):
-   https://downloads.teradata.com/download/connectivity/jdbc-driver
-2. Once downloaded locally, the jar file needs to be zipped.
-3. The folder structure should be as below:
-   Java/lib/terajdbc4.jar
-4. Zip the entire folder.
-5. In the AWS console, go to Lambda, click on Layers, give the layer a name, and use Java 11 as the runtime.
-6. Select "Upload a .zip file" to upload the zipped JDBC folder.
-7. Click on Create.
-8. Once the layer is created, copy its ARN, which will be used during connector deployment.
-   For more information on creating a Lambda layer, please refer to the link below:
-   https://docs.aws.amazon.com/lambda/latest/dg/configuration-layers.html#configuration-layers-create
-
-
-# Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below, or use the more detailed tutorial in the athena-example module:
-
-1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
-2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
-   (**Note: failure to follow this step will result in compilation errors**).
-3. From the **athena-jdbc** dir, run `mvn clean install`.
-4. From the **athena-teradata** dir, run `mvn clean install`.
-5. From the **athena-teradata** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-teradata` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for Serverless Application Repository to retrieve it. This allows users with permission to do so to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo)
-
-# JDBC Driver Versions
-
-For the latest version information see [pom.xml](./pom.xml).
-
-# Limitations
-* Write DDL operations are not supported.
-* In a Mux setup, the spill bucket and prefix are shared across all database instances.
-* Any relevant Lambda limits. See the Lambda documentation.
-
-# Performance tuning
-
-Teradata supports native partitions. Athena's Lambda connector can retrieve data from these partitions in parallel. We highly recommend native partitioning when retrieving huge datasets with uniform partition distribution.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-teradata.html).
diff --git a/athena-timestream/README.md b/athena-timestream/README.md
index 1529fd1d57..70c0c0fb29 100644
--- a/athena-timestream/README.md
+++ b/athena-timestream/README.md
@@ -3,76 +3,4 @@
 
 This connector enables Amazon Athena to communicate with AWS Timestream, making your timeseries data accessible via Amazon Athena.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-## Usage
-
-### Parameters
-
-The Athena Timestream Connector exposes several configuration options via Lambda environment variables. More detail on the available parameters can be found below.
-
-1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda's limits, this is the bucket that the data will be written to for Athena to read the excess from. (e.g. my_bucket)
-2. **spill_prefix** - (Optional) Defaults to a sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/hours.
-3. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the S3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-
-### Setting Up Databases & Tables
-
-You can optionally use the AWS Glue Data Catalog as a source of supplemental metadata. When enabled, the Timestream connector will check for a matching database and table in AWS Glue which has one or more of the below properties. This is particularly useful if you want to use features of Timestream's SQL dialect that are not natively supported by Athena. If the matching Glue table is in fact a 'view', the Athena Timestream Connector will use the SQL from that view, in conjunction with your Athena SQL query, to access your data.
-
-To enable a Glue Table for use with Timestream, you can set the following properties on the Table.
-
-1. **timestream-metadata-flag** - Flag indicating that the table can be used for supplemental meta-data by the Athena Timestream Connector. The value is unimportant as long as this key is present in the properties of the table.
-2. **_view_template** - When using Glue for supplemental metadata, you can set this table property and include any arbitrary Timestream SQL as the 'view'. The Athena Timestream connector will then use this SQL, combined with the SQL from Athena, to run your query. This allows you to access features in Timestream SQL that are not available in Athena's SQL dialect.
-
-### Data Types
-
-This connector presently only supports a subset of the data types available in Timestream, notably: scalar values of varchar, double, timestamp.
-
-In order to query the "timeseries" data type, you will need to set up a 'view' in AWS Glue that leverages the 'CREATE_TIME_SERIES' function of Timestream. You'll also need to supply a schema for the view which uses "ARRAY<STRUCT<time: timestamp, measure_value:double>>" as the type of any of your timeseries columns. Be sure to replace "double" with the appropriate scalar type for your table.
-
-Below is a sample of how you can set up such a view over a timeseries in AWS Glue, followed by a sketch of the view SQL itself.
-
-![Example](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-timestream/docs/img/timestream_glue_example.png?raw=true)
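-
-For illustration, the Timestream SQL stored in a `_view_template` property might look like the sketch below. The database, table, dimension, and measure names are hypothetical:
-
-```sql
--- Groups raw rows into one timeseries value per host
-SELECT
-    hostname,
-    CREATE_TIME_SERIES(time, measure_value::double) AS cpu_utilization
-FROM "my_timestream_db".host_metrics
-WHERE measure_name = 'cpu_utilization'
-GROUP BY hostname
-```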
-
-
-### Required Permissions
-
-Review the "Policies" section of the athena-timestream.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.
-
-1. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
-2. Glue Data Catalog - The connector requires Read-Only access to Glue's DataCatalog when Glue is used as a source of supplemental metadata, as described above.
-3. CloudWatch Logs - This is a somewhat implicit permission when deploying a Lambda function, but the function needs access to CloudWatch Logs for storing its logs.
-4. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.
-5. Timestream Access - Required in order to run Timestream queries.
-
-### Running Integration Tests
-
-The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless,
-the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
-For build commands and step-by-step instructions on building and running the integration tests see the
-[Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.
-
-In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
-For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.
-
-Once all prerequisites have been satisfied, the integration tests can be executed by running `mvn failsafe:integration-test` from the connector's root directory.
-
-### Deploying The Connector
-
-To use the Amazon Athena Timestream Connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below or use the more detailed tutorial in the athena-example module:
-
-1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
-2. From the athena-timestream dir, run `mvn clean install`.
-3. From the athena-timestream dir, run `../tools/publish.sh S3_BUCKET_NAME athena-timestream` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for the Serverless Application Repository to retrieve it. This allows users with permission to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo)
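-
-Once deployed, the connector can be queried from Athena like any other federated catalog. A hypothetical smoke-test query (the function, database, and table names below are examples only) might look like:
-
-```sql
-SELECT * FROM "lambda:timestream".my_timestream_db.host_metrics LIMIT 10
-```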
-
-## Performance
-
-The performance of this connector is currently a work in progress and is significantly (> 2x) slower than running queries from Timestream itself. We recommend limiting the data returned (not data scanned) to less than 256MB for the initial release. There are a number of unique and interesting use cases that are possible even well below the 256MB recommendation.
-
-## License
-
-This project is licensed under the Apache-2.0 License.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-timestream.html).
diff --git a/athena-tpcds/README.md b/athena-tpcds/README.md
index 27d0276fae..49389c92e4 100644
--- a/athena-tpcds/README.md
+++ b/athena-tpcds/README.md
@@ -2,137 +2,4 @@
 
 This connector enables Amazon Athena to communicate with a source of randomly generated TPC-DS data for use in benchmarking and functional testing of Athena Federation. We do _not_ recommend the use of this connector as an alternative to S3 based data lake performance tests.
 
-**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**
-
-## Usage
-
-### Parameters
-
-The Athena TPC-DS Connector exposes several configuration options via Lambda environment variables. More detail on the available parameters can be found below.
-
-1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda’s limits, this is the bucket that the data will be written to for Athena to read the excess from. (e.g. my_bucket)
-2. **spill_prefix** - (Optional) Defaults to a sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle on this location to delete old spills after X days/hours.
-3. **spill_put_request_headers** - (Optional) This is a JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-4. **kms_key_id** - (Optional) By default any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation for a stronger source of encryption keys. (e.g. a7e63k4b-8loc-40db-a2a1-4d0en2cd8331)
-5. **disable_spill_encryption** - (Optional) Defaults to False so that any data that is spilled to S3 is encrypted using AES-GCM either with a randomly generated key or using KMS to generate keys. Setting this to True will disable spill encryption. You may wish to disable this for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption. (e.g. True or False)
-
-### Databases & Tables
-
-The Athena TPC-DS Connector generates a TPC-DS compliant database at one of five ("tpcds1", "tpcds10", "tpcds100", "tpcds250", "tpcds1000") scale factors.
-
-For a complete list of tables and columns, please use `show tables` and `describe table` queries; a summary of the tables is below. You can find copies of TPC-DS queries that are compatible with this generated schema and data in the src/main/resources/queries directory of this module.
-
-1. call_center
-1. catalog_page
-1. catalog_returns
-1. catalog_sales
-1. customer
-1. customer_address
-1. customer_demographics
-1. date_dim
-1. dbgen_version
-1. household_demographics
-1. income_band
-1. inventory
-1. item
-1. promotion
-1. reason
-1. ship_mode
-1. store
-1. store_returns
-1. store_sales
-1. time_dim
-1. warehouse
-1. web_page
-1. web_returns
-1. web_sales
-1. web_site
-
-The below query is one example that is set up for use with a catalog called tpcds.
-
-```sql
-SELECT
-  cd_gender,
-  cd_marital_status,
-  cd_education_status,
-  count(*) cnt1,
-  cd_purchase_estimate,
-  count(*) cnt2,
-  cd_credit_rating,
-  count(*) cnt3,
-  cd_dep_count,
-  count(*) cnt4,
-  cd_dep_employed_count,
-  count(*) cnt5,
-  cd_dep_college_count,
-  count(*) cnt6
-FROM
-  "lambda:tpcds".tpcds1.customer c, "lambda:tpcds".tpcds1.customer_address ca, "lambda:tpcds".tpcds1.customer_demographics
-WHERE
-  c.c_current_addr_sk = ca.ca_address_sk AND
-  ca_county IN ('Rush County', 'Toole County', 'Jefferson County',
-  'Dona Ana County', 'La Porte County') AND
-  cd_demo_sk = c.c_current_cdemo_sk AND
-  exists(SELECT *
-  FROM "lambda:tpcds".tpcds1.store_sales, "lambda:tpcds".tpcds1.date_dim
-  WHERE c.c_customer_sk = ss_customer_sk AND
-  ss_sold_date_sk = d_date_sk AND
-  d_year = 2002 AND
-  d_moy BETWEEN 1 AND 1 + 3) AND
-  (exists(SELECT *
-  FROM "lambda:tpcds".tpcds1.web_sales, "lambda:tpcds".tpcds1.date_dim
-  WHERE c.c_customer_sk = ws_bill_customer_sk AND
-  ws_sold_date_sk = d_date_sk AND
-  d_year = 2002 AND
-  d_moy BETWEEN 1 AND 1 + 3) OR
-  exists(SELECT *
-  FROM "lambda:tpcds".tpcds1.catalog_sales, "lambda:tpcds".tpcds1.date_dim
-  WHERE c.c_customer_sk = cs_ship_customer_sk AND
-  cs_sold_date_sk = d_date_sk AND
-  d_year = 2002 AND
-  d_moy BETWEEN 1 AND 1 + 3))
-GROUP BY cd_gender,
-  cd_marital_status,
-  cd_education_status,
-  cd_purchase_estimate,
-  cd_credit_rating,
-  cd_dep_count,
-  cd_dep_employed_count,
-  cd_dep_college_count
-ORDER BY cd_gender,
-  cd_marital_status,
-  cd_education_status,
-  cd_purchase_estimate,
-  cd_credit_rating,
-  cd_dep_count,
-  cd_dep_employed_count,
-  cd_dep_college_count
-LIMIT 100
-```
-
-### Required Permissions
-
-Review the "Policies" section of the athena-tpcds.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.
-
-1. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
-1. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.
-
-### Deploying The Connector
-
-To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below or use the more detailed tutorial in the athena-example module:
-
-1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
-2. From the athena-tpcds dir, run `mvn clean install`.
-3. From the athena-tpcds dir, run `../tools/publish.sh S3_BUCKET_NAME athena-tpcds` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for the Serverless Application Repository to retrieve it. This allows users with permission to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo)
-4. Try running a query like the one below in Athena:
-```sql
-select * from "lambda:<function_name>".schema.table limit 100
-```
-
-## Performance
-
-The Athena TPC-DS Connector will attempt to parallelize queries based on the scale factor you have chosen. Predicate Pushdown is performed within the Lambda function.
-
-## License
-
-This project is licensed under the Apache-2.0 License.
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-tpcds.html).
diff --git a/athena-vertica/README.md b/athena-vertica/README.md
index f6fdac2ba7..88eb902833 100644
--- a/athena-vertica/README.md
+++ b/athena-vertica/README.md
@@ -2,149 +2,4 @@
 
 This connector enables Amazon Athena to communicate with your Vertica Database instance(s), making your Vertica data accessible via SQL.
 
-### Design
-
-![Alt text](av.jpg "Architecture")
-
-The Amazon Athena Vertica connector for federated queries uses the following design:
-1. A SQL query is issued against table(s) in Vertica.
-2. The connector will parse the SQL query and send the relevant portion to Vertica through the JDBC connection.
-3. The connection strings will use the username and password stored in AWS Secrets Manager to gain access to Vertica.
-4. The connector will wrap the SQL query against Vertica with a Vertica EXPORT command, like this example:
-   ```
-   EXPORT TO PARQUET (directory = 's3://<bucket_name>/<folder_name>', Compression='Snappy', fileSizeMB=64) OVER() as
-   SELECT
-   PATH_ID,
-   ….
-   SOURCE_ITEMIZED,
-   SOURCE_OVERRIDE
-   FROM DELETED_OBJECT_SCHEMA.FORM_USAGE_DATA
-   WHERE PATH_ID <= 5;
-   ```
-5. Vertica will process the SQL query and send the result set to an S3 bucket instead of sending the result set through the JDBC connection back to Athena, as shown in the EXPORT command above. This eliminates the performance problem of sending large multi-GB result sets through the JDBC connection. Instead, Vertica can use the EXPORT utility to parallelize the write to the S3 bucket by writing multiple parquet files, achieving high write bandwidth to S3.
-6. Athena will scan the S3 bucket to determine the number of files to read for the result set.
-7. Athena will call Lambda with multiple calls to read back all the parquet files (using S3 SELECT) that comprise the result set from Vertica. This allows Athena to parallelize the read of the S3 files, with a maximum bandwidth of up to 100GB per second.
-8. Athena will process the data returned from Vertica with data scanned from the data lake and return the result to the user.
-
-
-### Parameters
-
-The Amazon Athena Vertica Connector exposes several configuration options via Lambda environment variables. More detail on the available parameters can be found below (a deployment sketch follows this list).
-
-* **AthenaCatalogName:** Lambda function name
-* **ExportBucket:** The S3 bucket where the Vertica query results will be exported.
-* **SpillBucket:** The name of the bucket where this function can spill data.
-* **SpillPrefix:** The prefix within SpillBucket where this function can spill data.
-* **SecurityGroupIds:** One or more SecurityGroup IDs corresponding to the SecurityGroup that should be applied to the Lambda function. (e.g. sg1, sg2, sg3)
-* **SubnetIds:** One or more Subnet IDs corresponding to the Subnet that the Lambda function can use to access your data source. (e.g. subnet1, subnet2)
-* **SecretNameOrPrefix:** The name or prefix of a set of names within Secrets Manager that this function should have access to. (e.g. vertica-*)
-* **VerticaConnectionString:** The Vertica connection details to use by default if no catalog-specific connection is defined, optionally using Secrets Manager (e.g. ${secret_name}).
-* **VPC ID:** The VPC Id to be attached to the Lambda function.
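-
-If you deploy the connector's SAM template directly rather than through the Serverless Application Repository UI, these parameters can be passed on the command line. The sketch below is illustrative only: the stack name and all values are hypothetical, and the exact parameter keys should be verified against athena-vertica.yaml.
-
-```bash
-# Hypothetical values; check athena-vertica.yaml for the actual parameter keys
-sam deploy \
-    --template-file packaged.yaml \
-    --stack-name athena-vertica-connector \
-    --capabilities CAPABILITY_IAM \
-    --parameter-overrides \
-        AthenaCatalogName=my-vertica-connector \
-        ExportBucket=my-vertica-export-bucket \
-        SpillBucket=my-spill-bucket \
-        SpillPrefix=athena-spill \
-        SecretNameOrPrefix='vertica-*' \
-        VerticaConnectionString='jdbc:vertica://vertica.example.com:5433/VMart?user=${vertica-username}&password=${vertica-password}'
-```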
-
-### Terms
-
-* **Database Instance:** Any instance of a Vertica database deployed on EC2.
-* **Database type:** Vertica.
-* **Handler:** A Lambda handler accessing your database instance(s). Could be a metadata or a record handler.
-* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s).
-* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s).
-* **Composite Handler:** A Lambda handler that retrieves metadata and data records from your database instance(s). This is the recommended Lambda function handler.
-* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
-* **Connection String:** Used to establish a connection to a database instance.
-* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used to prefix the `connection_string` property.
-
-### Connection String:
-
-A connection string is used to connect to a database instance. The following format is supported:
-
-`jdbc:vertica://<host_name>:<port>/<database>?user=<username>&password=<password>`
-
-
-The Amazon Athena Vertica Connector supports substitution of any string enclosed like *${SecretName}* with *username* and *password* retrieved from AWS Secrets Manager. Example:
-
-```
-jdbc:vertica://<host_name>:<port>/<database>?user=${vertica-username}&password=${vertica-password}
-```
-will be modified to:
-
-```
-jdbc:vertica://<host_name>:<port>/<database>?user=sample-user&password=sample-password
-```
-The secret name `vertica-username` will be used to retrieve the secrets.
-
-Currently supported databases recognize the `vertica-username` and `vertica-password` JDBC properties.
-
-### Database specific handler parameters
-
-Database specific metadata and record handlers can also be used to connect to a database instance. These are currently capable of connecting to a single database instance.
-
-|Handler|Class|
-|---|---|
-|Composite Handler|VerticaCompositeHandler|
-|Metadata Handler|VerticaMetadataHandler|
-|Record Handler|VerticaRecordHandler|
-
-**Parameters:**
-
-```
-default         Default connection string. Required. This will be used when a catalog is not recognized.
-```
-
-### Spill parameters:
-
-The Lambda SDK may spill data to S3. All database instances accessed using a single Lambda spill to the same location.
-
-```
-spill_bucket                Spill bucket name. Required.
-spill_prefix                Spill bucket key prefix. Required.
-spill_put_request_headers   JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
-```
-
-### Vertica Data types supported
-
-|Vertica|
-| ---|
-|Boolean|
-|BigInt|
-|Short|
-|Integer|
-|Long|
-|float|
-|Double|
-|Date|
-|Varchar|
-|Bytes|
-|BigDecimal|
-|TimeStamp as Varchar|
-
-See the respective database documentation for conversion between JDBC and database types.
-
-### Secrets
-
-We support two ways to input the database username and password:
-
-1. **AWS Secrets Manager:** The names of the secrets in AWS Secrets Manager can be embedded in the JDBC connection string (e.g. `${vertica-username}` and `${vertica-password}`) and are replaced with the corresponding values retrieved from AWS Secrets Manager (see the CLI sketch after this list). To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
-2. **Connection String:** Username and password can be specified as properties in the JDBC connection string.
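-
-As an illustration, the two secrets referenced in the connection string examples above could be created with the AWS CLI. The secret names match the examples above; the values are placeholders:
-
-```bash
-# Create the secrets that ${vertica-username} and ${vertica-password} resolve to
-aws secretsmanager create-secret --name vertica-username --secret-string sample-user
-aws secretsmanager create-secret --name vertica-password --secret-string sample-password
-```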
-
-### Deploying The Connector
-
-To use the Amazon Athena Vertica Connector in your queries, build and deploy this connector from source by following the steps below:
-
-1. From the athena-federation dir, run `mvn clean install` if you haven't already.
-2. From the athena-vertica dir, run `mvn clean install`.
-3. From the athena-vertica dir, run `../tools/publish.sh S3_BUCKET_NAME athena-vertica [region]` to publish the connector to your private AWS Serverless Application Repository. The `S3_BUCKET_NAME` in the command is where a copy of the connector's code will be stored for the Serverless Application Repository to retrieve it. This allows users with permission to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo)
-4. Deploy the serverless application following the instructions [here](https://docs.aws.amazon.com/serverlessrepo/latest/devguide/serverlessrepo-how-to-consume.html)
-5. Connect the serverless application with Athena following the instructions [here](https://docs.aws.amazon.com/athena/latest/ug/connect-to-a-data-source-lambda.html)
-
-### Vertica Drivers
-
-The POM references the Vertica drivers hosted in the Maven Central repository.
-
-
-### Limitations
-1. The Athena Vertica connector uses S3 Select internally to read the parquet files from S3, which can slow the connector's performance. It is recommended to use a `CREATE TABLE AS (SELECT ..)` statement and SQL predicates when querying large tables.
-2. Currently, due to a bug in Athena Federated Query, the connector will cause Vertica to export ALL the columns of the queried table to S3, but only the queried columns will be visible in the results on the Athena console.
-3. Write DDL operations are not supported.
-4. Any relevant AWS Lambda limits apply.
-
-
+Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-vertica.html).