Mass port READMEs to point to Amazon Athena Documentation
Michael Hackett authored and henrymai committed Sep 12, 2022
1 parent cea6923 commit 660810a
Showing 25 changed files with 25 additions and 3,500 deletions.
66 changes: 1 addition & 65 deletions athena-aws-cmdb/README.md
@@ -2,68 +2,4 @@

This connector enables Amazon Athena to communicate with various AWS Services, making your AWS Resource inventory accessible via SQL.

**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**

## Usage

### Parameters

The Athena AWS CMDB Connector provides several configuration options via Lambda environment variables. More detail on the available parameters can be found below, followed by a sample configuration.

1. **spill_bucket** - When the data returned by your Lambda function exceeds Lambda's limits, this is the bucket that the excess data is written to so that Athena can read it from there. (e.g. my_bucket)
2. **spill_prefix** - (Optional) Defaults to a sub-folder in your bucket called 'athena-federation-spill'. Used in conjunction with spill_bucket, this is the path within the above bucket that large responses are spilled to. You should configure an S3 lifecycle rule on this location to delete old spills after X days/hours.
3. **spill_put_request_headers** - (Optional) This is a JSON-encoded map of request headers and values for the S3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
4. **kms_key_id** - (Optional) By default, any data that is spilled to S3 is encrypted using AES-GCM and a randomly generated key. Setting a KMS Key ID allows your Lambda function to use KMS for key generation, for a stronger source of encryption keys. (e.g. a7e63k4b-8loc-40db-a2a1-4d0en2cd8331)
5. **disable_spill_encryption** - (Optional) Defaults to False, so that any data that is spilled to S3 is encrypted using AES-GCM, either with a randomly generated key or with keys generated by KMS. Setting this to True disables spill encryption. You may wish to disable encryption for improved performance, especially if your spill location in S3 uses S3 Server Side Encryption. (e.g. True or False)
6. **default_ec2_image_owner** - (Optional) When set, this controls the default EC2 image (aka AMI) owner used to filter AMIs. When this isn't set and your query against the ec2_images table does not include an owner filter, you will get a large number of results, since the response will include all public images.
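
As an illustration, a hypothetical set of these environment variables might look like the following (the bucket name and image owner value are placeholders, not defaults):

```
spill_bucket              my_bucket
spill_prefix              athena-federation-spill
spill_put_request_headers {"x-amz-server-side-encryption" : "AES256"}
disable_spill_encryption  False
default_ec2_image_owner   amazon
```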

### Databases & Tables

The Athena AWS CMDB Connector makes the following databases and tables available for querying your AWS Resource Inventory. For more information on the columns available in each table, try running `describe database.table` from the Athena Console or API; a sample query also follows the list below.

1. **ec2** - This database contains EC2 related resources, including:
* **ebs_volumes** - Contains details of your EBS volumes.
* **ec2_instances** - Contains details of your EC2 Instances.
* **ec2_images** - Contains details of your EC2 Instance images.
* **routing_tables** - Contains details of your VPC Routing Tables.
* **security_groups** - Contains details of your Security Groups.
* **subnets** - Contains details of your VPC Subnets.
* **vpcs** - Contains details of your VPCs.
2. **emr** - This database contains EMR related resources, including:
* **emr_clusters** - Contains details of your EMR Clusters.
3. **rds** - This database contains RDS related resources, including:
* **rds_instances** - Contains details of your RDS Instances.
4. **s3** - This database contains S3 related resources, including:
* **buckets** - Contains details of your S3 buckets.
* **objects** - Contains details of your S3 Objects (excludes their contents).
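
For example, assuming the connector's catalog is registered as `cmdb` (the catalog name is whatever you chose at deployment), you could explore these tables with queries like:

```sql
-- Inspect the columns of the EC2 instances table
DESCRIBE "lambda:cmdb".ec2.ec2_instances

-- Sample a few rows
SELECT * FROM "lambda:cmdb".ec2.ec2_instances LIMIT 10
```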

### Required Permissions

Review the "Policies" section of the athena-aws-cmdb.yaml file for full details on the IAM Policies required by this connector. A brief summary is below.

1. S3 Write Access - In order to successfully handle large queries, the connector requires write access to a location in S3.
1. EC2 Describe - The connector uses this access to describe your EC2 Instances, Security Groups, VPCs, EBS Volumes, etc...
1. EMR Describe / List - The connector uses this access to describe your EMR Clusters.
1. RDS Describe - The connector uses this access to describe your RDS Instances.
1. S3 List - The connector uses this access to list your buckets and objects.
1. Athena GetQueryExecution - The connector uses this access to fast-fail when the upstream Athena query has terminated.

### Deploying The Connector

To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below, or use the more detailed tutorial in the athena-example module:

1. From the athena-federation-sdk dir, run `mvn clean install` if you haven't already.
2. From the athena-aws-cmdb dir, run `mvn clean install`.
3. From the athena-aws-cmdb dir, run `../tools/publish.sh S3_BUCKET_NAME athena-aws-cmdb` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for the Serverless Application Repository to retrieve it. This allows users with the appropriate permissions to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo).
4. Try running a query like the one below in Athena:
```sql
select * from "lambda:<CATALOG_NAME>".ec2.ec2_instances limit 100
```

## Performance

The Athena AWS CMDB Connector does not currently support parallel scans. Predicate pushdown is performed within the Lambda function, and where possible, partial predicates are pushed down to the services being queried. For example, a query for the details of a specific EC2 instance will turn into a targeted describe call for that specific instance id against the EC2 API.
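
As a sketch (assuming a catalog named `cmdb`; the column name and instance id here are illustrative assumptions, not confirmed schema), a query like the following can be answered with a targeted EC2 describe call rather than a full enumeration:

```sql
SELECT * FROM "lambda:cmdb".ec2.ec2_instances
WHERE instance_id = 'i-0123456789abcdef0'
```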

## License

This project is licensed under the Apache-2.0 License.
Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-cmdb.html).
177 changes: 1 addition & 176 deletions athena-cloudera-hive/README.md
@@ -2,179 +2,4 @@

This connector enables Amazon Athena to access your Cloudera Hive databases.

**Athena Federated Queries are now enabled as GA in us-east-1, us-east-2, us-west-2, eu-west-1, ap-northeast-1, ap-south-1, us-west-1, ap-southeast-1, ap-southeast-2, eu-west-2, ap-northeast-2, eu-west-3, ca-central-1, sa-east-1, and eu-central-1. To use this feature, upgrade your engine version to Athena V2 in your workgroup settings. Check documentation here for more details: https://docs.aws.amazon.com/athena/latest/ug/engine-versions.html.**

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in [`pom.xml`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/pom.xml) and agree to the terms in their respective licenses, provided in [`LICENSE.txt`](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/LICENSE.txt).

# Terms

* **Database Instance:** Any instance of a database deployed on premises, on EC2, or using RDS.
* **Handler:** A Lambda handler accessing your database instance(s). This could be a metadata or a record handler.
* **Metadata Handler:** A Lambda handler that retrieves metadata from your database instance(s).
* **Record Handler:** A Lambda handler that retrieves data records from your database instance(s).
* **Composite Handler:** A Lambda handler that retrieves both metadata and data records from your database instance(s). We recommend setting this as the Lambda function handler.
* **Multiplexing Handler:** A Lambda handler that can accept and use multiple different database connections.
* **Property/Parameter:** A database property used by handlers to extract database information for connection. These are set as Lambda environment variables.
* **Connection String:** Used to establish a connection to a database instance.
* **Catalog:** Athena Catalog. This is not a Glue Catalog. Must be used as the prefix of the `connection_string` property.

# Usage

## Parameters

The Cloudera Hive Connector supports several configuration parameters using Lambda environment variables.

### Connection String:

A JDBC connection string is used to connect to a database instance. The following format is supported: `hive://${jdbc_connection_string}`.
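
For example, a connection string for a hypothetical host `hive1host` listening on the default port might look like:

```
hive://jdbc:hive2://hive1host:10000/default?UID=sample&PWD=sample
```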

### Multiplexing handler parameters

The multiplexer provides a way to connect to multiple database instances using a single Lambda function. Requests are routed based on catalog name. Use the following classes in your Lambda function to enable the multiplexer.

|Handler|Class|
|--- |--- |
|Composite Handler|HiveMuxCompositeHandler|
|Metadata Handler|HiveMuxMetadataHandler|
|Record Handler|HiveMuxRecordHandler|


**Parameters:**

```
${catalog}_connection_string    Database instance connection string, in the format specified above. Required.
                                Example: If the catalog as registered with Athena is myhivecatalog, then the environment variable name should be myhivecatalog_connection_string.
default                         Default connection string. Required. This will be used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`.
```

Example properties for a Hive Mux Lambda function that supports two database instances, hive1host (default) and hive2host:

|Property|Value|
|---|---|
|default|hive://jdbc:hive2://hive1host:10000/default?${Test/RDS/hive1host}|
|hive2_catalog1_connection_string|hive://jdbc:hive2://hive1host:10000/default?${Test/RDS/hive1host}|
|hive2_catalog2_connection_string|hive://jdbc:hive2://hive2host:10000/default?UID=sample&PWD=sample|

The Cloudera Hive Connector supports substitution of any string enclosed like *${SecretName}* with the *username* and *password* retrieved from AWS Secrets Manager. Example:

```
hive://jdbc:hive2://hive1host:10000/default?...&${Test/RDS/hive1host}&...
```

will be modified to:

```
hive://jdbc:hive2://hive1host:10000/default?...&UID=sample2&PWD=sample2&...
```

The secret named `Test/RDS/hive1host` will be used to retrieve the credentials.

Currently, the Cloudera Hive connector recognizes the `UID` and `PWD` JDBC properties.

### Single connection handler parameters

Single connection metadata and record handlers can also be used to connect to a single Cloudera Hive instance.

|Handler|Class|
|--- |--- |
|Composite Handler|HiveCompositeHandler|
|Metadata Handler|HiveMetadataHandler|
|Record Handler|HiveRecordHandler|

**Parameters:**

```
default Default connection string. Required. This will be used when a catalog is not recognized.
```

These handlers support one database instance and must be provided the `default` connection string parameter. All other connection strings are ignored.

**Example property for a single Cloudera Hive instance supported by a Lambda function:**

|Property|Value|
|---|---|
|default|hive://jdbc:hive2://hive1host:10000/default?secret=${Test/RDS/hive1host}|

### Spill parameters:

The Lambda SDK may spill data to S3. All database instances accessed by a single Lambda function spill to the same location.

```
spill_bucket Spill bucket name. Required.
spill_prefix Spill bucket key prefix. Required.
spill_put_request_headers JSON encoded map of request headers and values for the s3 putObject request used for spilling. Example: `{"x-amz-server-side-encryption" : "AES256"}`. For more possible headers see: https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html
```
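
A hypothetical set of spill settings (the bucket name is a placeholder):

```
spill_bucket              my-spill-bucket
spill_prefix              athena-federation-spill
spill_put_request_headers {"x-amz-server-side-encryption" : "AES256"}
```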

# Data types support

|JDBC|Cloudera Hive|Arrow|
|---|---|---|
|Boolean|BOOLEAN|Bit|
|Integer|TINYINT|Tiny|
|Short|SMALLINT|Smallint|
|Integer|INT|Int|
|Long|BIGINT|Bigint|
|Float|FLOAT4|Float4|
|Double|FLOAT8|Float8|
|Date|DATE|DateDay|
|Timestamp|TIMESTAMP|DateMilli|
|String|VARCHAR|Varchar|
|Bytes|BYTES|Varbinary|
|BigDecimal|DECIMAL|Decimal|
|**\*ARRAY**|**N/A**|List|

See Cloudera Hive documentation for conversion between JDBC and database types.

**\*NOTE**: The aggregate types (ARRAY, MAP, STRUCT, and UNIONTYPE) are not yet supported by Cloudera Hive. Columns of aggregate types are treated as VARCHAR columns in SQL and STRING columns in Java.

# Secrets

We support two ways to input database username and password:

1. **AWS Secrets Manager:** The name of a secret in AWS Secrets Manager can be embedded in the JDBC connection string, where it is replaced with the `username` and `password` values from the secret. Support is tightly integrated for AWS RDS database instances. When using AWS RDS, we highly recommend using AWS Secrets Manager, including credential rotation. If your database is not using AWS RDS, store credentials as JSON in the following format: `{"username": "${username}", "password": "${password}"}` (see the sample secret after this list). To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) to connect to Secrets Manager.
2. **Connection String:** Username and password can be specified as properties in the JDBC connection string.
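
For instance, a secret named `Test/RDS/hive1host` holding the sample credentials used above would contain:

```
{"username": "sample2", "password": "sample2"}
```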

# Partitions and Splits
A partition is represented by a single partition column of type varchar. The connector leverages the partition columns defined on a Cloudera Hive table; this column contains the table's partitioning information. For a table that has no partitions, * is returned, which is equivalent to a single partition. A partition is equivalent to a split.

| Name | Type | Description |
|-----------|---------|-------------|
| partition | Varchar | Partition information for the table's columns |
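
As a sketch, assuming a hypothetical Hive table `sales` statically partitioned by `year` and a catalog registered as `myhivecatalog`, a query such as the following would be served by one split per matching partition:

```sql
SELECT * FROM "lambda:myhivecatalog".default.sales
WHERE year = '2022'
```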

# Running Integration Tests

The integration tests in this module are designed to run without the prior need for deploying the connector. Nevertheless, the integration tests will not run straight out-of-the-box. Certain build-dependencies are required for them to execute correctly.
For build commands and step-by-step instructions on building and running the integration tests see the [Running Integration Tests](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#running-integration-tests) README section in the **athena-federation-integ-test** module.

In addition to the build-dependencies, certain test configuration attributes must also be provided in the connector's [test-config.json](./etc/test-config.json) JSON file.
For additional information about the test configuration file, see the [Test Configuration](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-federation-integ-test/README.md#test-configuration) README section in the **athena-federation-integ-test** module.

Once all prerequisites have been satisfied, the integration tests can be executed by running `mvn failsafe:integration-test` from the connector's root directory.

**Note: The Cloudera Hive integration test suite will not create any Cloudera Hive service or datasets; instead, it uses existing Cloudera Hive databases.**

# Deploying The Connector

To use this connector in your queries, navigate to AWS Serverless Application Repository and deploy a pre-built version of this connector. Alternatively, you can build and deploy this connector from source by following the steps below, or use the more detailed tutorial in the athena-example module:

1. From the **athena-federation-sdk** dir, run `mvn clean install` if you haven't already.
2. From the **athena-federation-integ-test** dir, run `mvn clean install` if you haven't already
(**Note: failure to follow this step will result in compilation errors**).
3. From the **athena-jdbc** dir, run `mvn clean install`.
4. From the **athena-cloudera-hive** dir, run `mvn clean install`.
5. From the **athena-cloudera-hive** dir, run `../tools/publish.sh S3_BUCKET_NAME athena-hive2` to publish the connector to your private AWS Serverless Application Repository. The S3_BUCKET in the command is where a copy of the connector's code will be stored for the Serverless Application Repository to retrieve it. This allows users with the appropriate permissions to deploy instances of the connector via a 1-Click form. Then navigate to [Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo). Once deployed, try a query like the sample below.
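
A sanity-check query (assuming your catalog is registered as `<CATALOG_NAME>`; the database and table names here are placeholders):

```sql
select * from "lambda:<CATALOG_NAME>".default.mytable limit 100
```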

# JDBC Driver Versions

For latest version information see [pom.xml](./pom.xml).

# Limitations
* Write DDL operations are not supported.
* In a Mux setup, the spill bucket and prefix are shared across all database instances.
* Any relevant Lambda limits apply. See the Lambda documentation.

# Performance tuning

Hive supports static partitions. Athena's Lambda connector can retrieve data from these partitions in parallel. We highly recommend static partitioning when retrieving huge datasets with uniform partition distribution.
Documentation has moved [here](https://docs.aws.amazon.com/athena/latest/ug/connectors-cloudera-hive.html).