Commit ab6ca9f

Merge pull request #2 from slicelife/update-upstream
2 parents 1fb8e19 + 034b0b8 commit ab6ca9f

4 files changed: +176 -22 lines changed

README.md

Lines changed: 69 additions & 13 deletions
@@ -32,7 +32,7 @@ Terraform modules which create AWS resources for a Segment Data Lake.
 * Authorized [AWS account](https://aws.amazon.com/account/).
 * Ability to run Terraform with your AWS Account. You must use Terraform 0.11 or higher.
 * A subnet within a VPC for the EMR cluster to run in.
-* [S3 Bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) to send data from Segment to and to store logs.
+* An [S3 Bucket](https://github.com/terraform-aws-modules/terraform-aws-s3-bucket) for Segment to load data into. You can create a new one just for this, or re-use an existing one you already have.
 
 ## VPC
 
@@ -50,6 +50,33 @@ The repository is split into multiple modules, and each can be used independentl
 
 # Usage
 
+## Terraform Installation
+*Note* - Skip this section if you already have a working Terraform setup.
+### OSX:
+`brew` on OSX should install the latest version of Terraform.
+```
+brew install terraform
+```
+
+### CentOS/Ubuntu:
+* Follow the instructions [here](https://phoenixnap.com/kb/how-to-install-terraform-centos-ubuntu) to install Terraform on CentOS or Ubuntu.
+* Ensure that the version installed is > 0.11.x
+
+Verify that the installation works by running:
+```
+terraform help
+```
+
+## Set up Project
+* Create project directory
+```
+mkdir segment-datalakes-tf
+```
+* Create `main.tf` file
+* Update the `segment_sources` variable in the `locals` to the sources you want to sync
+* Update the `name` in the `aws_s3_bucket` resource to the desired name of your S3 bucket
+* Update the `subnet_id` in the `emr` module to the subnet in which to create the EMR cluster
+
 ```hcl
 provider "aws" {
   region = "us-west-2" # Replace this with the AWS region your infrastructure is set up in.
@@ -70,14 +97,6 @@ resource "aws_s3_bucket" "segment_datalake_s3" {
   name = "my-first-segment-datalake"
 }
 
-# This is optional.
-# Segment will create a DB for you if it does not exist already.
-module "glue" {
-  source = "git@github.com:segmentio/terraform-aws-data-lake//modules/glue?ref=v0.1.5"
-
-  name = "segment_data_lake"
-}
-
 # Creates the IAM Policy that allows Segment to access the necessary resources
 # in your AWS account for loading your data.
 module "iam" {
@@ -91,18 +110,53 @@ module "iam" {
 # Creates an EMR Cluster that Segment uses for performing the final ETL on your
 # data that lands in S3.
 module "emr" {
-  source = "git@github.com:segmentio/terraform-aws-data-lake//modules/emr?ref=v0.1.5"
+  source = "git@github.com:segmentio/terraform-aws-data-lake//modules/emr?ref=v0.2.0"
 
   s3_bucket = "${aws_s3_bucket.segment_datalake_s3.name}"
   subnet_id = "subnet-XXX" # Replace this with the subnet ID you want the EMR cluster to run in.
+
+  # LEAVE THIS AS-IS
+  iam_emr_autoscaling_role = "${module.iam.iam_emr_autoscaling_role}"
+  iam_emr_service_role = "${module.iam.iam_emr_service_role}"
+  iam_emr_instance_profile = "${module.iam.iam_emr_instance_profile}"
 }
 ```
-
-With the Terraform CLI, you can run `terraform plan` to preview the changes by the modules, and `terraform apply` to generate the resources.
+## Provision Resources
+* Provide the AWS credentials for the account being used. More details: https://www.terraform.io/docs/providers/aws/index.html
+```
+export AWS_ACCESS_KEY_ID="anaccesskey"
+export AWS_SECRET_ACCESS_KEY="asecretkey"
+export AWS_DEFAULT_REGION="us-west-2"
+```
+* Initialize the referenced modules
+```
+terraform init
+```
+You should see a success message once initialization completes:
+```
+Terraform has been successfully initialized!
+```
+* Run plan
+This does not create any resources. It only outputs what will be created when you run apply (the next step).
+```
+terraform plan
+```
+You should see something like this towards the end of the plan output:
+```
+Plan: 13 to add, 0 to change, 0 to destroy.
+```
+* Run apply - this step creates the resources in your AWS infrastructure
+```
+terraform apply
+```
+You should see:
+```
+Apply complete! Resources: 13 added, 0 changed, 0 destroyed.
+```
 
 Note that creating the EMR cluster can take a while (typically 5 minutes).
 
-Once applied, make a note of the following (you'll need to provide this information to your Segment contact):
+Once applied, make a note of the following (you'll need to enter these as settings when configuring the Data Lake):
 * The **AWS Region** and **AWS Account ID** where your Data Lake was configured
 * The **Source ID and Slug** for _each_ Segment source that will be connected to the data lake
 * The generated **EMR Cluster ID**
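
Review note: the AWS Region and AWS Account ID in the checklist above could be read back from Terraform instead of being looked up by hand. The following is a minimal sketch, assuming Terraform 0.12 syntax and that it is placed in the same `main.tf`. The output names are hypothetical and not part of the module. The EMR Cluster ID is left out because this diff does not show which outputs the `emr` module exposes.

```hcl
# Look up the account and region that the configured AWS provider resolves to.
data "aws_caller_identity" "current" {}
data "aws_region" "current" {}

# Hypothetical output names, chosen for this example only.
output "segment_datalake_aws_account_id" {
  value = data.aws_caller_identity.current.account_id
}

output "segment_datalake_aws_region" {
  # Attribute name as documented for the AWS provider at the time of this
  # commit; check the provider docs if a newer release reports it as deprecated.
  value = data.aws_region.current.name
}
```

After `terraform apply`, running `terraform output` should list both values.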
@@ -148,6 +202,8 @@ If all else fails, teardown and start over.
 
 Terraform 0.11 or higher is supported.
 
+NOTE: From release v0.2.0 onwards, only Terraform 0.12 or higher is supported.
+
 # Development
 
 To develop in this repository, you'll want the following tools set up:
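
Review note: the Terraform 0.12 requirement introduced with v0.2.0 could also be enforced in configuration rather than stated only in prose. A minimal sketch, assuming the constraint is added to the root `main.tf` by the user; nothing in this diff shows the module declaring it:

```hcl
terraform {
  # Refuse to run on Terraform releases older than 0.12.
  required_version = ">= 0.12"
}
```

With this block present, older Terraform binaries stop with a version error instead of failing later on unsupported syntax.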

modules/emr/README.md

Lines changed: 56 additions & 0 deletions
@@ -48,6 +48,62 @@ Type: `string`
 
 Default: `""`
 
+### master\_instance\_type
+
+Description: EC2 Instance Type for Master
+
+Type: `string`
+
+Default: `"m5.xlarge"`
+
+### core\_instance\_type
+
+Description: EC2 Instance Type for Core Nodes
+
+Type: `string`
+
+Default: `"m5.xlarge"`
+
+### task\_instance\_type
+
+Description: EC2 Instance Type for Task Nodes
+
+Type: `string`
+
+Default: `"m5.xlarge"`
+
+### core\_instance\_count
+
+Description: Number of instances of Core Nodes
+
+Type: `string`
+
+Default: `"2"`
+
+### core\_instance\_max\_count
+
+Description: Max number of Core Nodes used on autoscale
+
+Type: `string`
+
+Default: `"4"`
+
+### task\_instance\_count
+
+Description: Number of instances of Task Nodes
+
+Type: `string`
+
+Default: `"2"`
+
+### task\_instance\_max\_count
+
+Description: Max number of Task Nodes used on autoscale
+
+Type: `string`
+
+Default: `"4"`
+
 ### tags
 
 Description: A map of tags to add to all resources. A vendor=segment tag will be added automatically (which is also used by the IAM policy to provide Segment access to submit jobs).
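
Review note: the inputs added in this diff can be overridden from the calling `module "emr"` block. The following is a sketch only. The instance types and counts are illustrative values, the bucket name and subnet ID are the placeholders from the top-level README, and the `module.iam` references assume the `iam` module from that README is declared in the same configuration.

```hcl
module "emr" {
  source = "git@github.com:segmentio/terraform-aws-data-lake//modules/emr?ref=v0.2.0"

  s3_bucket = "my-first-segment-datalake" # Placeholder bucket name.
  subnet_id = "subnet-XXX"                # Replace with your subnet ID.

  # Example overrides; omit any of these to keep the documented defaults.
  master_instance_type    = "m5.2xlarge"
  core_instance_type      = "m5.2xlarge"
  task_instance_type      = "m5.xlarge"
  core_instance_count     = "3"
  core_instance_max_count = "6"
  task_instance_count     = "2"
  task_instance_max_count = "4"

  iam_emr_autoscaling_role = "${module.iam.iam_emr_autoscaling_role}"
  iam_emr_service_role     = "${module.iam.iam_emr_service_role}"
  iam_emr_instance_profile = "${module.iam.iam_emr_instance_profile}"
}
```

The counts are passed as strings because the variables are declared with `type = "string"`.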

modules/emr/main.tf

Lines changed: 9 additions & 9 deletions
@@ -18,7 +18,7 @@ resource "aws_emr_cluster" "segment_data_lake_emr_cluster" {
   autoscaling_role = "${var.iam_emr_autoscaling_role}"
 
   master_instance_group {
-    instance_type = "m5.xlarge"
+    instance_type = "${var.master_instance_type}"
     name = "master_group"
 
     ebs_config {
@@ -29,8 +29,8 @@ resource "aws_emr_cluster" "segment_data_lake_emr_cluster" {
   }
 
   core_instance_group {
-    instance_type = "m5.xlarge"
-    instance_count = 2
+    instance_type = "${var.core_instance_type}"
+    instance_count = "${var.core_instance_count}"
     name = "core_group"
 
 
@@ -43,8 +43,8 @@ resource "aws_emr_cluster" "segment_data_lake_emr_cluster" {
     autoscaling_policy = <<EOF
 {
   "Constraints": {
-    "MinCapacity": 2,
-    "MaxCapacity": 4
+    "MinCapacity": ${var.core_instance_count},
+    "MaxCapacity": ${var.core_instance_max_count}
   },
   "Rules": [{
     "Action": {
@@ -120,8 +120,8 @@ resource "aws_emr_instance_group" "task" {
   name = "task_group"
   cluster_id = join("", aws_emr_cluster.segment_data_lake_emr_cluster.*.id)
 
-  instance_type = "m5.xlarge"
-  instance_count = "2"
+  instance_type = "${var.task_instance_type}"
+  instance_count = "${var.task_instance_count}"
 
   ebs_config {
     size = "64"
@@ -132,8 +132,8 @@ resource "aws_emr_instance_group" "task" {
   autoscaling_policy = <<EOF
 {
   "Constraints": {
-    "MinCapacity": 2,
-    "MaxCapacity": 4
+    "MinCapacity": ${var.task_instance_count},
+    "MaxCapacity": ${var.task_instance_max_count}
   },
   "Rules": [{
     "Action": {
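
Review note: the heredocs above interpolate string variables straight into JSON, so a non-numeric value would yield malformed JSON. One alternative, which is not what this commit does, is to build the document with `jsonencode` and `tonumber`, both available in Terraform 0.12. The sketch below is standalone: the local and output names are hypothetical, and the `Rules` array is omitted because the diff only shows its first lines.

```hcl
variable "core_instance_count" {
  type    = string
  default = "2"
}

variable "core_instance_max_count" {
  type    = string
  default = "4"
}

locals {
  # tonumber raises an error if a count is not numeric, and jsonencode
  # takes care of quoting and commas in the resulting JSON string.
  core_autoscaling_constraints = jsonencode({
    Constraints = {
      MinCapacity = tonumber(var.core_instance_count)
      MaxCapacity = tonumber(var.core_instance_max_count)
    }
  })
}

# Hypothetical output, included only so the result can be inspected.
output "core_autoscaling_constraints" {
  value = local.core_autoscaling_constraints
}
```

The trade-off is readability: the heredoc keeps the policy looking like the JSON in the AWS documentation, while `jsonencode` catches bad input earlier.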

modules/emr/variables.tf

Lines changed: 42 additions & 0 deletions
@@ -53,6 +53,48 @@ variable "iam_emr_instance_profile" {
   type = "string"
 }
 
+variable "master_instance_type" {
+  description = "EC2 Instance Type for Master"
+  type = "string"
+  default = "m5.xlarge"
+}
+
+variable "core_instance_type" {
+  description = "EC2 Instance Type for Core Nodes"
+  type = "string"
+  default = "m5.xlarge"
+}
+
+variable "task_instance_type" {
+  description = "EC2 Instance Type for Task Nodes"
+  type = "string"
+  default = "m5.xlarge"
+}
+
+variable "core_instance_count" {
+  description = "Number of Core Nodes"
+  type = "string"
+  default = "2"
+}
+
+variable "core_instance_max_count" {
+  description = "Max number of Core Nodes used on autoscale"
+  type = "string"
+  default = "4"
+}
+
+variable "task_instance_count" {
+  description = "Number of instances of Task Nodes"
+  type = "string"
+  default = "2"
+}
+
+variable "task_instance_max_count" {
+  description = "Max number of Task Nodes used on autoscale"
+  type = "string"
+  default = "4"
+}
+
 locals {
   tags = "${merge(map("vendor", "segment"), var.tags)}"
 }
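
Review note: the count variables are declared as strings even though they hold numbers. A possible follow-up, not part of this commit, is to declare them with `type = number` and add a `validation` block. This is a sketch under two assumptions: Terraform 0.13 or newer is in use (custom validation is not available in 0.12), and a lower bound of 1 is appropriate, which is an illustrative choice rather than a documented module rule.

```hcl
variable "core_instance_count" {
  description = "Number of Core Nodes"
  type        = number
  default     = 2

  validation {
    # Illustrative lower bound; adjust to whatever the cluster actually needs.
    condition     = var.core_instance_count >= 1
    error_message = "The core_instance_count value must be at least 1."
  }
}
```

Callers could then pass `core_instance_count = 3` without quotes, and an out-of-range value would be rejected during `terraform plan`.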
