Source data:

https://s3.console.aws.amazon.com/s3/buckets/crawler-public-ap-northeast-1/flight/

s3://crawler-public-ap-northeast-1/flight/2016/csv/

Steps:

1. Prepare environment

1-0. Fetch source code

git clone git@github.com:komushi/aws-glue-study.git

1-1. Create an S3 bucket and upload the CloudFormation template files

Log in to the AWS Management Console, then:

  • Service -> S3 -> Create bucket (e.g. fight2016csv); choose any bucket name you like
  • Create a folder named cfn:
fight2016csv/
  +-- cfn
  • Upload the "iam.yaml" / "sg.yaml" / "vpc.yaml" files from the "cfn" folder of the repository cloned in 1-0 into that cfn folder
  • Copy the full URL of each uploaded file and save it for later use (see the boto3 sketch after this list)
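
These console steps can also be scripted. A minimal boto3 sketch, assuming the bucket name fight2016csv, the ap-northeast-1 region, and that the repository was cloned into the current directory (adjust all three to your own setup):

# Sketch only: create the bucket and upload the nested templates under cfn/ (step 1-1).
import boto3

bucket = "fight2016csv"                          # use your own bucket name
s3 = boto3.client("s3", region_name="ap-northeast-1")

s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "ap-northeast-1"},
)
for name in ("iam.yaml", "sg.yaml", "vpc.yaml"):
    s3.upload_file(f"aws-glue-study/cfn/{name}", bucket, f"cfn/{name}")
    # URL to paste into master.yaml in step 1-2; the console's "Copy URL" value also works
    print(f"https://{bucket}.s3.ap-northeast-1.amazonaws.com/cfn/{name}")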

1-2. Edit master.yaml file

  • Open "master.yaml" under "cfn" with proper editor
  • Replace URL links ended with "iam.yaml" / "sg.yaml" / "vpc.yaml" with their couterparts copied above and save

1-3. Use CloudFormation to prepare environment

  • Service -> CloudFormation -> Create Stack
  • Select template -> Upload a template to Amazon S3
  • Browse and select the "master.yaml" just edited and saved
  • Stack name: aws-glue-study
  • Enter a UserName and UserPassword (blank values will cause creation to fail!)
  • Check: "I acknowledge that AWS CloudFormation might create IAM resources with custom names."
  • Wait until the status changes to "CREATE_COMPLETE" (a boto3 sketch follows the stack list below)
You will see stacks like:
aws-glue-study-SecurityGroupStack-163O1XHJDA4S1 [NESTED]
aws-glue-study-IAMStack-MBEUULSULYPQ [NESTED]
aws-glue-study-VPCStack-GR86W4ABT8H1 [NESTED]
aws-glue-study
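
The stack can also be created with boto3. A minimal sketch, assuming the template parameters are named UserName and UserPassword as above and that master.yaml sits in the cloned repository; CAPABILITY_NAMED_IAM corresponds to the IAM acknowledgement checkbox:

# Sketch only: create the master stack from the edited template (step 1-3).
import boto3

cfn = boto3.client("cloudformation", region_name="ap-northeast-1")
with open("aws-glue-study/cfn/master.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="aws-glue-study",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "UserName", "ParameterValue": "your-user-name"},        # must not be blank
        {"ParameterKey": "UserPassword", "ParameterValue": "your-password"},     # must not be blank
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],   # the IAM acknowledgement checkbox
)
cfn.get_waiter("stack_create_complete").wait(StackName="aws-glue-study")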

2. Play with the original data

2-1. Crawl the original data to create a table - flight_data.csv - via the AWS Glue Crawlers GUI

  • Service -> AWS Glue -> Crawlers -> Add Crawler
  • Crawler name: crawl-public-flight-2016-csv
  • Data store: s3://crawler-public-ap-northeast-1/flight/2016/csv/
  • Choose an existing IAM role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Frequency: Run on demand
  • Output Database: flight_data
  • Run the crawler and wait until it finishes (see the boto3 sketch below)
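
The crawler can also be defined with boto3. A minimal sketch; the role name below is the one from the stack above, so substitute your own, and the console wizard offers to create the flight_data database for you, so here it is created explicitly first:

# Sketch only: define and start the crawler over the public flight data (step 2-1).
import boto3

glue = boto3.client("glue", region_name="ap-northeast-1")
glue.create_database(DatabaseInput={"Name": "flight_data"})
glue.create_crawler(
    Name="crawl-public-flight-2016-csv",
    Role="aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX",   # your role
    DatabaseName="flight_data",
    Targets={"S3Targets": [{"Path": "s3://crawler-public-ap-northeast-1/flight/2016/csv/"}]},
)
glue.start_crawler(Name="crawl-public-flight-2016-csv")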

2-2. Query the data at Athena

  • Service -> Athena -> Execute query against the flight_data database (a boto3 sketch follows):
select count(1) from csv
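
The same query can be submitted with boto3. A minimal sketch; the results prefix s3://fight2016csv/athena-results/ is an arbitrary choice for illustration, not part of the setup above:

# Sketch only: run the count query against the crawled table (step 2-2).
import boto3

athena = boto3.client("athena", region_name="ap-northeast-1")
resp = athena.start_query_execution(
    QueryString="select count(1) from csv",
    QueryExecutionContext={"Database": "flight_data"},
    ResultConfiguration={"OutputLocation": "s3://fight2016csv/athena-results/"},   # any writable prefix
)
print(resp["QueryExecutionId"])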

3. Import into a private S3 bucket as CSV

3-0. Prepare the data sink by creating folders

  • Under the bucket created in 1-1, create two more folders as shown below (see the boto3 sketch after this listing)
fight2016csv/
  +-- cfn
  +-- scripts    # new
  +-- data       # new
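
S3 "folders" are just key prefixes, so the same layout can be created with boto3. A minimal sketch, assuming the bucket from 1-1:

# Sketch only: create the scripts/ and data/ prefixes (step 3-0).
import boto3

s3 = boto3.client("s3", region_name="ap-northeast-1")
for prefix in ("scripts/", "data/"):
    s3.put_object(Bucket="fight2016csv", Key=prefix)   # a zero-byte object acts as a "folder"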

3-1. Run an ETL job to write private CSV data to S3 via the AWS Glue Jobs GUI

  • Service -> AWS Glue -> Jobs -> Add job
  • Name: job_import_csv
  • IAM Role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Script: Use the proposed script by AWS Glue
  ◉ A proposed script generated by AWS Glue
  ◉ Python
  • Script file name: job_import_csv
  • S3 path where the script is stored
  s3://fight2016csv/scripts               # Created at 3-0
  • Data source: csv - created at step 2-1
  • Data target: create a gzip-compressed CSV table in your own S3 bucket
 ◉ Create tables in your data target
    Data store: Amazon S3
    Format: CSV
    Compression type: gzip
    Target path: s3://fight2016csv/data/   # Created at 3-0
  • Save the job and edit the script
  • Run the job (a simplified sketch of the generated script follows)
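
The script AWS Glue proposes for this job follows a standard skeleton. A simplified sketch of roughly what it looks like; the generated ApplyMapping over all columns is omitted, and the exact option names can vary between Glue versions:

# Simplified sketch of the proposed job script: read the catalog table "csv"
# and write gzip-compressed CSV under the data/ prefix (step 3-1).
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the table crawled in step 2-1.
source = glueContext.create_dynamic_frame.from_catalog(
    database="flight_data", table_name="csv", transformation_ctx="source")

# (The real proposed script applies an ApplyMapping over every column here.)

# Sink: gzip-compressed CSV in the private bucket.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://fight2016csv/data/", "compression": "gzip"},
    format="csv",
    transformation_ctx="sink")

job.commit()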

3-2. Crawl to create a table - flight_data.private_data

  • Crawlers -> Add crawler
  • Crawler name: crawl-private-flight-2016-csv
  • Data store: choose the folder used as the data target at 3-1
Data store: S3
Crawl data in
  ◉ Specified path in my account
  Include path: s3://fight2016csv/data/    # data generated at 3-1
  • Choose an existing IAM role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Frequency: Run on demand
  • Output Database: flight_data
  • Prefix: private_
  • Finish, run it now, and wait until it finishes (see the boto3 sketch below)
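
Compared with the crawler in 2-1, only the include path and the table prefix change. A minimal boto3 sketch:

# Sketch only: crawl the private CSV data with a "private_" table prefix (step 3-2).
import boto3

glue = boto3.client("glue", region_name="ap-northeast-1")
glue.create_crawler(
    Name="crawl-private-flight-2016-csv",
    Role="aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX",   # your role
    DatabaseName="flight_data",
    TablePrefix="private_",                                # yields the table name private_data
    Targets={"S3Targets": [{"Path": "s3://fight2016csv/data/"}]},
)
glue.start_crawler(Name="crawl-private-flight-2016-csv")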

3-3. Query the data

  • Service -> Athena -> Query editor
select count(1) from private_data

4. ETL to Parquet

4-0. Prepare the Parquet folder

  • Service -> S3 -> fight2016csv -> Create folder -> flight_parquet
fight2016csv/
  +-- cfn
  +-- scripts
  +-- data
  +-- flight_parquet    # new

4-1. Run an ETL job to convert flight_data.private_data into private Parquet data on S3 via the AWS Glue Jobs GUI - only data for January

  • Service -> AWS Glue -> ETL -> Jobs -> Add job
  • Name: convert-to-parquet
  • IAM Role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Script: Use the proposed script by AWS Glue
This job runs
  ◉ A proposed script generated by AWS Glue
  ◉ Python
  • Script file name: convert-to-parquet
  • Store the script in the "scripts" folder created at 3-0
S3 path where the script is stored:
s3://fight2016csv/scripts/
  • Data source: private_data
 ◉ private_data
  • Data target: parquet in S3
Data store: Amazon S3
Format: Parquet
Target path: s3://fight2016csv/flight_parquet/
  • Save the job and run it; wait until it finishes (a simplified sketch follows)
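
The proposed script has the same skeleton as the 3-1 sketch; only the source table and the output format change. A simplified sketch (column mapping again omitted; restricting the output to January data would be a manual edit to this script and is not shown):

# Simplified sketch of the proposed job script: read private_data and write Parquet (step 4-1).
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

source = glueContext.create_dynamic_frame.from_catalog(
    database="flight_data", table_name="private_data", transformation_ctx="source")

glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://fight2016csv/flight_parquet/"},
    format="parquet",
    transformation_ctx="sink")

job.commit()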

4-2. Crawl to create a table - flight_data.flight_parquet

  • Crawlers -> Add crawler
  • Crawler name: crawl-private-parquet
  • Data store: S3
  • Crawl data generated at 4-1
Crawl data in:
◉ Specified path in my account
  Include path: s3://fight2016csv/flight_parquet/
  • IAM Role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Frequency: Run on demand
  • Database: flight_data
  • Finish, run it now, and wait until it finishes

4-3. Query the data

  • Service -> Athena -> Execute query
select count(1) from flight_parquet
