Source data:

https://s3.console.aws.amazon.com/s3/buckets/crawler-public-ap-northeast-1/flight/

s3://crawler-public-ap-northeast-1/flight/2016/csv/

Steps:

1. Prepare environment

1-0. Fetch source code

git clone git@github.com:komushi/aws-glue-study.git

1-1. Create an S3 bucket and upload the CloudFormation template files

Log in to the AWS Management Console, then:

  • Service -> S3 -> Create bucket (e.g. fight2016csv); choose any bucket name you like
  • Create a folder named cfn:
fight2016csv/
  +-- cfn
  • Upload the "iam.yaml" / "sg.yaml" / "vpc.yaml" files from the "cfn" folder of the repository cloned in 1-0 into that cfn folder
  • Copy the full URL of each uploaded file and save it for later use (see the boto3 sketch after this list)
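
These console steps can also be scripted. A minimal boto3 sketch, assuming the bucket name fight2016csv, the ap-northeast-1 region, and that the repository was cloned into the current directory (adjust all three to your own setup):

# Sketch only: create the bucket and upload the nested templates under cfn/ (step 1-1).
import boto3

bucket = "fight2016csv"                          # use your own bucket name
s3 = boto3.client("s3", region_name="ap-northeast-1")

s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "ap-northeast-1"},
)
for name in ("iam.yaml", "sg.yaml", "vpc.yaml"):
    s3.upload_file(f"aws-glue-study/cfn/{name}", bucket, f"cfn/{name}")
    # URL to paste into master.yaml in step 1-2; the console's "Copy URL" value also works
    print(f"https://{bucket}.s3.ap-northeast-1.amazonaws.com/cfn/{name}")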

1-2. Edit master.yaml file

  • Open "master.yaml" under "cfn" with proper editor
  • Replace URL links ended with "iam.yaml" / "sg.yaml" / "vpc.yaml" with their couterparts copied above and save

1-3. Use CloudFormation to prepare environment

  • Service -> CloudFormation -> Create Stack
  • Select template -> Upload a template to Amazon S3
  • Browse and select the "master.yaml" just edited and saved
  • Stack name: aws-glue-study
  • Enter a UserName and UserPassword (blank values will cause creation to fail!)
  • Check: "I acknowledge that AWS CloudFormation might create IAM resources with custom names."
  • Wait until the status changes to "CREATE_COMPLETE" (a boto3 sketch follows the stack list below)
You will see stacks like:
aws-glue-study-SecurityGroupStack-163O1XHJDA4S1 [NESTED]
aws-glue-study-IAMStack-MBEUULSULYPQ [NESTED]
aws-glue-study-VPCStack-GR86W4ABT8H1 [NESTED]
aws-glue-study
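
The stack can also be created with boto3. A minimal sketch, assuming the template parameters are named UserName and UserPassword as above and that master.yaml sits in the cloned repository; CAPABILITY_NAMED_IAM corresponds to the IAM acknowledgement checkbox:

# Sketch only: create the master stack from the edited template (step 1-3).
import boto3

cfn = boto3.client("cloudformation", region_name="ap-northeast-1")
with open("aws-glue-study/cfn/master.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="aws-glue-study",
    TemplateBody=template_body,
    Parameters=[
        {"ParameterKey": "UserName", "ParameterValue": "your-user-name"},        # must not be blank
        {"ParameterKey": "UserPassword", "ParameterValue": "your-password"},     # must not be blank
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],   # the IAM acknowledgement checkbox
)
cfn.get_waiter("stack_create_complete").wait(StackName="aws-glue-study")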

2. Play with the original data

2-1. Crawl the original data to create a table - flight_data.csv - via the AWS Glue Crawlers GUI

  • Service -> AWS Glue -> Crawlers -> Add Crawler
  • Crawler name: crawl-public-flight-2016-csv
  • Data store: s3://crawler-public-ap-northeast-1/flight/2016/csv/
  • Choose an existing IAM role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Frequency: Run on demand
  • Output Database: flight_data
  • Run the crawler and wait until it finishes (see the boto3 sketch below)
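
The crawler can also be defined with boto3. A minimal sketch; the role name below is the one from the stack above, so substitute your own, and the console wizard offers to create the flight_data database for you, so here it is created explicitly first:

# Sketch only: define and start the crawler over the public flight data (step 2-1).
import boto3

glue = boto3.client("glue", region_name="ap-northeast-1")
glue.create_database(DatabaseInput={"Name": "flight_data"})
glue.create_crawler(
    Name="crawl-public-flight-2016-csv",
    Role="aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX",   # your role
    DatabaseName="flight_data",
    Targets={"S3Targets": [{"Path": "s3://crawler-public-ap-northeast-1/flight/2016/csv/"}]},
)
glue.start_crawler(Name="crawl-public-flight-2016-csv")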

2-2. Query the data at Athena

  • Service -> Athena -> Execute query against the flight_data database (a boto3 sketch follows):
select count(1) from csv
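
The same query can be submitted with boto3. A minimal sketch; the results prefix s3://fight2016csv/athena-results/ is an arbitrary choice for illustration, not part of the setup above:

# Sketch only: run the count query against the crawled table (step 2-2).
import boto3

athena = boto3.client("athena", region_name="ap-northeast-1")
resp = athena.start_query_execution(
    QueryString="select count(1) from csv",
    QueryExecutionContext={"Database": "flight_data"},
    ResultConfiguration={"OutputLocation": "s3://fight2016csv/athena-results/"},   # any writable prefix
)
print(resp["QueryExecutionId"])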

3. Import into a private S3 bucket as CSV

3-0. Prepare the data sink by creating folders

  • Under the bucket created in 1-1, create two more folders as shown below (see the boto3 sketch after this listing)
fight2016csv/
  +-- cfn
  +-- scripts    # new
  +-- data       # new
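
S3 "folders" are just key prefixes, so the same layout can be created with boto3. A minimal sketch, assuming the bucket from 1-1:

# Sketch only: create the scripts/ and data/ prefixes (step 3-0).
import boto3

s3 = boto3.client("s3", region_name="ap-northeast-1")
for prefix in ("scripts/", "data/"):
    s3.put_object(Bucket="fight2016csv", Key=prefix)   # a zero-byte object acts as a "folder"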

3-1. Run an ETL job to write private CSV data to S3 via the AWS Glue Jobs GUI

  • Service -> AWS Glue -> Jobs -> Add job
  • Name: job_import_csv
  • IAM Role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Script: Use the proposed script by AWS Glue
  ◉ A proposed script generated by AWS Glue
  ◉ Python
  • Script file name: job_import_csv
  • S3 path where the script is stored
  s3://fight2016csv/scripts               # Created at 3-0
  • Data source: csv - created at step 2-1
  • Data target: create a gzip-compressed CSV table in your own S3 bucket
 ◉ Create tables in your data target
    Data store: Amazon S3
    Format: CSV
    Compression type: gzip
    Target path: s3://fight2016csv/data/   # Created at 3-0
  • Save the job and edit the script
  • Run the job (a simplified sketch of the generated script follows)
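
The script AWS Glue proposes for this job follows a standard skeleton. A simplified sketch of roughly what it looks like; the generated ApplyMapping over all columns is omitted, and the exact option names can vary between Glue versions:

# Simplified sketch of the proposed job script: read the catalog table "csv"
# and write gzip-compressed CSV under the data/ prefix (step 3-1).
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source: the table crawled in step 2-1.
source = glueContext.create_dynamic_frame.from_catalog(
    database="flight_data", table_name="csv", transformation_ctx="source")

# (The real proposed script applies an ApplyMapping over every column here.)

# Sink: gzip-compressed CSV in the private bucket.
glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://fight2016csv/data/", "compression": "gzip"},
    format="csv",
    transformation_ctx="sink")

job.commit()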

3-2. Crawl to create a table - flight_data.private_data

  • Crawlers -> Add crawler
  • Crawler name: crawl-private-flight-2016-csv
  • Data store: choose the folder used as the data target at 3-1
Data store: S3
Crawl data in
  ◉ Specified path in my account
  Include path: s3://fight2016csv/data/    # data generated at 3-1
  • Choose an existing IAM role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Frequency: Run on demand
  • Output Database: flight_data
  • Prefix: private_
  • Finish, run it now, and wait until it finishes (see the boto3 sketch below)
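
Compared with the crawler in 2-1, only the include path and the table prefix change. A minimal boto3 sketch:

# Sketch only: crawl the private CSV data with a "private_" table prefix (step 3-2).
import boto3

glue = boto3.client("glue", region_name="ap-northeast-1")
glue.create_crawler(
    Name="crawl-private-flight-2016-csv",
    Role="aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX",   # your role
    DatabaseName="flight_data",
    TablePrefix="private_",                                # yields the table name private_data
    Targets={"S3Targets": [{"Path": "s3://fight2016csv/data/"}]},
)
glue.start_crawler(Name="crawl-private-flight-2016-csv")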

3-3. Query the data

  • Service -> Athena -> Query editor
select count(1) from private_data

4. ETL to Parquet

4-0. Prepare the Parquet folder

  • Service -> S3 -> fight2016csv -> Create folder -> flight_parquet
fight2016csv/
  +-- cfn
  +-- scripts
  +-- data
  +-- flight_parquet    # new

4-1. Run an ETL job to convert flight_data.private_data into private Parquet data on S3 via the AWS Glue Jobs GUI - only data for January

  • Service -> AWS Glue -> ETL -> Jobs -> Add job
  • Name: convert-to-parquet
  • IAM Role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Script: Use the proposed script by AWS Glue
This job runs
  ◉ A proposed script generated by AWS Glue
  ◉ Python
  • Script file name: convert-to-parquet
  • Store the script in the "scripts" folder created at 3-0
S3 path where the script is stored:
s3://fight2016csv/scripts/
  • Data source: private_data
 ◉ private_data
  • Data target: parquet in S3
Data store: Amazon S3
Format: Parquet
Target path: s3://fight2016csv/flight_parquet/
  • Save the job and run it; wait until it finishes (a simplified sketch follows)
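
The proposed script has the same skeleton as the 3-1 sketch; only the source table and the output format change. A simplified sketch (column mapping again omitted; restricting the output to January data would be a manual edit to this script and is not shown):

# Simplified sketch of the proposed job script: read private_data and write Parquet (step 4-1).
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

source = glueContext.create_dynamic_frame.from_catalog(
    database="flight_data", table_name="private_data", transformation_ctx="source")

glueContext.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://fight2016csv/flight_parquet/"},
    format="parquet",
    transformation_ctx="sink")

job.commit()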

4-2. Crawl to create a table - flight_data.flight_parquet

  • Crawlers -> Add crawler
  • Crawler name: crawl-private-parquet
  • Data store: S3
  • Crawl data generated at 4-1
Crawl data in:
◉ Specified path in my account
  Include path: s3://fight2016csv/flight_parquet/
  • IAM Role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
  • Frequency: Run on demand
  • Database: flight_data
  • Finish, run it now, and wait until it finishes

4-3. Query the data

  • Service -> Athena -> Execute query
select count(1) from flight_parquet
