Source data:
https://s3.console.aws.amazon.com/s3/buckets/crawler-public-ap-northeast-1/flight/
s3://crawler-public-ap-northeast-1/flight/2016/csv/
Steps:
1-0. Clone the repository
git clone git@github.com:komushi/aws-glue-study.git
1-1. Create an S3 bucket and upload the CloudFormation templates
Log in to the AWS Management Console, then:
- Service -> S3 -> Create bucket (e.g. fight2016csv; choose any name you like)
- Create a folder named: cfn
fight2016csv/
+-- cfn
- Upload the "iam.yaml" / "sg.yaml" / "vpc.yaml" files from the "cfn" folder of the repository cloned in 1-0 into this "cfn" folder
- Copy the full URL of each uploaded file and keep it for later use
- Open "master.yaml" under the repository's "cfn" folder in a text editor
- Replace the URLs ending with "iam.yaml" / "sg.yaml" / "vpc.yaml" with their counterparts copied above, then save (a boto3 sketch of the bucket setup follows below)
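If you prefer to script this S3 setup, a minimal boto3 sketch is shown below. It assumes the bucket name fight2016csv, the ap-northeast-1 region used by the source data, and that the templates sit under cfn/ in the cloned repository; adjust these to your own setup.

    import boto3

    s3 = boto3.client("s3", region_name="ap-northeast-1")

    # Create the bucket (outside us-east-1 a LocationConstraint is required)
    s3.create_bucket(
        Bucket="fight2016csv",
        CreateBucketConfiguration={"LocationConstraint": "ap-northeast-1"},
    )

    # Upload the nested templates into the cfn/ folder of the bucket
    for name in ("iam.yaml", "sg.yaml", "vpc.yaml"):
        s3.upload_file("aws-glue-study/cfn/" + name, "fight2016csv", "cfn/" + name)
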
1-2. Create the CloudFormation stack
- Service -> CloudFormation -> Create stack
- Select template -> Upload a template to Amazon S3
- Browse and select the "master.yaml" you just edited and saved
- Stack name: aws-glue-study
- Enter a proper UserName and UserPassword (blank values will cause the creation to fail!)
- Check: "I acknowledge that AWS CloudFormation might create IAM resources with custom names."
- Wait until the status changes to "CREATE_COMPLETE"
You will see stacks like:
aws-glue-study-SecurityGroupStack-163O1XHJDA4S1 [NESTED]
aws-glue-study-IAMStack-MBEUULSULYPQ [NESTED]
aws-glue-study-VPCStack-GR86W4ABT8H1 [NESTED]
aws-glue-study
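The same stack can also be created without the console. The boto3 sketch below assumes you upload the edited master.yaml to the cfn/ folder as well so it has an S3 URL; the UserName/UserPassword parameter keys come from the prompt above, and CAPABILITY_NAMED_IAM is the API equivalent of the custom-named IAM resources acknowledgement.

    import boto3

    cfn = boto3.client("cloudformation", region_name="ap-northeast-1")

    cfn.create_stack(
        StackName="aws-glue-study",
        # URL of the edited master.yaml uploaded to the cfn/ folder
        TemplateURL="https://fight2016csv.s3.ap-northeast-1.amazonaws.com/cfn/master.yaml",
        Parameters=[
            {"ParameterKey": "UserName", "ParameterValue": "your-user-name"},
            {"ParameterKey": "UserPassword", "ParameterValue": "your-password"},
        ],
        # Same as checking the IAM acknowledgement in the console
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )

    # Block until the stack (and its nested stacks) reaches CREATE_COMPLETE
    cfn.get_waiter("stack_create_complete").wait(StackName="aws-glue-study")
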
2-1. Crawl the public flight CSV data
- Service -> AWS Glue -> Crawlers -> Add crawler
- Crawler name: crawl-public-flight-2016-csv
- Data store: s3://crawler-public-ap-northeast-1/flight/2016/csv/
- Choose an existing IAM role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
- Frequency: Run on demand
- Output Database: flight_data
- Finish the wizard, run the crawler, and wait until it finishes (a boto3 sketch follows)
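The crawler can also be defined with boto3; a rough sketch using the role and paths above (the role name will differ in your account):

    import boto3

    glue = boto3.client("glue", region_name="ap-northeast-1")

    glue.create_crawler(
        Name="crawl-public-flight-2016-csv",
        # AWSGlueServiceRole created by the CloudFormation stack
        Role="aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX",
        DatabaseName="flight_data",
        Targets={"S3Targets": [{"Path": "s3://crawler-public-ap-northeast-1/flight/2016/csv/"}]},
        # "Run on demand" simply means no Schedule is set
    )

    glue.start_crawler(Name="crawl-public-flight-2016-csv")
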
2-2. Query the crawled table with Athena
- Service -> Athena -> Query editor, select the flight_data database, and run:
select count(1) from csv
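The same count can be run programmatically. The sketch below assumes an athena-results/ prefix in your bucket as the mandatory query-result location (any writable S3 prefix works).

    import time
    import boto3

    athena = boto3.client("athena", region_name="ap-northeast-1")

    qid = athena.start_query_execution(
        QueryString="select count(1) from csv",
        QueryExecutionContext={"Database": "flight_data"},
        ResultConfiguration={"OutputLocation": "s3://fight2016csv/athena-results/"},
    )["QueryExecutionId"]

    # Poll until the query leaves the QUEUED/RUNNING states, then read the count
    while athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"] in ("QUEUED", "RUNNING"):
        time.sleep(1)
    rows = athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]
    print(rows[1]["Data"][0]["VarCharValue"])  # row 0 is the column header
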
3-0. Create the scripts and data folders
- Under the bucket created in 1-1, create two more folders so the layout looks like this:
fight2016csv/
+-- cfn
+-- scripts # new
+-- data # new
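In S3 a console "folder" is just a zero-byte object whose key ends with "/", so the two folders can also be created from boto3:

    import boto3

    s3 = boto3.client("s3", region_name="ap-northeast-1")
    for prefix in ("scripts/", "data/"):
        s3.put_object(Bucket="fight2016csv", Key=prefix)
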
3-1. Run a Glue job to import the CSV data into your own bucket
- Service -> AWS Glue -> Jobs -> Add job
- Name: job_import_csv
- IAM Role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
- Script: Use the script proposed by AWS Glue (sketched below, after this step)
◉ A proposed script generated by AWS Glue
◉ Python
- Script file name: job_import_csv
- S3 path where the script is stored
s3://fight2016csv/scripts # Created at 3-0
- Data source: csv (created at Step 2-1)
- Data target: create a gzip-compressed CSV table in your own S3 bucket
◉ Create tables in your data target
Data store: Amazon S3
Format: CSV
Compression type: gzip
Target path: s3://fight2016csv/data/ # Created at 3-0
- Save job and edit script
- Run job
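The proposed script that Glue generates for this job looks roughly like the sketch below (the real generated script also inserts an ApplyMapping step listing every crawled column, omitted here). The database, table, target path, and compression follow the values entered above.

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Source: the "csv" table crawled into the flight_data database at 2-1
    datasource = glueContext.create_dynamic_frame.from_catalog(
        database="flight_data", table_name="csv", transformation_ctx="datasource")

    # Target: gzip-compressed CSV files under the data/ folder created at 3-0
    glueContext.write_dynamic_frame.from_options(
        frame=datasource,
        connection_type="s3",
        connection_options={"path": "s3://fight2016csv/data/", "compression": "gzip"},
        format="csv",
        transformation_ctx="datasink")

    job.commit()
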
3-2. Crawl the imported private CSV data
- Crawlers -> Add crawler
- Crawler name: crawl-private-flight-2016-csv
- Data store: choose the folder specified as the data target at 3-1
Data store: S3
Crawl data in
◉ Specified path in my account
Include path: s3://fight2016csv/data/ # data generated at 3-1
- Choose an existing IAM role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
- Frequency: Run on demand
- Output Database: flight_data
- Prefix: private_
- Finish, run it now, and wait until it finishes (see the polling sketch below)
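Because the include path points at data/ and the table prefix is private_, the resulting table is flight_data.private_data. "Run it now, wait until finish" can also be scripted by polling the crawler state:

    import time
    import boto3

    glue = boto3.client("glue", region_name="ap-northeast-1")

    glue.start_crawler(Name="crawl-private-flight-2016-csv")

    # The crawler returns to READY once its run (and table creation) has finished
    while glue.get_crawler(Name="crawl-private-flight-2016-csv")["Crawler"]["State"] != "READY":
        time.sleep(10)
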
3-3. Query the private table with Athena
- Service -> Athena -> Query editor
select count(1) from private_data
4-0. Create the flight_parquet folder
- Service -> S3 -> fight2016csv -> Create folder -> flight_parquet
fight2016csv/
+-- cfn
+-- scripts
+-- data
+-- flight_parquet # new
4-1. Run an ETL job to convert flight_data.private_data into private Parquet data on S3 - via the Glue job GUI, January data only
- Service -> AWS Glue -> ETL -> Jobs -> Add job
- Name: convert-to-parquet
- IAM Role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
- Script: Use the script proposed by AWS Glue (a sketch follows after this step)
This job runs
◉ A proposed script generated by AWS Glue
◉ Python
- Script file name: convert-to-parquet
- Store the script in the "scripts" folder created at 3-0
S3 path where the script is stored:
s3://fight2016csv/scripts/
- Data source: private_data
◉ private_data
- Data target: parquet in S3
Data store: Amazon S3
Format: Parquet
Target path: s3://fight2016csv/flight_parquet/
- Save the job, run it, and wait until it finishes
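As above, Glue proposes the script; a rough sketch of the conversion, with a January-only filter added as the step title suggests, is shown below. The month column name and its integer type are assumptions, so check the crawled private_data schema and adjust the filter.

    import sys
    from awsglue.transforms import Filter
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glueContext = GlueContext(sc)
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Source: the private_data table created by crawl-private-flight-2016-csv
    src = glueContext.create_dynamic_frame.from_catalog(
        database="flight_data", table_name="private_data", transformation_ctx="src")

    # Keep only January rows ("month" and its integer type are assumptions)
    january = Filter.apply(frame=src, f=lambda row: row["month"] == 1)

    # Target: Parquet files under the flight_parquet/ folder
    glueContext.write_dynamic_frame.from_options(
        frame=january,
        connection_type="s3",
        connection_options={"path": "s3://fight2016csv/flight_parquet/"},
        format="parquet",
        transformation_ctx="sink")

    job.commit()
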
4-2. Crawl the Parquet data
- Crawlers -> Add crawler
- Crawler name: crawl-private-parquet
- Data store: S3
- Crawl data generated at 4-1
Crawl data in:
◉ Specified path in my account
Include path: s3://fight2016csv/flight_parquet/
- IAM Role: AWSGlueServiceRole created by CloudFormation
aws-glue-study-IAMStack-MBEUULS-AWSGlueServiceRole-D38ZRN3RZCZX
- Frequency: Run on demand
- Database: flight_data
- Finish, run it now, and wait until it finishes
4-3. Query the Parquet table with Athena
- Service -> Athena -> Query editor
select count(1) from flight_parquet