Started work on a workflow JSON
alexanderdean committed Jun 9, 2012
1 parent 673fcac commit 3d16d6f
Showing 2 changed files with 61 additions and 2 deletions.
4 changes: 2 additions & 2 deletions hive/etl/daily-etl.rb
@@ -10,11 +10,11 @@
# software distributed under the Apache License Version 2.0 is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the Apache License Version 2.0 for the specific language governing permissions and limitations there under.

#
# This Ruby script runs the daily ETL (extract, transform, load)
# process which transforms the raw CloudFront log data into
# SnowPlow-formatted Hive data tables, optimised for analysis.
#

# This is a three-step process:
# 1. Extract the CloudFront log files to a temporary SnowPlow event data table (using the custom Deserializer)
# 2. Load the temporary event data into the final SnowPlow data table, partitioning by date and user
59 changes: 59 additions & 0 deletions hive/etl/jobflow.json
@@ -0,0 +1,59 @@
[
  {
    "Name": "ETL Step 0: Create the SnowPlow event table as needed",
    "ActionOnFailure": "TERMINATE_JOB_FLOW",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar",
      "Args": [
        "-input", "s3n://elasticmapreduce/samples/freebase/input/",
        "-output", "s3n://<bucket>/freebase/step1out/",
        "-mapper", "s3n://elasticmapreduce/samples/freebase/code/mapper.py"
      ]
    }
  },
  {
    "Name": "ETL Step 1: extract CloudFront log files to a temporary SnowPlow data table",
    "ActionOnFailure": "TERMINATE_JOB_FLOW",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar",
      "Args": [
        "-input", "s3n://<bucket>/freebase/step1out/",
        "-output", "s3n://<bucket>/freebase/step2out/",
        "-mapper", "s3n://elasticmapreduce/samples/freebase/code/top_sdb_mapper.rb",
        "-reducer", "s3n://elasticmapreduce/samples/freebase/code/top_sdb_reducer.rb",
        "-cacheFile", "s3n://elasticmapreduce/samples/freebase/code/base64.rb#base64.rb",
        "-cacheFile", "s3n://elasticmapreduce/samples/freebase/code/aws_sdb.rb#aws_sdb.rb"
      ]
    }
  },
  {
    "Name": "ETL Step 2: Load the temporary event data into the SnowPlow event table",
    "ActionOnFailure": "TERMINATE_JOB_FLOW",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar",
      "Args": [
        "-input", "s3n://elasticmapreduce/samples/freebase/input/",
        "-output", "s3n://<bucket>/freebase/names/step1",
        "-mapper", "s3n://elasticmapreduce/samples/freebase/code/name_mapper.rb",
        "-reducer", "s3n://elasticmapreduce/samples/freebase/code/name_reducer.rb",
        "-cacheFile", "s3n://elasticmapreduce/samples/freebase/code/base64.rb#base64.rb",
        "-cacheFile", "s3n://elasticmapreduce/samples/freebase/code/aws_sdb.rb#aws_sdb.rb"
      ]
    }
  },
  {
    "Name": "ETL Step 3: Delete the temporary event data",
    "ActionOnFailure": "TERMINATE_JOB_FLOW",
    "HadoopJarStep": {
      "Jar": "/home/hadoop/contrib/streaming/hadoop-0.18-streaming.jar",
      "Args": [
        "-input", "s3n://elasticmapreduce/samples/freebase/input/",
        "-output", "s3n://<bucket>/freebase/names/step1",
        "-mapper", "s3n://elasticmapreduce/samples/freebase/code/name_mapper.rb",
        "-reducer", "s3n://elasticmapreduce/samples/freebase/code/name_reducer.rb",
        "-cacheFile", "s3n://elasticmapreduce/samples/freebase/code/base64.rb#base64.rb",
        "-cacheFile", "s3n://elasticmapreduce/samples/freebase/code/aws_sdb.rb#aws_sdb.rb"
      ]
    }
  }
]
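
The jobflow file above still contains `<bucket>` placeholders in its step arguments, so it would need some preprocessing before it could be submitted. As a minimal sketch of how a Ruby script might do that, the helper below parses the JSON, checks that each step has the three keys used in the file (`Name`, `ActionOnFailure`, `HadoopJarStep`), and substitutes a bucket name. The function name `load_jobflow` and the constant `REQUIRED_STEP_KEYS` are hypothetical and not part of this commit; only Ruby's standard `json` library is assumed.

```ruby
require 'json'

# Hypothetical helper, not part of the commit: the keys every step in
# jobflow.json is expected to carry.
REQUIRED_STEP_KEYS = %w[Name ActionOnFailure HadoopJarStep].freeze

# Parse a jobflow definition and return its steps with the "<bucket>"
# placeholder in each step's Args replaced by the given bucket name.
def load_jobflow(json_text, bucket)
  steps = JSON.parse(json_text)
  raise ArgumentError, 'jobflow must be a JSON array of steps' unless steps.is_a?(Array)

  steps.each_with_index.map do |step, i|
    # Fail early if a step lacks one of the expected keys.
    missing = REQUIRED_STEP_KEYS - step.keys
    raise ArgumentError, "step #{i} is missing #{missing.join(', ')}" unless missing.empty?

    jar_step = step['HadoopJarStep']
    args = jar_step.fetch('Args', []).map { |arg| arg.gsub('<bucket>', bucket) }

    # Return a copy of the step; the parsed input is left unmodified.
    step.merge('HadoopJarStep' => jar_step.merge('Args' => args))
  end
end
```

A caller could then run something like `load_jobflow(File.read('hive/etl/jobflow.json'), 'my-bucket')`, where `my-bucket` is an illustrative name, and pass the resulting steps to whatever tool submits the job flow.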
