Introduction to Hadoop and MapReduce


Introduction

This repository contains source code for the assignments of Udacity's course Introduction to Hadoop and MapReduce, which launched on 15 November 2013.
It is a short course developed by Cloudera in association with Udacity. The instructors are Sarah Sproehnle and Ian Wrigley, both from Cloudera, and Gundega Dekena, a Course Developer at Udacity.

The course does not mandate a specific programming language for Hadoop MapReduce jobs, but it mainly teaches them in Python, run through Hadoop Streaming.
I have written Hadoop MapReduce code for the 2 problem statements (3 questions each) in 2 programming languages: Python and Java.
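To make the Hadoop Streaming model mentioned above concrete, here is a minimal local sketch of the map, sort, and reduce phases. It is not code from this repository: `run_local`, `word_mapper`, and `count_reducer` are hypothetical names, and the word count is only a stand-in job. In a real Streaming job the mapper and reducer are separate scripts that read lines from stdin and write tab-separated `key<TAB>value` lines to stdout, and the reducer has to detect key changes itself; `itertools.groupby` plays that role here.

```python
from itertools import groupby


def run_local(mapper, reducer, lines):
    """Simulate map -> sort (shuffle) -> reduce on a list of input lines."""
    # Map phase: each input line may yield zero or more (key, value) pairs.
    pairs = [pair for line in lines for pair in mapper(line)]
    # Shuffle/sort phase: the reducer sees pairs sorted by key.
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: call the reducer once per key with all of its values.
    results = []
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        results.append(reducer(key, [value for _, value in group]))
    return results


def word_mapper(line):
    # Hypothetical mapper: emit (word, 1) for every whitespace-separated word.
    for word in line.split():
        yield word, 1


def count_reducer(key, values):
    # Hypothetical reducer: sum the values collected for one key.
    return key, sum(values)
```

Calling `run_local(word_mapper, count_reducer, some_lines)` is a quick way to check mapper and reducer logic before submitting a job to the cluster.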

Instructions for VM download / setup

Please refer to the instructions document for details on the setup required to run these examples.

As mentioned in that document, a VM image with Hadoop can be downloaded from the Udacity website. Be warned that the VM file is 1.7 GB and that it does not uncompress with either 7-Zip or the default Windows Zip utility. On Windows, use WinRAR, WinZip, or Cygwin's unzip instead. On other operating systems, the unzip command will probably work.

Data

Input Files

Input files for ProblemStatement#1 and ProblemStatement#2 have also been uploaded to GitHub.
These compressed archives can also be downloaded from the Udacity servers: look here for the input file for Problem Statement 1 and here for Problem Statement 2.
These links are also listed in the instructions document provided by the Udacity course instructors.

Output Files

Output for ProblemStatement#1 and ProblemStatement#2 has also been uploaded to this GitHub repo for quick reference and validation.
Each output file is the Hadoop MR job output produced for the corresponding question.

Execution steps for Problem Statement 1 are also documented, for running the following in either Python or Java.

Question#1

Instead of breaking the sales down by store, give us a sales breakdown by product category across all of our stores.

  1. What is the value of total sales for the following categories?
    • Toys
    • Consumer Electronics
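The idea behind this question can be sketched as a mapper that emits `(category, cost)` and a reducer that sums per category. This is an illustration, not the repository's P1Q1 code. It assumes each purchase record is a tab-separated line with six fields in the order date, time, store, category, cost, payment; that layout is an assumption, not something stated in this README, and `map_category` and `total_by_category` are hypothetical names.

```python
from collections import defaultdict


def map_category(line):
    """Return (category, cost) for one purchase record, or None if malformed."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        return None  # skip records that lack the assumed six fields
    _date, _time, _store, category, cost, _payment = fields
    try:
        return category, float(cost)
    except ValueError:
        return None  # skip records whose cost is not a number


def total_by_category(lines):
    """Sum the cost of every valid record, grouped by category."""
    totals = defaultdict(float)
    for line in lines:
        pair = map_category(line)
        if pair is not None:
            category, cost = pair
            totals[category] += cost
    return dict(totals)
```

Skipping malformed lines instead of raising keeps one bad record from failing the whole job, which is a common defensive choice in mappers.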

Code

Java variant

P1Q1.java

Python variant

P1Q1_Mapper.py and P1Q1_Reducer.py

Solution

Please check pur_p1q1.tsv for the output of this problem statement.

Execution Log files

Please check pur_p1q1.log (Java) and pur_p1q1.log (Python) for the command-line execution logs.

Question#2

Find the monetary value for the highest individual sale for each separate store.

  1. What are the values for the following stores?
    • Reno
    • Toledo
    • Chandler
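A Streaming-style reducer for this question could look like the sketch below. It is a hedged illustration rather than the repository's P1Q2_Reducer.py: it assumes the mapper has already emitted `store<TAB>cost` lines and that those lines arrive sorted by store, as they would after the shuffle phase. `max_per_store` is a hypothetical name.

```python
def max_per_store(sorted_lines):
    """Yield (store, highest_sale) from 'store<TAB>cost' lines sorted by store."""
    current_store = None
    current_max = None
    for line in sorted_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue  # ignore lines that are not key/value pairs
        store, cost_text = parts
        try:
            cost = float(cost_text)
        except ValueError:
            continue  # ignore non-numeric costs
        if store != current_store:
            # Key changed: emit the finished store before starting the next.
            if current_store is not None:
                yield current_store, current_max
            current_store, current_max = store, cost
        else:
            current_max = max(current_max, cost)
    # Emit the last store, which has no following key change to trigger it.
    if current_store is not None:
        yield current_store, current_max
```

Because only the current key and its running maximum are kept, memory use stays constant no matter how many sales a store has.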

Code

Java variant

P1Q2.java

Python variant

P1Q2_Mapper.py and P1Q2_Reducer.py

Solution

Please check pur_p1q2.tsv for the output of this problem statement.

Execution Log files

Please check pur_p1q2.log (Java) and pur_p1q2.log (Python) for the command-line execution logs.

Question#3

Find the total sales value across all the stores, and the total number of sales. Assume there is only one reducer.

  1. Find
    • Total sales value across all the stores
    • Total number of sales
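With a single reducer, every record reaches the same process, so a global total and count can be kept in two variables. The sketch below shows that accumulation under the assumption that each record is a tab-separated line with six fields and the cost in the fifth; the layout and the name `total_and_count` are assumptions, not taken from the repository. It uses `Decimal` rather than `float`, which is a design choice of this sketch to avoid rounding drift when adding many currency amounts.

```python
from decimal import Decimal, InvalidOperation


def total_and_count(lines):
    """Return (total sales value, number of sales) over all valid records."""
    total = Decimal("0")
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 6:
            continue  # skip records that lack the assumed six fields
        try:
            cost = Decimal(fields[4])
        except InvalidOperation:
            continue  # skip records whose cost is not a number
        total += cost
        count += 1
    return total, count
```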

Code

Java variant

P1Q3.java

Python variant

P1Q3_Mapper.py and P1Q3_Reducer.py

Solution

Please check pur_p1q3.tsv for the output of this problem statement.

Execution Log files

Please check pur_p1q3.log (Java) and pur_p1q3.log (Python) for the command-line execution logs.

Execution steps for Problem Statement 2 are also documented, for running the following in either Python or Java.

Question#1

Write a MapReduce program which will display the number of hits for each different file on the Web site.

  1. Find
    • How many hits were made to the page: /assets/js/the-associates.js?
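Counting hits per file comes down to extracting the requested path from each log line and counting occurrences. The sketch below assumes the access log uses the Common Log Format, where the request appears in double quotes as `"METHOD path PROTOCOL"`; the README does not state the log format, so treat that as an assumption. `REQUEST_RE`, `requested_path`, and `hits_per_file` are hypothetical names, and no URL normalization is attempted.

```python
import re
from collections import Counter

# Matches the quoted request, e.g. "GET /index.html HTTP/1.1",
# and captures the second token (the requested path).
REQUEST_RE = re.compile(r'"\S+ (\S+)[^"]*"')


def requested_path(line):
    """Return the requested path from one log line, or None if not found."""
    match = REQUEST_RE.search(line)
    return match.group(1) if match else None


def hits_per_file(lines):
    """Count how many log lines requested each path."""
    paths = (requested_path(line) for line in lines)
    return Counter(path for path in paths if path is not None)
```

Looking up `hits_per_file(lines)["/assets/js/the-associates.js"]` would then answer the question for that page.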

Code

Java variant

P2Q1.java

Python variant

P2Q1_Mapper.py and P2Q1_Reducer.py

Solution

Please check acc_p2q1.tsv for the output of this problem statement.

Execution Log files

Please check acc_p2q1.log (Java) and acc_p2q1.log (Python) for the command-line execution logs.

Question#2

Write a MapReduce program which determines the number of hits to the site made by each different IP Address.

  1. Find
    • How many hits were made by the IP address: 10.99.99.186?
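For hits per IP address, the key is the client address instead of the path. This sketch assumes the address is the first space-separated field of each log line, as in the Common Log Format; that is an assumption about the data, and `client_ip` and `hits_per_ip` are hypothetical names. The standard-library `ipaddress` module is used only to reject lines whose first field is not a valid address.

```python
import ipaddress
from collections import Counter


def client_ip(line):
    """Return the leading IP address of a log line, or None if invalid."""
    first_field = line.split(" ", 1)[0].strip()
    try:
        ipaddress.ip_address(first_field)  # raises ValueError if not an IP
    except ValueError:
        return None
    return first_field


def hits_per_ip(lines):
    """Count how many log lines came from each client IP address."""
    addresses = (client_ip(line) for line in lines)
    return Counter(ip for ip in addresses if ip is not None)
```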

Code

Java variant

P2Q2.java

Python variant

P2Q2_Mapper.py and P2Q2_Reducer.py

Solution

Please check acc_p2q2.tsv for the output of this problem statement.

Execution Log files

Please check acc_p2q2.log (Java) and acc_p2q2.log (Python) for the command-line execution logs.

Question#3

Find the most popular file on the Web site, that is, the file which had the most hits. Your Reducer should just write out the name of the file and the number of hits into HDFS.

  1. Find
    • Full path to the most popular file?
    • Number of hits to that file?
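One way to have the reducer write out only the winner is to count each key as it streams past and remember the best one seen so far. The sketch below assumes the reducer input is one requested path per line, already sorted by path; `most_popular` is a hypothetical name and this is not the repository's P2Q3 code. The result is only a global maximum when a single reducer sees every key, since with several reducers each one would report the maximum of its own subset.

```python
from itertools import groupby


def most_popular(sorted_paths):
    """Return (path, hits) for the most requested path in sorted input.

    Returns (None, 0) for empty input. On a tie, the path that sorts
    first wins because later paths must have strictly more hits.
    """
    best_path, best_hits = None, 0
    stripped = (path.rstrip("\n") for path in sorted_paths)
    # groupby collects consecutive equal paths, which is valid only
    # because the input is sorted.
    for path, group in groupby(stripped):
        hits = sum(1 for _ in group)
        if hits > best_hits:
            best_path, best_hits = path, hits
    return best_path, best_hits
```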

Code

Java variant

P2Q3.java

Python variant

P2Q3_Mapper.py and P2Q3_Reducer.py

Solution

Please check acc_p2q3.tsv for the output of this problem statement.

Execution Log files

Please check acc_p2q3.log (Java) and acc_p2q3.log (Python) for the command-line execution logs.

License

Copyright © 2013 Prashanth Babu.
Licensed under the Apache License, Version 2.0.
