My first stab at Hadoop

Once you have cloned this repository, this guide will take you through the steps to set up Hadoop and run two simple word count examples.

Software Dependencies

This setup runs on Ubuntu 12.04. You will need the following installed on your host machine:

  1. VirtualBox
  2. Vagrant
  3. Git

Note that there are quite a few other software dependencies, such as Sun Java and of course Hadoop itself, but the cookbook already takes care of installing these.

Steps

  1. To download and provision the Vagrant VirtualBox image, run vagrant up
  2. To log into the box, run vagrant ssh
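
For reference, a minimal first session looks something like this (run from the directory containing the Vagrantfile):

vagrant up    # downloads the base box and provisions Hadoop via the cookbook
vagrant ssh   # opens a shell inside the running VM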

For the following steps I have followed Michael Noll's tutorial. I have found his guide really easy to follow, so I will simply refer you to that site for the details of each step.

Prerequisites section

The vagrant image comes with Sun Java 6 so we can skip this section.
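
If you want to double-check, you can query the Java version from inside the VM (the exact version string will differ):

java -version   # should report a 1.6.x (Java 6) runtime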

Adding a dedicated Hadoop system user

Follow guide
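
For convenience, the commands from that section of Michael's guide are roughly as follows (the hadoop group and hduser user names are his conventions):

sudo addgroup hadoop                   # dedicated group for Hadoop processes
sudo adduser --ingroup hadoop hduser   # dedicated user in that group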

Change ownerships to new user

sudo chown -R hduser:hadoop /etc/hadoop
sudo chown -R hduser:hadoop /usr/lib/hadoop

Configuring SSH

Follow guide
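
Again for convenience, the gist of that section is to give hduser a passwordless SSH key for localhost (commands paraphrased from Michael's guide):

su - hduser                                       # switch to the Hadoop user
ssh-keygen -t rsa -P ""                           # generate a key with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # authorise the key for localhost logins
ssh localhost                                     # accept the host key when prompted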

Disabling IPv6

This should already be auto-generated by the cookbook.

If you get binding errors (typically mentioning 0.0.0.0), then follow the guide.
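
For reference, the guide's fix is to disable IPv6 system-wide by appending the lines below to /etc/sysctl.conf and rebooting:

# append to /etc/sysctl.conf -- disables IPv6 so Hadoop binds to IPv4
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1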

Hadoop

Installation

Hadoop is already installed on the image. The install directories are:

  • /usr/lib/hadoop
  • /usr/lib/hadoop-hdfs
  • /usr/lib/hadoop-mapreduce
  • /usr/lib/hadoop-yarn

And the configuration files can be found at /etc/hadoop/conf
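
You can sanity-check the installation from inside the VM, for example:

hadoop version        # prints the installed Hadoop version
ls /etc/hadoop/conf   # lists the generated configuration files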

However, to keep this guide in sync with Michael's (which installs Hadoop in /usr/local/hadoop), we shall add some symbolic links. Check that you are user vagrant (if you are user hduser, just type exit, maybe twice), then:

sudo ln -s /usr/lib/hadoop /usr/local/hadoop
sudo ln -s /usr/lib/hadoop/libexec/ /usr/lib/hadoop-hdfs/libexec
sudo ln -s /etc/hadoop/conf /usr/local/hadoop/conf

Update $HOME/.bashrc

No updates are needed; the required environment variables are already in place on this image.
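
For reference only, the variables Michael's guide would normally add look something like this (the paths are illustrative; do not add them on this image):

# from Michael's guide -- NOT needed on this image
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$HADOOP_HOME/bin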

Configuration

Should be auto-generated

Formatting the HDFS filesystem via the NameNode

As user hduser, just type hadoop namenode -format

Starting your single-node cluster

This is a little different to the way the guide starts it. As user hduser:

cd /usr/local/hadoop/sbin
./hadoop-daemon.sh start namenode
./hadoop-daemon.sh start datanode

This will start up a NameNode and a DataNode. To check they are running, type jps. You should see Java processes called NameNode and DataNode.
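
For example, the jps output might look something like this (the process IDs will differ):

2431 NameNode
2525 DataNode
2601 Jps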

Note: if you get the following error:

/usr/lib/hadoop-hdfs/bin/hdfs: line 34: /usr/lib/hadoop-hdfs/bin/../libexec/hdfs-config.sh: No such file or directory

then one way to fix it is to create a symbolic link (this is the same link created in the symlink step above):

sudo ln -s /usr/lib/hadoop/libexec/ /usr/lib/hadoop-hdfs/libexec

Stopping your single-node cluster

cd /usr/local/hadoop/sbin
./hadoop-daemon.sh stop datanode
./hadoop-daemon.sh stop namenode

A simple word count example to see if we can get Hadoop to work

  1. Create a file with some text:

     mkdir ~/input
     echo "Could this guide actually be a useful guide" > ~/input/sample.txt

  2. Create a folder in Hadoop to store this file:

     hdfs dfs -mkdir -p /user/hduser/sample

     To check it worked, hdfs dfs -ls /user/hduser should show output similar to:

     Found 1 items
     drwxr-xr-x   - hduser supergroup          0 2014-12-23 15:28 /user/hduser/sample

  3. Upload the file into the Hadoop filesystem:

     hdfs dfs -copyFromLocal ~/input /user/hduser/sample
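
To verify the upload, list the directory in HDFS:

hdfs dfs -ls /user/hduser/sample/input   # should list sample.txt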

Run the MapReduce job

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples-2.2.0.2.0.11.0-1.jar wordcount /user/hduser/sample/input /user/hduser/sample-output
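
Once the job finishes, the output directory should contain a _SUCCESS marker and the reducer output; you can check with:

hdfs dfs -ls /user/hduser/sample-output   # expect _SUCCESS and part-r-00000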

Extract the results

hdfs dfs -getmerge /user/hduser/sample-output ~/output

This will create a file called output in your home directory containing our list of words and the number of times each occurred.
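
For our one-line sample, the merged output should look something like this (each line is a word and a tab-separated count; note that capitalised words sort first):

Could     1
a         1
actually  1
be        1
guide     2
this      1
useful    1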

See the guide for details of how to view the output without extracting it from Hadoop's filesystem. It also has an example using a larger test set.
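
For instance, one way to peek at the output directly in HDFS is:

hdfs dfs -cat /user/hduser/sample-output/part-r-00000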