Brandon Holt edited this page May 11, 2013 · 19 revisions

Apache Hadoop includes a distributed file system called "HDFS" which we plan to use in some incarnation in Grappa.

For now, we have the latest stable version of Hadoop downloaded from hadoop.apache.org, installed at: /sampa/share/hadoop-1.0.3

How to get it back up and running

For a variety of reasons, our HDFS setup sometimes dies or needs a kick to get it working again. Here's the sequence of commands I typically run to restart it:

# from 'n71.sampa'
/sampa/share/hadoop-1.0.3/bin/stop-dfs.sh
/sampa/share/hadoop-1.0.3/bin/start-dfs.sh
ssh n69
# from 'n69'
/sampa/share/polysh/polysh.py `sinfo -p grappa -o '%n' -h`
# from 'polysh' prompt:
ready (12)> sudo /sampa/share/hadoop-1.0.3/bin/stop-fuse-dfs.sh
ready (12)> :hide_password
<enter password>
ready (12)> sudo /sampa/share/hadoop-1.0.3/bin/start-fuse-dfs.sh

Environment variables:

JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64
HADOOP_HOME=/sampa/share/hadoop-1.0.3
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$JAVA_HOME/jre/lib/amd64/server:$HADOOP_HOME/c++/Linux-amd64-64/lib:$HADOOP_HOME/lib/native/Linux-amd64-64
CLASSPATH=$(echo $HADOOP_HOME/*.jar | tr ' ' ':'):$(echo $HADOOP_HOME/lib/*.jar | tr ' ' ':'):$HADOOP_HOME/conf
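The CLASSPATH line above globs every jar under $HADOOP_HOME and joins them with colons. A minimal sketch of how that expansion works, using a throwaway directory standing in for $HADOOP_HOME (the jar names here are illustrative, not the actual contents of the install):

```shell
# Stand-in for $HADOOP_HOME with a few dummy jars (names are made up).
HADOOP_HOME=$(mktemp -d)
mkdir -p "$HADOOP_HOME/lib" "$HADOOP_HOME/conf"
touch "$HADOOP_HOME/hadoop-core-1.0.3.jar" \
      "$HADOOP_HOME/lib/commons-logging-1.1.1.jar" \
      "$HADOOP_HOME/lib/log4j-1.2.15.jar"

# Same construction as above: glob the jars, join with ':' via tr,
# then append the conf directory so Hadoop's XML config is found.
CLASSPATH=$(echo $HADOOP_HOME/*.jar | tr ' ' ':'):$(echo $HADOOP_HOME/lib/*.jar | tr ' ' ':'):$HADOOP_HOME/conf
echo "$CLASSPATH"

rm -rf "$HADOOP_HOME"
```

Note the glob is expanded when the variable is assigned, so the jars must already exist at that point; re-source the line after adding jars to the install.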

Building

Include/library flags:

-I$(HADOOP_HOME)/c++/Linux-amd64-64/include
-I$(HADOOP_HOME)/src/c++/libhdfs
-I$(JAVA_HOME)/include
-L$(HADOOP_HOME)/c++/Linux-amd64-64/lib
-L$(JAVA_HOME)/jre/lib/amd64/server
-lhdfs
-ljvm
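A sketch of how these flags might fit into a Makefile rule for a libhdfs program; the install paths are the ones described above, and the `hdfs_test.c` target name is a placeholder:

```make
# Sketch: compiling a libhdfs program against the hadoop-1.0.3 install
# described on this page (adjust paths for your system).
HADOOP_HOME := /sampa/share/hadoop-1.0.3
JAVA_HOME   := /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64

HDFS_CFLAGS  := -I$(HADOOP_HOME)/c++/Linux-amd64-64/include \
                -I$(HADOOP_HOME)/src/c++/libhdfs \
                -I$(JAVA_HOME)/include
HDFS_LDFLAGS := -L$(HADOOP_HOME)/c++/Linux-amd64-64/lib \
                -L$(JAVA_HOME)/jre/lib/amd64/server \
                -lhdfs -ljvm

hdfs_test: hdfs_test.c
	$(CC) $(HDFS_CFLAGS) -o $@ $< $(HDFS_LDFLAGS)
```

Remember that anything linked against -ljvm also needs LD_LIBRARY_PATH set as shown above at runtime, since the JVM shared library is not on the default search path.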

Daemons

Configuration files for HDFS are in $(HADOOP_HOME)/conf.

  • masters: n71.sampa
  • slaves: [grappa nodes]?
  • core-site.xml, mapred-site.xml, hdfs-site.xml: Configure various things like:
    • where daemons run (n71 & all Grappa nodes)
    • block size, amount of memory for caching, etc.
    • amount of duplication (1)
    • where hadoop files go on each of the 'slaves' (/scratch/hadoop.{name,data})
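A sketch of what the corresponding conf/hdfs-site.xml entries could look like for the settings listed above; the exact values in the cluster's config may differ:

```xml
<!-- sketch of conf/hdfs-site.xml matching the bullets above -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/scratch/hadoop.name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/scratch/hadoop.data</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
```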

Startup/teardown

> cd $(HADOOP_HOME)
# note: if things are going wrong, you must first physically delete the HDFS data on all nodes
# start a shell on all grappa nodes (assuming that's where HDFS's data lives)
> clush -bw `sinfo -p grappa -o '%N' -h`
> rm -rf /scratch/hadoop.data
> quit
# set up & format HDFS (need to do this the first time)
> bin/hadoop namenode -format
# ssh to master node
> ssh n71.sampa  
# start dfs daemons (should see nameservers & dataservers start up)
# note: this must be called from the master node or else the nameserver will be running in the wrong place
> bin/start-dfs.sh
# shutdown
> bin/stop-dfs.sh

Interact with FS on cmdline

You can't interact with HDFS through the normal filesystem tools, so you have to go through the hadoop executable. Note: it seems to work best to give an "absolute" path for HDFS destinations ("/" refers to the root of HDFS's filesystem).

# 'ls'
> $(HADOOP_HOME)/bin/hadoop dfs -ls /grappa_ckpts
# Copy files into HDFS; they should get distributed across the datanodes
> $(HADOOP_HOME)/bin/hadoop dfs -put <localfile> <dst>
# List the rest of the available filesystem commands
> $(HADOOP_HOME)/bin/hadoop dfs

WebHDFS

  • Allows for access over HTTP
  • Built into hadoop v1.0.3 and integrated with the DFS NameNodes and DataNodes, so no extra servers need to be fired up
  • To enable, add the following to conf/hdfs-site.xml (and restart dfs servers):
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>
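Once enabled, WebHDFS exposes the filesystem through a REST API on the NameNode's HTTP port. A hedged sketch of reading a file with curl (the hostname, port, and file path below are placeholders; check your NameNode's actual HTTP port):

```shell
# Sketch: reading a file over WebHDFS with curl.
NAMENODE=n71.sampa
PORT=50070                       # NameNode HTTP port, not the 8020 RPC port
FILE=/grappa_ckpts/example.dat   # hypothetical example path

URL="http://${NAMENODE}:${PORT}/webhdfs/v1${FILE}?op=OPEN"
echo "$URL"

# -L follows the redirect from the NameNode to the DataNode that
# actually serves the file's blocks. Only attempt the request if the
# namenode hostname resolves on this machine.
if getent hosts "$NAMENODE" >/dev/null 2>&1; then
    curl -i -L -m 5 "$URL"
fi
```

Other useful operations follow the same URL shape, e.g. `op=LISTSTATUS` for a directory listing and `op=GETFILESTATUS` for metadata.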

Fuse DFS

For Hadoop v1.0.3, the code for fuse-dfs can be found in $HADOOP_HOME/src/contrib/fuse-dfs.

Building:

# in $HADOOP_HOME/src/contrib/fuse-dfs
# make sure `$JAVA_HOME` and `$HADOOP_HOME` env. variables are set correctly
> ./configure LDFLAGS="-L$HADOOP_HOME/c++/Linux-amd64-64/lib -L$JAVA_HOME/jre/lib/amd64/server" CFLAGS="-I$HADOOP_HOME/src/c++/libhdfs"
> make PERMS=1
# executable `fuse_dfs` should be built in fuse-dfs/src

Running:

  • Find fuse_dfs_wrapper.sh in fuse-dfs/src and edit the paths in it to reflect your system
  • Find out which port the namenode is listening on. I think the default for v1.0.3 is 8020, but you can also check in the NameNode's log file by searching for this line:
2012-09-07 11:09:28,854 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: n71.sampa/10.1.2.71:8020
> grep -R 'Namenode up at' $HADOOP_HOME/logs/
  • Test out your configuration:
# in $HADOOP_HOME/src/contrib/fuse-dfs/src
> sudo ./fuse_dfs_wrapper.sh dfs://<namenode-hostname>:<namenode-port> <mount-point>
# (for example)
> mkdir /scratch/hdfs
> sudo ./fuse_dfs_wrapper.sh dfs://n71.sampa:8020 /scratch/hdfs
# ignore the warning 'fuse-dfs didn't recognize /scratch/hdfs,-2', it apparently says that no matter what
# check that it's working:
> ls /scratch/hdfs
# I have made some simple scripts to start and stop fuse-dfs nodes when they go down.
# You'll know they've gone down if they say "Transport endpoint is not connected."
# To restart, just ssh to the node and run:
> sudo /sampa/share/hadoop-1.0.3/bin/stop-fuse-dfs.sh
> sudo /sampa/share/hadoop-1.0.3/bin/start-fuse-dfs.sh

Debugging

  • Fuse logs things in /var/log/messages, so check there for messages
    • ERROR: could not connect to n71.sampa:50070 fuse_impls_getattr.c:37 meant that I had the wrong port
  • Input/output error (ls: cannot access /scratch/hdfs: Input/output error)
    • Might have the wrong port. Check the NameNode log for the port (see above)
    • Kill the ./fuse_dfs process
    • Clean up the mounted fs: sudo umount -l /scratch/hdfs (if you don't, you'll get errors that say Transport endpoint is not connected)

Automount

You should be able to add the following to /etc/fstab, provided the wrapper script is on your path and named fuse_dfs.

fuse_dfs#dfs://<namenode>:<port> /mountpoint fuse usertrash,rw 0 0