This comprehensive guide will walk you through installing and configuring an Apache Hadoop cluster, covering both single-node (pseudo-distributed) and multi-node (fully distributed) setups.
- Prerequisites
- Download and Installation
- Environment Configuration
- Hadoop Configuration
- Single-Node Setup (Pseudo-Distributed)
- Multi-Node Setup (Fully Distributed)
- Starting the Cluster
- Verification and Testing
- Common Issues and Troubleshooting
- Useful Commands
- Operating System: Linux, macOS, or Windows (with WSL)
- Java: OpenJDK 8, 11, or 17 (recommended: OpenJDK 11)
- Memory: Minimum 4GB RAM (8GB+ recommended for multi-node)
- Disk Space: Minimum 20GB available space
- Network: SSH access between nodes (for multi-node setup)
- Java Development Kit (JDK)
- SSH server and client
- rsync (for file synchronization)
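Before downloading anything, it can save time to confirm the prerequisites are actually in place. A minimal check (note that `free` is Linux-only; on macOS check memory via Activity Monitor):

# Quick prerequisite check: each command should succeed and print a version
java -version
ssh -V
rsync --version | head -1

# Resource check (RAM and free disk space)
free -h
df -h .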
| Version | Release Date | Binary Download |
|---|---|---|
| 3.4.1 | 2024 Oct 18 | hadoop-3.4.1.tar.gz |
# Download Hadoop 3.4.1
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz
# Verify the download (optional)
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz.sha512
shasum -a 512 -c hadoop-3.4.1.tar.gz.sha512

# Extract the archive
tar -xzf hadoop-3.4.1.tar.gz
# Move to installation directory (optional)
sudo mv hadoop-3.4.1 /opt/hadoop
# OR keep it in your preferred location
mv hadoop-3.4.1 ~/hadoop

Install a JDK for your platform:

# Debian/Ubuntu
sudo apt update
sudo apt install openjdk-11-jdk

# RHEL/CentOS/Fedora
sudo yum install java-11-openjdk-devel

# macOS (Homebrew)
brew install openjdk@11

Add the following to your ~/.bashrc, ~/.zshrc, or ~/.profile:
# Java Environment
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 # Linux
# export JAVA_HOME=/opt/homebrew/Cellar/openjdk@11/11.0.XX/libexec/openjdk.jdk/Contents/Home # macOS
# Hadoop Environment
export HADOOP_HOME=/opt/hadoop # or your installation path
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
# Add Hadoop binaries to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Apply the changes:

source ~/.bashrc # or ~/.zshrc

Hadoop's start scripts use SSH to launch daemons, so set up passwordless SSH to localhost:

# Generate SSH key pair (if not exists)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# Add public key to authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# Set appropriate permissions
chmod 0600 ~/.ssh/authorized_keys
# Test SSH to localhost
ssh localhost

Navigate to the Hadoop configuration directory:

cd $HADOOP_HOME/etc/hadoop

# Edit hadoop-env.sh
vim hadoop-env.sh
# Add or update the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Edit core-site.xml:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://localhost:9000</value>
<description>The default file system URI</description>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/opt/hadoop/tmp</value>
<description>Temporary directory for Hadoop</description>
</property>
</configuration>

Edit hdfs-site.xml:

<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/data/namenode</value>
<description>Directory for namenode metadata</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/opt/hadoop/data/datanode</value>
<description>Directory for datanode data</description>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication</description>
</property>
<property>
<name>dfs.namenode.checkpoint.dir</name>
<value>/opt/hadoop/data/secondary</value>
<description>Secondary namenode checkpoint directory</description>
</property>
</configuration>

Edit mapred-site.xml:

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
<description>MapReduce framework name</description>
</property>
<property>
<name>mapreduce.application.classpath</name>
<value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
</property>
</configuration>

Edit yarn-site.xml:

<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
<description>Auxiliary services for NodeManager</description>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>localhost</value>
<description>ResourceManager hostname</description>
</property>
<property>
<name>yarn.nodemanager.env-whitelist</name>
<value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
</property>
</configuration>

# Create data directories
sudo mkdir -p /opt/hadoop/data/namenode
sudo mkdir -p /opt/hadoop/data/datanode
sudo mkdir -p /opt/hadoop/data/secondary
sudo mkdir -p /opt/hadoop/tmp
# Set appropriate ownership
sudo chown -R $USER:$USER /opt/hadoop/data
sudo chown -R $USER:$USER /opt/hadoop/tmp

Format the NameNode (first time only):

hdfs namenode -format -force

# Start HDFS
start-dfs.sh
# Start YARN
start-yarn.sh
# Or start all services at once
start-all.sh

# Check running processes
jps
# Expected output should include:
# - NameNode
# - DataNode
# - ResourceManager
# - NodeManager
# - SecondaryNameNode

For a multi-node setup you additionally need:

- Multiple machines with Hadoop installed
- Network connectivity between all nodes
- SSH access from the master to all worker nodes
- Same username on all nodes
- Synchronized time across all nodes
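Before changing any configuration, it can help to confirm these prerequisites from the master node. A minimal sketch, assuming the example worker hostnames used later in this guide (worker-node-1 through worker-node-3):

# Hypothetical quick check from the master: SSH reachability, Java version,
# and clock offset on each worker (hostnames are examples)
for host in worker-node-1 worker-node-2 worker-node-3; do
  echo "=== $host ==="
  ssh -o BatchMode=yes "$host" 'hostname && java -version 2>&1 | head -1 && date +%s'
done
# Compare the printed timestamps with `date +%s` on the master; they should
# differ by at most a few seconds (use NTP/chrony to keep clocks in sync).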
On the master node, update core-site.xml so that fs.defaultFS points at the master:

<property>
<name>fs.defaultFS</name>
<value>hdfs://master-node:9000</value>
</property>

Update yarn-site.xml so NodeManagers can find the ResourceManager:

<property>
<name>yarn.resourcemanager.hostname</name>
<value>master-node</value>
</property>

# Edit $HADOOP_HOME/etc/hadoop/workers
vim $HADOOP_HOME/etc/hadoop/workers
# Add worker node hostnames (one per line)
worker-node-1
worker-node-2
worker-node-3

- Copy the entire Hadoop configuration from the master to all worker nodes:

scp -r $HADOOP_HOME/etc/hadoop/ user@worker-node:/opt/hadoop/etc/

- Confirm that hdfs-site.xml on the worker nodes uses the same local data directories, for example:
<property>
<name>dfs.namenode.name.dir</name>
<value>/opt/hadoop/data/namenode</value>
</property>

Update /etc/hosts on all nodes:
# Add entries for all nodes
192.168.1.100 master-node
192.168.1.101 worker-node-1
192.168.1.102 worker-node-2
192.168.1.103 worker-node-3

From the master node:
# Format namenode (first time only)
hdfs namenode -format
# Start the cluster
start-all.sh

Use the convenient control script included in this repository:
# Start the cluster
./hadoop-control.sh start
# Stop the cluster
./hadoop-control.sh stop
# Restart the cluster
./hadoop-control.sh restart
# Check cluster status
./hadoop-control.sh status
# Show help
./hadoop-control.sh help

This script provides:
- ✅ Colored output for easy reading
- ✅ Automatic error checking and validation
- ✅ Smart service detection - knows what's running
- ✅ Web interface links when cluster is healthy
- ✅ Safe start/stop procedures with proper sequencing
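The script itself ships with this repository; purely as an illustration of the kind of status check it performs, a minimal sketch (hypothetical, not the actual hadoop-control.sh) might look like this:

#!/usr/bin/env bash
# Hypothetical sketch of a daemon status check: look for the expected
# Hadoop processes in the output of `jps`.
expected="NameNode DataNode SecondaryNameNode ResourceManager NodeManager"
running=$(jps)
for daemon in $expected; do
  if echo "$running" | grep -q "$daemon"; then
    echo "[OK]      $daemon is running"
  else
    echo "[MISSING] $daemon is not running"
  fi
done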
# Start all Hadoop services
start-all.sh
# Check if all services are running
jps

To start services step by step:

# 1. Start HDFS services (NameNode, DataNode, SecondaryNameNode)
start-dfs.sh
# 2. Start YARN services (ResourceManager, NodeManager)
start-yarn.sh
# 3. Start MapReduce Job History Server (optional)
mapred --daemon start historyserver

To start daemons individually:

# Start NameNode
hdfs --daemon start namenode
# Start DataNode
hdfs --daemon start datanode
# Start SecondaryNameNode
hdfs --daemon start secondarynamenode
# Start ResourceManager
yarn --daemon start resourcemanager
# Start NodeManager
yarn --daemon start nodemanager
# Start Job History Server
mapred --daemon start historyserver

To stop the cluster:

# Stop all Hadoop services
stop-all.sh

To stop services step by step:

# Stop YARN services
stop-yarn.sh
# Stop HDFS services
stop-dfs.sh
# Stop Job History Server
mapred --daemon stop historyserver

To stop daemons individually:

# Stop Job History Server
mapred --daemon stop historyserver
# Stop NodeManager
yarn --daemon stop nodemanager
# Stop ResourceManager
yarn --daemon stop resourcemanager
# Stop SecondaryNameNode
hdfs --daemon stop secondarynamenode
# Stop DataNode
hdfs --daemon stop datanode
# Stop NameNode
hdfs --daemon stop namenode

# Restart all services
stop-all.sh && start-all.sh
# Restart HDFS only
stop-dfs.sh && start-dfs.sh
# Restart YARN only
stop-yarn.sh && start-yarn.sh

# List all Hadoop-related Java processes
jps
# Expected output should include:
# 12345 NameNode
# 12346 DataNode
# 12347 SecondaryNameNode
# 12348 ResourceManager
# 12349 NodeManager
# 12350 JobHistoryServer (if started)

# Check if NameNode is running
hdfs dfsadmin -report
# Check if YARN is running
yarn node -list
# Check cluster health
hdfs dfsadmin -safemode get

# Create your user directory in HDFS
hdfs dfs -mkdir -p /user/$USER
# Create a test directory
hdfs dfs -mkdir /user/$USER/test
# List HDFS root directory
hdfs dfs -ls /
# List your user directory
hdfs dfs -ls /user/$USER
# Check HDFS health
hdfs fsck /

# Create a test file locally
echo "Hello Hadoop World!" > test.txt
echo "This is a test file for Hadoop HDFS" >> test.txt
# Upload file to HDFS
hdfs dfs -put test.txt /user/$USER/
# List files in HDFS
hdfs dfs -ls /user/$USER/
# View file content in HDFS
hdfs dfs -cat /user/$USER/test.txt
# Download file from HDFS
hdfs dfs -get /user/$USER/test.txt downloaded_test.txt
# Verify the downloaded file
cat downloaded_test.txt
# Clean up test files
rm test.txt downloaded_test.txt
hdfs dfs -rm /user/$USER/test.txt

# Create input directory for MapReduce
hdfs dfs -mkdir /input
# Copy Hadoop configuration files as input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /input/
# Run the word count example
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
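# (Optional) While the job runs, you can track it from another terminal;
# the application ID shown here also appears in the ResourceManager web UI.
yarn application -list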
# Check the output
hdfs dfs -ls /output/
hdfs dfs -cat /output/part-r-00000 | head -20
# Clean up
hdfs dfs -rm -r /output
hdfs dfs -rm -r /input

# Run a simple YARN application (Pi calculation)
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100
# Check application history
yarn application -list -appStates ALL

# TestDFSIO Write Test
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 4 -fileSize 128MB
# TestDFSIO Read Test
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -read -nrFiles 4 -fileSize 128MB
# Clean up test data
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -clean

# Open web interfaces (run these commands to get URLs)
echo "NameNode Web UI: http://localhost:9870"
echo "ResourceManager Web UI: http://localhost:8088"
echo "Job History Server: http://localhost:19888"
echo "DataNode Web UI: http://localhost:9864"
echo "NodeManager Web UI: http://localhost:8042"# Check cluster summary
hdfs dfsadmin -report
# Check filesystem
hdfs fsck /
# Monitor YARN applications
yarn top
# Check node status
yarn node -list -showDetails
# List YARN node labels
yarn cluster -lnl

I've created a comprehensive script to manage cluster nodes:
# Make the script executable (if not already)
chmod +x ./manage-cluster-nodes.sh
# Show all available commands
./manage-cluster-nodes.sh help

# List all current nodes
./manage-cluster-nodes.sh list
# Show detailed cluster status
./manage-cluster-nodes.sh status

# Add a new worker node
./manage-cluster-nodes.sh add worker-node-1
# Add node by IP address
./manage-cluster-nodes.sh add 192.168.1.100

# Step 1: Safely decommission the node
./manage-cluster-nodes.sh decommission worker-node-1
# Step 2: Remove from cluster (after decommissioning completes)
./manage-cluster-nodes.sh remove worker-node-1

# Bring back a decommissioned node
./manage-cluster-nodes.sh recommission worker-node-1

# Convert single-node to multi-node setup
./manage-cluster-nodes.sh convert-multi
# Convert multi-node back to single-node
./manage-cluster-nodes.sh convert-single

# Backup current configuration
./manage-cluster-nodes.sh backup
# Restore from backup
./manage-cluster-nodes.sh restore

Example workflow for adding a new node (worker-node-2):

1. Add node to cluster configuration:

   ./manage-cluster-nodes.sh add worker-node-2

2. Set up SSH passwordless access:

   # Copy SSH key to new node
   ssh-copy-id user@worker-node-2
   # Test SSH access
   ssh worker-node-2

3. Install Hadoop on the new node:

   # Copy Hadoop installation to new node
   scp -r $HADOOP_HOME user@worker-node-2:/opt/

4. Copy configuration files:

   # Copy configuration to new node
   scp -r $HADOOP_HOME/etc/hadoop/* user@worker-node-2:$HADOOP_HOME/etc/hadoop/

5. Refresh cluster nodes:

   # Refresh YARN nodes
   yarn rmadmin -refreshNodes
   # Refresh HDFS nodes
   hdfs dfsadmin -refreshNodes

6. Start services on the new node:

   # On the new worker node, start DataNode and NodeManager
   ssh worker-node-2 "$HADOOP_HOME/bin/hdfs --daemon start datanode"
   ssh worker-node-2 "$HADOOP_HOME/bin/yarn --daemon start nodemanager"

Example workflow for removing a node (worker-node-2):

1. Decommission the node:

   ./manage-cluster-nodes.sh decommission worker-node-2

2. Monitor decommissioning progress:

   # Check HDFS decommissioning status
   hdfs dfsadmin -report
   # Check YARN node status
   yarn node -list -all

3. Wait for decommissioning to complete (data blocks are moved to other nodes).

4. Remove the node:

   ./manage-cluster-nodes.sh remove worker-node-2

5. Stop services on the removed node:

   ssh worker-node-2 "$HADOOP_HOME/bin/yarn --daemon stop nodemanager"
   ssh worker-node-2 "$HADOOP_HOME/bin/hdfs --daemon stop datanode"
The script automatically manages these files:
- `$HADOOP_HOME/etc/hadoop/workers` - List of worker nodes
- `$HADOOP_HOME/etc/hadoop/core-site.xml` - Core configuration
- `$HADOOP_HOME/etc/hadoop/yarn-site.xml` - YARN configuration
- `$HADOOP_HOME/etc/hadoop/dfs.exclude` - HDFS decommission list
- `$HADOOP_HOME/etc/hadoop/yarn.exclude` - YARN decommission list
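If you ever need to decommission a node by hand instead of through the script, the exclude files are what drive it. A rough sketch, assuming hdfs-site.xml's dfs.hosts.exclude and yarn-site.xml's yarn.resourcemanager.nodes.exclude-path already point at the files listed above:

# Manual decommission sketch (worker-node-2 is an example hostname)
echo "worker-node-2" >> $HADOOP_HOME/etc/hadoop/dfs.exclude
echo "worker-node-2" >> $HADOOP_HOME/etc/hadoop/yarn.exclude

# Tell the NameNode and ResourceManager to re-read the exclude lists
hdfs dfsadmin -refreshNodes
yarn rmadmin -refreshNodes

# Watch until the node is reported as Decommissioned
hdfs dfsadmin -report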
All configuration changes are automatically backed up to:
$HADOOP_HOME/backups/hadoop_config_YYYYMMDD_HHMMSS.tar.gz
You can restore any backup using:
./manage-cluster-nodes.sh restore

For a complete quick reference guide, see: CLUSTER-MANAGEMENT.md
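If you need to inspect a configuration backup by hand (for example, when the script is unavailable), the archives are plain tarballs; a sketch using the backup path shown above with an example filename:

# List available configuration backups
ls -lt $HADOOP_HOME/backups/

# Inspect an archive's layout before restoring (filename is an example)
tar -tzf $HADOOP_HOME/backups/hadoop_config_20250101_120000.tar.gz

# Extract to a scratch directory and copy the files you need back into
# $HADOOP_HOME/etc/hadoop/ by hand
mkdir -p /tmp/hadoop-config-restore
tar -xzf $HADOOP_HOME/backups/hadoop_config_20250101_120000.tar.gz -C /tmp/hadoop-config-restore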
# Check HDFS status
hdfs dfsadmin -report
# Check YARN nodes
yarn node -list
# Check running processes
jps

Web interfaces:

- HDFS NameNode: http://localhost:9870
- YARN ResourceManager: http://localhost:8088
- MapReduce Job History: http://localhost:19888
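On a headless server you can verify the same endpoints from the shell. The /jmx endpoint below is standard on Hadoop daemons, though the ports assume the defaults configured in this guide:

# Quick reachability check of the default web UI ports
# (NameNode 9870, ResourceManager 8088, Job History 19888)
for port in 9870 8088 19888; do
  echo -n "localhost:$port -> HTTP "
  curl -s -o /dev/null -w "%{http_code}\n" "http://localhost:$port/"
done

# The NameNode also exposes metrics as JSON over its JMX endpoint
curl -s "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem" | head -20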
# Create directories in HDFS
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/$USER
# List HDFS contents
hdfs dfs -ls /
# Copy file to HDFS
hdfs dfs -put /path/to/local/file /user/$USER/
# Copy file from HDFS
hdfs dfs -get /user/$USER/file /path/to/local/

# Create input directory
hdfs dfs -mkdir /input
# Copy input files
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /input
# Run word count example
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.1.jar wordcount /input /output
# Check output
hdfs dfs -cat /output/part-r-00000

If Hadoop cannot find Java (JAVA_HOME errors):

# Verify Java installation
java -version
echo $JAVA_HOME
# Ensure JAVA_HOME is set in hadoop-env.sh

If SSH connections fail:

# Test SSH connectivity
ssh localhost
ssh worker-node-1
# Check SSH key setup
ls -la ~/.ssh/

If you hit permission errors:

# Fix directory permissions
sudo chown -R $USER:$USER $HADOOP_HOME
chmod 755 $HADOOP_HOME/data/*

If a port is already in use:

# Check what's using the port
netstat -tulpn | grep :9000
lsof -i :9000
# Kill the process if necessary
kill -9 <PID>

If the DataNode fails to start:

# Check logs
tail -f $HADOOP_HOME/logs/hadoop-*-datanode-*.log
# Common solution: remove the DataNode data and reformat (WARNING: this deletes all HDFS data)
stop-all.sh
rm -rf $HADOOP_HOME/data/datanode/*
hdfs namenode -format -force
start-all.sh

# Hadoop logs directory
$HADOOP_HOME/logs/
# Important log files
hadoop-*-namenode-*.log
hadoop-*-datanode-*.log
yarn-*-resourcemanager-*.log
yarn-*-nodemanager-*.log

HDFS administration:

# File system check
hdfs fsck /
# Safe mode operations
hdfs dfsadmin -safemode leave
hdfs dfsadmin -safemode enter
# Balance cluster
hdfs balancer
# Decommission nodes
hdfs dfsadmin -refreshNodes

YARN application and node management:

# List applications
yarn application -list
# Kill application
yarn application -kill <application_id>
# Node management
yarn node -list -all
yarn rmadmin -refreshNodes

Cluster health and monitoring:

# Check cluster health
hdfs dfsadmin -report
yarn node -list -showDetails
# Monitor cluster
hdfs dfsadmin -printTopology
yarn top

# Check disk usage
hdfs dfs -du -h /
# Monitor system resources
top
htop
iostat -x 1

For issues and questions:
- Check the Hadoop Troubleshooting Guide
- Visit Apache Hadoop User Mailing List
- Submit issues to Apache Hadoop JIRA
Note: This guide is based on Hadoop 3.4.1. Configuration may vary slightly for different versions.