Complete Apache Hadoop 3.4.1 cluster installation and management toolkit with automated scripts, comprehensive documentation, and production-ready configuration templates for single-node and multi-node deployments.


Apache Hadoop Cluster Installation Guide

This comprehensive guide will walk you through installing and configuring an Apache Hadoop cluster, covering both single-node (pseudo-distributed) and multi-node (fully distributed) setups.

Table of Contents

  1. Prerequisites
  2. Download and Installation
  3. Environment Configuration
  4. Hadoop Configuration
  5. Single-Node Setup (Pseudo-Distributed)
  6. Multi-Node Setup (Fully Distributed)
  7. Starting and Stopping the Cluster
  8. Verification and Testing
  9. Troubleshooting
  10. Useful Commands

Prerequisites

System Requirements

  • Operating System: Linux, macOS, or Windows (with WSL)
  • Java: OpenJDK 8 or 11 (OpenJDK 11 recommended; Hadoop 3.4.x is not yet certified on newer JDKs such as 17)
  • Memory: Minimum 4GB RAM (8GB+ recommended for multi-node)
  • Disk Space: Minimum 20GB available space
  • Network: SSH access between nodes (for multi-node setup)

Required Software

  • Java Development Kit (JDK)
  • SSH server and client
  • rsync (for file synchronization)
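
A quick way to confirm these are already installed before continuing (a minimal check; exact package names and install commands vary by distribution):

# Verify the required tools are available on this machine
java -version                  # should report OpenJDK 8 or 11
ssh -V                         # OpenSSH client
rsync --version | head -n 1    # rsync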

Download and Installation

Step 1: Download Hadoop

Version: 3.4.1
Release date: 18 Oct 2024
Binary: hadoop-3.4.1.tar.gz

# Download Hadoop 3.4.1
wget https://archive.apache.org/dist/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz

# Verify the download (optional)
wget https://downloads.apache.org/hadoop/common/hadoop-3.4.1/hadoop-3.4.1.tar.gz.sha512
shasum -a 512 -c hadoop-3.4.1.tar.gz.sha512

Step 2: Extract and Install

# Extract the archive
tar -xzf hadoop-3.4.1.tar.gz

# Move to installation directory (optional)
sudo mv hadoop-3.4.1 /opt/hadoop
# OR keep it in your preferred location
mv hadoop-3.4.1 ~/hadoop

Environment Configuration

Step 1: Install Java (if not already installed)

On Ubuntu/Debian:

sudo apt update
sudo apt install openjdk-11-jdk

On CentOS/RHEL:

sudo yum install java-11-openjdk-devel

On macOS:

brew install openjdk@11

Step 2: Configure Environment Variables

Add the following to your ~/.bashrc, ~/.zshrc, or ~/.profile:

# Java Environment
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64  # Linux
# export JAVA_HOME=/opt/homebrew/Cellar/openjdk@11/11.0.XX/libexec/openjdk.jdk/Contents/Home  # macOS

# Hadoop Environment
export HADOOP_HOME=/opt/hadoop  # or your installation path
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME

# Add Hadoop binaries to PATH
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

Apply the changes:

source ~/.bashrc  # or ~/.zshrc

Step 3: Configure SSH (Required for cluster operations)

# Generate SSH key pair (if not exists)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

# Add public key to authorized_keys
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

# Set appropriate permissions
chmod 0600 ~/.ssh/authorized_keys

# Test SSH to localhost
ssh localhost

Hadoop Configuration

Navigate to the Hadoop configuration directory:

cd $HADOOP_HOME/etc/hadoop

Step 1: Configure hadoop-env.sh

# Edit hadoop-env.sh
vim hadoop-env.sh

# Add or update the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

Step 2: Configure Core Components

core-site.xml

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
        <description>The default file system URI</description>
    </property>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/opt/hadoop/tmp</value>
        <description>Temporary directory for Hadoop</description>
    </property>
</configuration>

hdfs-site.xml

<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/data/namenode</value>
        <description>Directory for namenode metadata</description>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/data/datanode</value>
        <description>Directory for datanode data</description>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication</description>
    </property>
    <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>/opt/hadoop/data/secondary</value>
        <description>Secondary namenode checkpoint directory</description>
    </property>
</configuration>

mapred-site.xml

<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
        <description>MapReduce framework name</description>
    </property>
    <property>
        <name>mapreduce.application.classpath</name>
        <value>$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*</value>
    </property>
</configuration>

yarn-site.xml

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
        <description>Auxiliary services for NodeManager</description>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>localhost</value>
        <description>ResourceManager hostname</description>
    </property>
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME</value>
    </property>
</configuration>

Step 3: Create Required Directories

# Create data directories
sudo mkdir -p /opt/hadoop/data/namenode
sudo mkdir -p /opt/hadoop/data/datanode
sudo mkdir -p /opt/hadoop/data/secondary
sudo mkdir -p /opt/hadoop/tmp

# Set appropriate ownership
sudo chown -R $USER:$USER /opt/hadoop/data
sudo chown -R $USER:$USER /opt/hadoop/tmp

Single-Node Setup (Pseudo-Distributed)

Format the NameNode (First-time setup only)

hdfs namenode -format -force

Start Hadoop Services

# Start HDFS
start-dfs.sh

# Start YARN
start-yarn.sh

# Or start all services at once
start-all.sh

Verify the Installation

# Check running processes
jps

# Expected output should include:
# - NameNode
# - DataNode
# - ResourceManager
# - NodeManager
# - SecondaryNameNode
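
If you prefer a scripted check, the small loop below (a convenience sketch, not part of the Hadoop distribution) confirms that each expected daemon shows up in the jps output:

# Report which core daemons are currently running
for daemon in NameNode DataNode SecondaryNameNode ResourceManager NodeManager; do
    jps | grep -qw "$daemon" && echo "$daemon: running" || echo "$daemon: NOT running"
done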

Multi-Node Setup (Fully Distributed)

Prerequisites for Multi-Node

  1. Multiple machines with Hadoop installed
  2. Network connectivity between all nodes
  3. Passwordless SSH access from the master to every worker node
  4. The same username on all nodes
  5. Synchronized clocks across all nodes (see the quick check below)
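
Items 3 and 5 can be checked from the master node before you continue. A minimal verification loop (assuming systemd-based Linux workers and the hostnames used later in this guide):

# Confirm passwordless SSH works and clocks are NTP-synchronized on every worker
for node in worker-node-1 worker-node-2 worker-node-3; do
    echo "== $node =="
    ssh -o BatchMode=yes "$node" 'hostname; date; timedatectl show -p NTPSynchronized'
done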

Step 1: Configure Master Node

Update core-site.xml on master:

<property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-node:9000</value>
</property>

Update yarn-site.xml on master:

<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master-node</value>
</property>

Configure workers file:

# Edit $HADOOP_HOME/etc/hadoop/workers
vim $HADOOP_HOME/etc/hadoop/workers

# Add worker node hostnames (one per line)
worker-node-1
worker-node-2
worker-node-3

Step 2: Configure Worker Nodes

  1. Copy the entire Hadoop configuration from the master to every worker node:

    scp -r $HADOOP_HOME/etc/hadoop/ user@worker-node:/opt/hadoop/etc/
  2. Verify that hdfs-site.xml on each worker defines the local DataNode storage directory (the copied core-site.xml already points the workers at the master through fs.defaultFS):

    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/data/datanode</value>
    </property>

Step 3: Network Configuration

Update /etc/hosts on all nodes:

# Add entries for all nodes
192.168.1.100   master-node
192.168.1.101   worker-node-1
192.168.1.102   worker-node-2
192.168.1.103   worker-node-3
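
After updating /etc/hosts, a quick reachability check from the master confirms that every entry resolves and responds (a simple sketch using the hostnames above; the -W timeout flag shown is Linux ping syntax):

# Ping each node once with a 2-second timeout
for node in master-node worker-node-1 worker-node-2 worker-node-3; do
    ping -c 1 -W 2 "$node" > /dev/null && echo "$node: reachable" || echo "$node: UNREACHABLE"
done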

Step 4: Start Multi-Node Cluster

From the master node:

# Format namenode (first time only)
hdfs namenode -format

# Start the cluster
start-all.sh

Starting and Stopping the Cluster

🎯 Quick Start (Recommended)

Use the convenient control script included in this repository:

# Start the cluster
./hadoop-control.sh start

# Stop the cluster  
./hadoop-control.sh stop

# Restart the cluster
./hadoop-control.sh restart

# Check cluster status
./hadoop-control.sh status

# Show help
./hadoop-control.sh help

This script provides:

  • Colored output for easy reading
  • Automatic error checking and validation
  • Smart service detection - knows what's running
  • Web interface links when cluster is healthy
  • Safe start/stop procedures with proper sequencing

🚀 Manual Service Management

Method 1: Start All Services at Once

# Start all Hadoop services
start-all.sh

# Check if all services are running
jps

Method 2: Start Services Individually

# 1. Start HDFS services (NameNode, DataNode, SecondaryNameNode)
start-dfs.sh

# 2. Start YARN services (ResourceManager, NodeManager)
start-yarn.sh

# 3. Start MapReduce Job History Server (optional)
mapred --daemon start historyserver

Method 3: Start Services One by One

# Start NameNode
hdfs --daemon start namenode

# Start DataNode
hdfs --daemon start datanode

# Start SecondaryNameNode
hdfs --daemon start secondarynamenode

# Start ResourceManager
yarn --daemon start resourcemanager

# Start NodeManager
yarn --daemon start nodemanager

# Start Job History Server
mapred --daemon start historyserver

⏹️ Stopping Hadoop Services

Stop All Services

# Stop all Hadoop services
stop-all.sh

Stop Services Individually

# Stop YARN services
stop-yarn.sh

# Stop HDFS services
stop-dfs.sh

# Stop Job History Server
mapred --daemon stop historyserver

Stop Services One by One

# Stop Job History Server
mapred --daemon stop historyserver

# Stop NodeManager
yarn --daemon stop nodemanager

# Stop ResourceManager
yarn --daemon stop resourcemanager

# Stop SecondaryNameNode
hdfs --daemon stop secondarynamenode

# Stop DataNode
hdfs --daemon stop datanode

# Stop NameNode
hdfs --daemon stop namenode

🔄 Restart Services

# Restart all services
stop-all.sh && start-all.sh

# Restart HDFS only
stop-dfs.sh && start-dfs.sh

# Restart YARN only
stop-yarn.sh && start-yarn.sh

✅ Verify Services are Running

Check Running Java Processes

# List all Hadoop-related Java processes
jps

# Expected output should include:
# 12345 NameNode
# 12346 DataNode
# 12347 SecondaryNameNode
# 12348 ResourceManager
# 12349 NodeManager
# 12350 JobHistoryServer (if started)

Check Specific Service Status

# Check if NameNode is running
hdfs dfsadmin -report

# Check if YARN is running
yarn node -list

# Check cluster health
hdfs dfsadmin -safemode get

🧪 Testing Hadoop Installation

Test 1: Basic HDFS Operations

# Create your user directory in HDFS
hdfs dfs -mkdir -p /user/$USER

# Create a test directory
hdfs dfs -mkdir /user/$USER/test

# List HDFS root directory
hdfs dfs -ls /

# List your user directory
hdfs dfs -ls /user/$USER

# Check HDFS health
hdfs fsck /

Test 2: File Upload and Download

# Create a test file locally
echo "Hello Hadoop World!" > test.txt
echo "This is a test file for Hadoop HDFS" >> test.txt

# Upload file to HDFS
hdfs dfs -put test.txt /user/$USER/

# List files in HDFS
hdfs dfs -ls /user/$USER/

# View file content in HDFS
hdfs dfs -cat /user/$USER/test.txt

# Download file from HDFS
hdfs dfs -get /user/$USER/test.txt downloaded_test.txt

# Verify the downloaded file
cat downloaded_test.txt

# Clean up test files
rm test.txt downloaded_test.txt
hdfs dfs -rm /user/$USER/test.txt

Test 3: Run Sample MapReduce Job

# Create input directory for MapReduce
hdfs dfs -mkdir /input

# Copy Hadoop configuration files as input
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /input/

# Run the word count example
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output

# Check the output
hdfs dfs -ls /output/
hdfs dfs -cat /output/part-r-00000 | head -20

# Clean up
hdfs dfs -rm -r /output
hdfs dfs -rm -r /input

Test 4: YARN Application Test

# Run a simple YARN application (Pi calculation)
yarn jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 100

# Check application history
yarn application -list -appStates ALL

Test 5: Performance Benchmark Tests

# TestDFSIO Write Test
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -write -nrFiles 4 -fileSize 128MB

# TestDFSIO Read Test
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -read -nrFiles 4 -fileSize 128MB

# Clean up test data
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-*-tests.jar TestDFSIO -clean

📊 Monitoring and Health Checks

Web Interface Access

# Open web interfaces (run these commands to get URLs)
echo "NameNode Web UI: http://localhost:9870"
echo "ResourceManager Web UI: http://localhost:8088"
echo "Job History Server: http://localhost:19888"
echo "DataNode Web UI: http://localhost:9864"
echo "NodeManager Web UI: http://localhost:8042"

Command Line Monitoring

# Check cluster summary
hdfs dfsadmin -report

# Check filesystem
hdfs fsck /

# Monitor YARN applications
yarn top

# Check node status
yarn node -list -showDetails

# List YARN node labels configured in the cluster
yarn cluster -lnl

🔧 Cluster Node Management

Node Management Script

This repository includes a comprehensive script for managing cluster nodes:

# Make the script executable (if not already)
chmod +x ./manage-cluster-nodes.sh

# Show all available commands
./manage-cluster-nodes.sh help

Key Features

📋 Node Information

# List all current nodes
./manage-cluster-nodes.sh list

# Show detailed cluster status
./manage-cluster-nodes.sh status

Adding Nodes

# Add a new worker node
./manage-cluster-nodes.sh add worker-node-1

# Add node by IP address
./manage-cluster-nodes.sh add 192.168.1.100

Removing Nodes (Safe Process)

# Step 1: Safely decommission the node
./manage-cluster-nodes.sh decommission worker-node-1

# Step 2: Remove from cluster (after decommissioning completes)
./manage-cluster-nodes.sh remove worker-node-1

🔄 Node Management

# Bring back a decommissioned node
./manage-cluster-nodes.sh recommission worker-node-1

🏗️ Cluster Conversion

# Convert single-node to multi-node setup
./manage-cluster-nodes.sh convert-multi

# Convert multi-node back to single-node
./manage-cluster-nodes.sh convert-single

💾 Configuration Backup & Restore

# Backup current configuration
./manage-cluster-nodes.sh backup

# Restore from backup
./manage-cluster-nodes.sh restore

Adding a New Worker Node - Complete Process

  1. Add node to cluster configuration:

    ./manage-cluster-nodes.sh add worker-node-2
  2. Set up SSH passwordless access:

    # Copy SSH key to new node
    ssh-copy-id user@worker-node-2
    
    # Test SSH access
    ssh worker-node-2
  3. Install Hadoop on the new node:

    # Copy Hadoop installation to new node
    scp -r $HADOOP_HOME user@worker-node-2:/opt/
  4. Copy configuration files:

    # Copy configuration to new node
    scp -r $HADOOP_HOME/etc/hadoop/* user@worker-node-2:$HADOOP_HOME/etc/hadoop/
  5. Refresh cluster nodes:

    # Refresh YARN nodes
    yarn rmadmin -refreshNodes
    
    # Refresh HDFS nodes
    hdfs dfsadmin -refreshNodes
  6. Start services on new node:

    # On the new worker node, start DataNode and NodeManager
    ssh worker-node-2 "$HADOOP_HOME/bin/hdfs --daemon start datanode"
    ssh worker-node-2 "$HADOOP_HOME/bin/yarn --daemon start nodemanager"

Safely Removing a Node - Complete Process

  1. Decommission the node:

    ./manage-cluster-nodes.sh decommission worker-node-2
  2. Monitor decommissioning progress:

    # Check HDFS decommissioning status
    hdfs dfsadmin -report
    
    # Check YARN node status
    yarn node -list -all
  3. Wait for decommissioning to complete (data blocks are re-replicated to the remaining nodes; see the monitoring loop after this list)

  4. Remove the node:

    ./manage-cluster-nodes.sh remove worker-node-2
  5. Stop services on the removed node:

    ssh worker-node-2 "$HADOOP_HOME/bin/yarn --daemon stop nodemanager"
    ssh worker-node-2 "$HADOOP_HOME/bin/hdfs --daemon stop datanode"

Configuration Files Modified

The script automatically manages these files:

  • $HADOOP_HOME/etc/hadoop/workers - List of worker nodes
  • $HADOOP_HOME/etc/hadoop/core-site.xml - Core configuration
  • $HADOOP_HOME/etc/hadoop/yarn-site.xml - YARN configuration
  • $HADOOP_HOME/etc/hadoop/dfs.exclude - HDFS decommission list
  • $HADOOP_HOME/etc/hadoop/yarn.exclude - YARN decommission list
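
The two exclude files only take effect if HDFS and YARN are told where to find them. If you maintain the configuration by hand instead of through the script, the wiring looks like this (paths assumed to match the list above):

<!-- hdfs-site.xml -->
<property>
    <name>dfs.hosts.exclude</name>
    <value>/opt/hadoop/etc/hadoop/dfs.exclude</value>
</property>

<!-- yarn-site.xml -->
<property>
    <name>yarn.resourcemanager.nodes.exclude-path</name>
    <value>/opt/hadoop/etc/hadoop/yarn.exclude</value>
</property>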

Backup and Recovery

All configuration changes are automatically backed up to:

$HADOOP_HOME/backups/hadoop_config_YYYYMMDD_HHMMSS.tar.gz

You can restore any backup using:

./manage-cluster-nodes.sh restore
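
If the script is not available on a node, an equivalent backup and restore can be done by hand (a minimal sketch; it simply archives the active configuration directory, so the layout may differ from the script's own archives):

# Create a timestamped archive of the current configuration
mkdir -p $HADOOP_HOME/backups
tar -czf $HADOOP_HOME/backups/hadoop_config_$(date +%Y%m%d_%H%M%S).tar.gz -C $HADOOP_HOME/etc hadoop

# Unpack a chosen archive back over the configuration directory
tar -xzf $HADOOP_HOME/backups/hadoop_config_YYYYMMDD_HHMMSS.tar.gz -C $HADOOP_HOME/etc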

Quick Reference

For a complete quick reference guide, see: CLUSTER-MANAGEMENT.md


Verification and Testing

Check Cluster Status

# Check HDFS status
hdfs dfsadmin -report

# Check YARN nodes
yarn node -list

# Check running processes
jps

Web Interfaces

  • NameNode: http://localhost:9870
  • ResourceManager: http://localhost:8088
  • Job History Server: http://localhost:19888
  • DataNode: http://localhost:9864
  • NodeManager: http://localhost:8042

Basic HDFS Operations

# Create directories in HDFS
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/$USER

# List HDFS contents
hdfs dfs -ls /

# Copy file to HDFS
hdfs dfs -put /path/to/local/file /user/$USER/

# Copy file from HDFS
hdfs dfs -get /user/$USER/file /path/to/local/

Run Sample MapReduce Job

# Create input directory
hdfs dfs -mkdir /input

# Copy input files
hdfs dfs -put $HADOOP_HOME/etc/hadoop/*.xml /input

# Run word count example
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.1.jar wordcount /input /output

# Check output
hdfs dfs -cat /output/part-r-00000

Troubleshooting

Common Issues and Solutions

1. Java-related Errors

# Verify Java installation
java -version
echo $JAVA_HOME

# Ensure JAVA_HOME is set in hadoop-env.sh

2. SSH Connection Issues

# Test SSH connectivity
ssh localhost
ssh worker-node-1

# Check SSH key setup
ls -la ~/.ssh/

3. Permission Denied Errors

# Fix directory permissions
sudo chown -R $USER:$USER $HADOOP_HOME
chmod 755 $HADOOP_HOME/data/*

4. Port Already in Use

# Check what's using the port
netstat -tulpn | grep :9000
lsof -i :9000

# Kill the process if necessary
kill -9 <PID>

5. DataNode Not Starting

# Check logs
tail -f $HADOOP_HOME/logs/hadoop-*-datanode-*.log

# Common cause: a DataNode clusterID mismatch after the NameNode was reformatted
# WARNING: the steps below delete all HDFS data; use them only on a fresh or test cluster
stop-all.sh
rm -rf $HADOOP_HOME/data/datanode/*
hdfs namenode -format -force
start-all.sh

Log Files Location

# Hadoop logs directory
$HADOOP_HOME/logs/

# Important log files
hadoop-*-namenode-*.log
hadoop-*-datanode-*.log
yarn-*-resourcemanager-*.log
yarn-*-nodemanager-*.log
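
When a daemon fails to start, scanning the newest log entries for errors is usually the fastest first step (a simple convenience, not a Hadoop command):

# Show the most recent ERROR/Exception lines across all daemon logs
grep -iE "error|exception" $HADOOP_HOME/logs/*.log | tail -n 20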

Useful Commands

HDFS Commands

# File system check
hdfs fsck /

# Safe mode operations
hdfs dfsadmin -safemode leave
hdfs dfsadmin -safemode enter

# Balance cluster
hdfs balancer

# Decommission nodes
hdfs dfsadmin -refreshNodes

YARN Commands

# List applications
yarn application -list

# Kill application
yarn application -kill <application_id>

# Node management
yarn node -list -all
yarn rmadmin -refreshNodes

Cluster Administration

# Check cluster health
hdfs dfsadmin -report
yarn node -list -showDetails

# Monitor cluster
hdfs dfsadmin -printTopology
yarn top

Performance Monitoring

# Check disk usage
hdfs dfs -du -h /

# Monitor system resources
top
htop
iostat -x 1

Support

For issues and questions, please open an issue in this repository.


Note: This guide is based on Hadoop 3.4.1. Configuration may vary slightly for different versions.
