Big Data analysis and distributed computing using Hadoop MapReduce and Apache Spark
Documentation • Features • Installation • Performance Analysis
This project demonstrates advanced Big Data processing techniques using two industry-standard frameworks: Hadoop MapReduce and Apache Spark. The work focuses on solving real-world retail data problems by leveraging the specific strengths of each distributed computing paradigm.
The repository showcases:
- Hadoop MapReduce for fault-tolerant, disk-based batch processing
- Apache Spark for fast, in-memory iterative computations
- Performance comparison between single-node and cluster deployments
| Technology | Purpose | Version |
|---|---|---|
| Hadoop MapReduce | Distributed batch processing | 3.3.6 |
| Apache Spark | In-memory data processing | Latest |
| Java | Implementation language | JDK 8+ |
| Docker | Cluster containerization | Latest |
| HDFS | Distributed file system | 3.3.6 |
| YARN | Resource management | 3.3.6 |
Online retail stores need to identify and reward their most loyal customers. This exercise implements a customer loyalty ranking system that analyzes transaction data to classify customers based on:
- Total sales volume
- Number of purchases in the current year
- Customer segment classification
| Component | Class | Responsibility |
|---|---|---|
| Driver | DriverCustomerLoyalty.java | Job configuration, input/output management, cluster execution |
| Mapper | MapperCustomerLoyalty.java | Data extraction, key-value pair emission (CustomerID → transaction data) |
| Reducer | ReducerCustomerLoyalty.java | Data aggregation, loyalty metrics calculation |
| Custom Writable | CustomerDataWritable.java | Efficient serialization of customer transaction details (sketched below) |
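For orientation, here is a minimal sketch of what a custom Writable along the lines of CustomerDataWritable.java could look like; the field names (segment, sales, orderYear) are illustrative assumptions rather than the project's actual fields.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical sketch of a custom Writable carrying one transaction's details.
// Field names (segment, sales, orderYear) are illustrative assumptions.
public class CustomerDataWritable implements Writable {

    private String segment;   // customer segment, e.g. "Home Office"
    private double sales;     // sale amount of the transaction
    private int orderYear;    // year of the order, used to count current-year purchases

    // A no-arg constructor is required so Hadoop can instantiate the object via reflection.
    public CustomerDataWritable() {}

    public CustomerDataWritable(String segment, double sales, int orderYear) {
        this.segment = segment;
        this.sales = sales;
        this.orderYear = orderYear;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(segment);
        out.writeDouble(sales);
        out.writeInt(orderYear);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        segment = in.readUTF();
        sales = in.readDouble();
        orderYear = in.readInt();
    }

    public String getSegment() { return segment; }
    public double getSales()   { return sales; }
    public int getOrderYear()  { return orderYear; }
}
```

Implementing write() and readFields() symmetrically is what allows Hadoop to serialize the object compactly when shuffling map output to the reducer.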
Numerical Summarization Pattern:
- Groups records by CustomerID (key field)
- Calculates numerical aggregates (total sales, purchase count)
- Produces consolidated customer loyalty metrics (a Mapper/Reducer sketch follows this list)
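For illustration, a minimal Mapper/Reducer pair implementing this pattern might look like the sketch below. The CSV column positions and the use of plain Text values between the phases (instead of the project's CustomerDataWritable) are simplifying assumptions.

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical Mapper: emits CustomerID -> "segment,sales,year" for every CSV record.
// Column positions are assumptions about the input layout, not the actual store3.csv schema.
class LoyaltyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        if (fields[0].equals("CustomerID")) return;            // skip header row
        String customerId = fields[0];
        String segment    = fields[1];
        String sales      = fields[2];
        String year       = fields[3];
        context.write(new Text(customerId), new Text(segment + "," + sales + "," + year));
    }
}

// Hypothetical Reducer: aggregates per-customer totals (Numerical Summarization).
class LoyaltyReducer extends Reducer<Text, Text, Text, Text> {
    private static final int CURRENT_YEAR = 2024;              // illustrative constant

    @Override
    protected void reduce(Text customerId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String segment = "";
        int orderCount = 0, purchasesThisYear = 0;
        double totalSales = 0.0;
        for (Text v : values) {
            String[] parts = v.toString().split(",");
            segment = parts[0];
            totalSales += Double.parseDouble(parts[1]);
            orderCount++;
            if (Integer.parseInt(parts[2]) == CURRENT_YEAR) purchasesThisYear++;
        }
        context.write(customerId,
                new Text(segment + "," + orderCount + "," + totalSales + "," + purchasesThisYear));
    }
}
```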
- Single Reducer Configuration: the dataset size is manageable by a single node, avoiding distributed overhead (reflected in the driver outline below)
- No Combiners Used: straightforward aggregations don't benefit from intermediate combining
- Efficient Data Serialization: a custom Writable class minimizes network overhead
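These choices typically surface in the driver configuration. The outline below is a hypothetical sketch, not the project's DriverCustomerLoyalty.java verbatim; the output key/value types and paths are assumptions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical driver outline: class names match the report, the wiring is illustrative.
public class DriverCustomerLoyalty {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "customer-loyalty");
        job.setJarByClass(DriverCustomerLoyalty.class);

        job.setMapperClass(MapperCustomerLoyalty.class);
        job.setReducerClass(ReducerCustomerLoyalty.class);

        job.setNumReduceTasks(1);   // single reducer: the dataset fits comfortably on one node
        // No combiner is configured: the aggregation gains little from intermediate combining.

        job.setOutputKeyClass(Text.class);      // assumed output types
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /input
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Calling job.setNumReduceTasks(1) funnels all aggregation to one reducer, and simply omitting job.setCombinerClass(...) keeps the pipeline free of intermediate combining.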
CustomerID,CustomerSegment,OrderCount,TotalSales,NumberOfPurchasesCurrentYear
11,Home Office,1,211.15,1
14,Small Business,1,1214.93,4
21,Small Business,1,3084.04,3
Identify regions with the highest and lowest sales figures to inform strategic business decisions, inventory management, and regional marketing strategies.
| Component | Responsibility |
|---|---|
| RegionSaleAnalysisDriver | Main Spark application coordinating the entire data flow |
| RDD Transformations | textFile → filter → mapToPair → reduceByKey |
| RDD Actions | max and min with custom serializable comparators |
- Data Loading: Read CSV into RDD of strings
- Header Filtering: Remove header row for accurate processing
- Key-Value Mapping: Transform to (Region, Sales) pairs
- Aggregation: reduceByKey to sum sales per region
- Extremes Detection: Parallel max and min operations
- Result Persistence: Save results to the distributed filesystem (the full pipeline is sketched below)
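A minimal Java sketch of this pipeline is shown below. It is an outline under assumptions (CSV column positions, output layout), not the project's actual RegionSaleAnalysisDriver; the key detail is that the comparator handed to max/min must be Serializable so Spark can ship it to the executors.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.Comparator;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

// Hypothetical outline of the RDD pipeline; column positions in the CSV are assumptions.
public class RegionSaleAnalysisSketch {

    // Comparator used by max()/min() must be Serializable to be shipped to executors.
    static class SalesComparator implements Comparator<Tuple2<String, Double>>, Serializable {
        @Override
        public int compare(Tuple2<String, Double> a, Tuple2<String, Double> b) {
            return Double.compare(a._2(), b._2());
        }
    }

    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RegionSaleAnalysis");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> lines = sc.textFile(args[0]);                 // 1. data loading
            String header = lines.first();
            JavaRDD<String> rows = lines.filter(l -> !l.equals(header));  // 2. header filtering

            JavaPairRDD<String, Double> regionSales = rows
                .mapToPair(l -> {                                         // 3. (Region, Sales) pairs
                    String[] f = l.split(",");
                    return new Tuple2<>(f[0], Double.parseDouble(f[1]));
                })
                .reduceByKey(Double::sum);                                // 4. sum sales per region

            Tuple2<String, Double> max = regionSales.max(new SalesComparator()); // 5. extremes
            Tuple2<String, Double> min = regionSales.min(new SalesComparator());

            sc.parallelize(Arrays.asList(max)).saveAsTextFile(args[1] + "/max_sales"); // 6. persist
            sc.parallelize(Arrays.asList(min)).saveAsTextFile(args[1] + "/min_sales");
        }
    }
}
```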
- In-Memory Processing: Intermediate results cached in RAM (RDDs)
- Lazy Evaluation: Optimized execution plans
- Parallel Computation: Distributed max/min operations
- Fault Tolerance: RDD lineage for automatic recovery (see the toy example below)
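These properties can be seen in a small, self-contained toy example (unrelated to the retail dataset): transformations only record lineage, actions trigger execution, and cache() keeps the materialized RDD in memory for reuse.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

// Toy illustration of lazy evaluation, caching, and lineage; not project code.
public class LazyCacheDemo {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("LazyCacheDemo").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5));

            // Transformation: only the lineage ("square each element") is recorded here.
            JavaRDD<Integer> squares = numbers.map(n -> n * n);

            // cache() marks the RDD to be kept in memory once it is first materialized.
            squares.cache();

            // Actions trigger execution; the second action reuses the cached partitions
            // instead of recomputing, and the lineage allows recovery if a partition is lost.
            long count = squares.count();
            int sum = squares.reduce(Integer::sum);
            System.out.println("count=" + count + ", sum=" + sum);
        }
    }
}
```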
max.txt:
(West,2347853.45)
min.txt:
(South,562341.78)
- Docker and Docker Compose installed
- Java Development Kit (JDK) 8 or higher
- Git for cloning the repository
git clone https://github.com/Crostino14/Big-Data-Project.git
cd Big-Data-Project
# Navigate to Hadoop cluster directory
cd hadoop-cluster-3.3.6-amd64
# Build Docker image
docker build -t hadoop-new .
# Create Docker network
docker network create --driver bridge hadoop_network
# Start cluster
docker compose up -d
# Navigate to Spark cluster directory
cd spark-cluster
# Start Spark cluster
docker compose up -d
# Access master container
docker container exec -ti master bash
# Format HDFS namenode (first time only)
hdfs namenode -format
# Start HDFS and YARN
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
# Upload dataset to HDFS
cd data/
hdfs dfs -put store3.csv hdfs:///input
# Run MapReduce job
hadoop jar Exercise1_BIGDATA.jar /input /output
# Retrieve results
hdfs dfs -cat /output/part-r-00000 > output_file.csv
# Remove HDFS directories
hdfs dfs -rm -r hdfs:///input
hdfs dfs -rm -r hdfs:///output
# Stop services
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/stop-yarn.sh
# Exit container
exit
# Stop Docker container
docker stop hadoop-new
# Access Spark master container
docker container exec -ti sp_master bash
# Navigate to execution directory
cd bin/testfiles/
# Run in SINGLE-NODE mode (for development/testing)
./launch_single.sh
# Run in CLUSTER mode (for production/benchmarking)
./launch_cluster.sh
# View results
cd output1/max_sales/
cat part-00000 > max.txt
cd ../min_sales/
cat part-00000 > min.txt

launch_single.sh (Local Mode):
/opt/bitnami/spark/bin/spark-submit \
--class it.unisa.diem.hpc.spark.exercise2.RegionSaleAnalysisDriver \
--master local \
./Exercise2_BIGDATA.jar \
./input ./output1

launch_cluster.sh (Cluster Mode):
/opt/bitnami/spark/bin/spark-submit \
--class it.unisa.diem.hpc.spark.exercise2.RegionSaleAnalysisDriver \
--master spark://spark-master:7077 \
--deploy-mode client \
--supervise \
--executor-memory 1G \
./Exercise2_BIGDATA.jar \
./input ./output2

| Aspect | Single-Node | Cluster |
|---|---|---|
| Execution Time | 0.2 - 0.3s per stage | 0.1 - 0.6s per stage |
| Network Overhead | None | Data shuffling between executors |
| Resource Utilization | Single executor, limited resources | Distributed across multiple executors |
| Scalability | Not scalable | Highly scalable |
| Fault Tolerance | Limited | Automatic recovery via RDD lineage |
| Cost Efficiency | More cost-effective for small datasets | Justified for large-scale processing |
| Debugging | Simplified debugging process | More complex distributed debugging |
- For small datasets (~MB range), single-node execution is comparable in performance
- Cluster benefits become significant with larger datasets (GB-TB range)
- Network overhead in cluster mode can offset benefits for trivial workloads
- Cluster mode provides scalability and fault tolerance critical for production
| Scenario | Recommended Mode |
|---|---|
| Development & Testing | Single-Node |
| Small datasets (< 100 MB) | Single-Node |
| Large datasets (> 1 GB) | Cluster |
| Production deployments | Cluster |
| Iterative ML algorithms | Cluster |
| Real-time processing | Cluster |
For comprehensive implementation details, code snippets, theoretical background, and in-depth performance analysis, please refer to the full project report.
The report includes:
- Detailed problem statements and motivation
- Complete source code with explanations
- MapReduce and Spark design patterns
- Execution results and output analysis
- Performance benchmarking methodology
- Docker configuration details
- Complete command reference
- Apache Hadoop Documentation
- Apache Spark Documentation
- MapReduce Design Patterns
- Docker for Big Data
- Agostino Cardamone (Student & Creator)
- Lecturer: Giuseppe D'Aniello