This project implements a Distributed Sorting Algorithm using Apache Hadoop’s MapReduce framework.
It is designed to sort large datasets efficiently by distributing the sorting work across the nodes of a Hadoop cluster, making use of Hadoop's parallel data processing capabilities.
The project was developed as part of the Distributed Systems course to demonstrate how distributed computing principles can handle data-intensive operations such as large-scale sorting.
- Implement a scalable sorting algorithm using Hadoop.
- Utilize the MapReduce paradigm to distribute the sorting workload across nodes.
- Demonstrate parallelism, fault tolerance, and data distribution in a real-world scenario.
- Understand Hadoop’s Mapper, Reducer, and Driver components and their coordination.
To run this project locally or on a cluster, you need:
- Java JDK 8 or higher
- Hadoop 3.x (set up in pseudo-distributed or fully distributed mode)
- Linux / macOS / Windows Subsystem for Linux (WSL) environment
- At least 8 GB of RAM and 10 GB of free disk space
- SSH configured for Hadoop
```
Distributed-Sorting-Algorithm/
│
├── src/
│   ├── SortMapper.java
│   ├── SortReducer.java
│   └── SortDriver.java
│
├── input/
│   └── numbers.txt        # Input file containing unsorted integers
│
├── output/                # Output folder generated by Hadoop after sorting
│
└── README.md
```
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, IntWritable, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int num = Integer.parseInt(value.toString());
        context.write(new IntWritable(num), NullWritable.get());
    }
}
```

- Reads each line (an integer) from the input file.
- Converts it into an integer key.
- Emits the number as the key and `NullWritable` as a placeholder value.
- Hadoop automatically sorts keys during the shuffle phase.
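Note that `Integer.parseInt` throws a `NumberFormatException` on a blank or malformed line, which would fail the whole map task. One common alternative is to skip bad records; as a sketch (the `ParseHelper` class and `tryParse` method are illustrative names, not part of this project), the parsing step could look like:

```java
import java.util.OptionalInt;

// Illustrative helper (not part of the project): parse a line defensively
// so that blank or malformed input lines are skipped instead of crashing
// the map task with a NumberFormatException.
public class ParseHelper {
    // Returns the parsed integer, or empty if the line is not a valid number.
    public static OptionalInt tryParse(String line) {
        try {
            return OptionalInt.of(Integer.parseInt(line.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("42"));   // OptionalInt[42]
        System.out.println(tryParse("abc"));  // OptionalInt.empty
    }
}
```

Inside the mapper, one would then emit a pair only when the parse succeeds.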
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}
```

- Receives the sorted keys (numbers) grouped from the mapper output.
- Simply writes them to the output, since Hadoop already sorts the keys during the shuffle/sort phase.
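One caveat: because the reducer writes each distinct key exactly once, an input containing repeated numbers (e.g. `5, 5, 3`) would have the repetitions collapsed. If duplicates should be preserved, a variant of the reduce method (a sketch, not part of the original project) can emit the key once per grouped value:

```java
// Sketch: duplicate-preserving variant of reduce(). The grouped values
// Iterable contains one NullWritable per occurrence of the key, so
// emitting the key once per value keeps every duplicate.
@Override
protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
    for (NullWritable ignored : values) {
        context.write(key, NullWritable.get());
    }
}
```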
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Distributed Sort");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.addOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

- Sets up the Hadoop job configuration.
- Specifies the Mapper, Reducer, and key/value output types.
- Defines the input and output paths from the command-line arguments.
- Submits the MapReduce job and waits for completion.
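The jar itself is not built by the steps above. One way to compile and package it (a sketch, assuming Hadoop is installed and the three `.java` files live in `src/`; the jar name matches the `hadoop jar` command used in this README):

```shell
# Compile against the Hadoop client libraries and package the job jar.
javac -classpath "$(hadoop classpath)" -d classes src/*.java
jar cf DistributedSort.jar -C classes .
```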
1. **Input Stage**
   - The dataset (e.g., `numbers.txt`) is stored in HDFS.
   - Each line represents an integer to be sorted.
2. **Map Stage**
   - Each node reads part of the input and emits `(number, NullWritable)` pairs.
3. **Shuffle and Sort Stage**
   - Hadoop automatically sorts keys (numbers) before passing them to the Reducer.
4. **Reduce Stage**
   - Reducers write out the sorted keys as the final output.
5. **Output Stage**
   - The sorted data is stored in HDFS in the output directory you specify.
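The five stages above can be simulated in plain Java (illustration only, not part of the Hadoop job; the class and method names are invented for this sketch). A `TreeMap` plays the role of Hadoop's shuffle-and-sort, grouping equal keys and ordering them in ascending order:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative local simulation of the MapReduce sort pipeline.
public class SortPipelineDemo {
    public static List<Integer> distributedSort(List<Integer> input) {
        // Map + shuffle/sort stage: count each emitted key; the TreeMap
        // keeps keys in sorted order, like Hadoop's shuffle-and-sort.
        TreeMap<Integer, Integer> shuffled = new TreeMap<>();
        for (int n : input) {
            shuffled.merge(n, 1, Integer::sum);
        }
        // Reduce + output stage: write each key once per occurrence, in key order.
        List<Integer> output = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : shuffled.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                output.add(e.getKey());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(distributedSort(Arrays.asList(5, 1, 9, 3, 7, 2)));
        // [1, 2, 3, 5, 7, 9]
    }
}
```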
Input (`numbers.txt`):

```
5
1
9
3
7
2
```

Command to Run:

```shell
hadoop jar DistributedSort.jar SortDriver /user/input/numbers.txt /user/output
```

Output (in HDFS `/user/output/part-r-00000`):

```
1
2
3
5
7
9
```
- Start Hadoop

```shell
start-dfs.sh
start-yarn.sh
```

- Create Directories in HDFS

```shell
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/input
```

- Upload Input File

```shell
hdfs dfs -put input/numbers.txt /user/input
```

- Run the Job

```shell
hadoop jar DistributedSort.jar SortDriver /user/input /user/output
```

- View Results

```shell
hdfs dfs -cat /user/output/part-r-00000
```

- The output file will contain the sorted integers in ascending order.
- Sorting is achieved through Hadoop's shuffle-and-sort phase, which orders the keys before they reach the reducers.
- With a single reducer, the output file is globally sorted; with multiple reducers, each output file is sorted individually, and a total order additionally requires a range partitioner such as Hadoop's `TotalOrderPartitioner`.
- The project demonstrates scalability, parallel processing, and fault tolerance.
This project is developed for educational purposes as part of the Distributed Systems course at Misr International University. You may reuse and modify it for learning and research purposes.