This project implements a Distributed Sorting Algorithm using Apache Hadoop’s MapReduce framework.
It is designed to sort large datasets efficiently by distributing the sorting work across the nodes of a Hadoop cluster, making use of Hadoop's parallel data processing capabilities.
The project was developed as part of the Distributed Systems course to demonstrate how distributed computing principles can handle data-intensive operations such as large-scale sorting.
- Implement a scalable sorting algorithm using Hadoop.
- Utilize the MapReduce paradigm to distribute the sorting workload across nodes.
- Demonstrate parallelism, fault tolerance, and data distribution in a real-world scenario.
- Understand Hadoop’s Mapper, Reducer, and Driver components and their coordination.
To run this project locally or on a cluster, you need:
- Java JDK 8 or higher
- Hadoop 3.x (set up in pseudo-distributed or fully distributed mode)
- Linux / macOS / Windows Subsystem for Linux (WSL) environment
- At least 8 GB of RAM and 10 GB of free disk space
- SSH configured for Hadoop
```
Distributed-Sorting-Algorithm/
│
├── src/
│   ├── SortMapper.java
│   ├── SortReducer.java
│   └── SortDriver.java
│
├── input/
│   └── numbers.txt        # Input file containing unsorted integers
│
├── output/                # Output folder generated by Hadoop after sorting
│
└── README.md
```
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, IntWritable, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int num = Integer.parseInt(value.toString());
        context.write(new IntWritable(num), NullWritable.get());
    }
}
```

- Reads each line (an integer) from the input file.
- Converts it into an integer key.
- Emits the number as the key and `NullWritable` as a placeholder value.
- Hadoop automatically sorts keys during the shuffle phase.
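Note that `Integer.parseInt` throws a `NumberFormatException` on a blank or malformed line, which would fail the whole map task. One common alternative is to skip bad records; as a sketch (the `ParseHelper` class and `tryParse` method are illustrative names, not part of this project), the parsing step could look like:

```java
import java.util.OptionalInt;

// Illustrative helper (not part of the project): parse a line defensively
// so that blank or malformed input lines are skipped instead of crashing
// the map task with a NumberFormatException.
public class ParseHelper {
    // Returns the parsed integer, or empty if the line is not a valid number.
    public static OptionalInt tryParse(String line) {
        try {
            return OptionalInt.of(Integer.parseInt(line.trim()));
        } catch (NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryParse("42"));   // OptionalInt[42]
        System.out.println(tryParse("abc"));  // OptionalInt.empty
    }
}
```

Inside the mapper, one would then emit a pair only when the parse succeeds.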
```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        context.write(key, NullWritable.get());
    }
}
```

- Receives the sorted keys (numbers) grouped from the mapper output.
- Simply writes them to the output, since Hadoop already sorts the keys during the shuffle/sort phase.
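One caveat: because the reducer writes each distinct key exactly once, an input containing repeated numbers (e.g. `5, 5, 3`) would have the repetitions collapsed. If duplicates should be preserved, a variant of the reduce method (a sketch, not part of the original project) can emit the key once per grouped value:

```java
// Sketch: duplicate-preserving variant of reduce(). The grouped values
// Iterable contains one NullWritable per occurrence of the key, so
// emitting the key once per value keeps every duplicate.
@Override
protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
        throws IOException, InterruptedException {
    for (NullWritable ignored : values) {
        context.write(key, NullWritable.get());
    }
}
```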
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Distributed Sort");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.addOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

- Sets up the Hadoop job configuration.
- Specifies the Mapper, Reducer, and key/value output types.
- Defines the input and output paths from the command-line arguments.
- Submits the MapReduce job and waits for completion.
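The jar itself is not built by the steps above. One way to compile and package it (a sketch, assuming Hadoop is installed and the three `.java` files live in `src/`; the jar name matches the `hadoop jar` command used in this README):

```shell
# Compile against the Hadoop client libraries and package the job jar.
javac -classpath "$(hadoop classpath)" -d classes src/*.java
jar cf DistributedSort.jar -C classes .
```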
1. **Input Stage**
   - The dataset (e.g., `numbers.txt`) is stored in HDFS.
   - Each line represents an integer to be sorted.
2. **Map Stage**
   - Each node reads part of the input and emits `(number, NullWritable)` pairs.
3. **Shuffle and Sort Stage**
   - Hadoop automatically sorts keys (numbers) before passing them to the Reducer.
4. **Reduce Stage**
   - Reducers write out the sorted keys as the final output.
5. **Output Stage**
   - The sorted data is stored in HDFS in the output directory you specify.
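The five stages above can be simulated in plain Java (illustration only, not part of the Hadoop job; the class and method names are invented for this sketch). A `TreeMap` plays the role of Hadoop's shuffle-and-sort, grouping equal keys and ordering them in ascending order:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative local simulation of the MapReduce sort pipeline.
public class SortPipelineDemo {
    public static List<Integer> distributedSort(List<Integer> input) {
        // Map + shuffle/sort stage: count each emitted key; the TreeMap
        // keeps keys in sorted order, like Hadoop's shuffle-and-sort.
        TreeMap<Integer, Integer> shuffled = new TreeMap<>();
        for (int n : input) {
            shuffled.merge(n, 1, Integer::sum);
        }
        // Reduce + output stage: write each key once per occurrence, in key order.
        List<Integer> output = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : shuffled.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                output.add(e.getKey());
            }
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(distributedSort(Arrays.asList(5, 1, 9, 3, 7, 2)));
        // [1, 2, 3, 5, 7, 9]
    }
}
```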
Input (`numbers.txt`):

```
5
1
9
3
7
2
```

Command to Run:

```shell
hadoop jar DistributedSort.jar SortDriver /user/input/numbers.txt /user/output
```

Output (in HDFS `/user/output/part-r-00000`):

```
1
2
3
5
7
9
```
- Start Hadoop

```shell
start-dfs.sh
start-yarn.sh
```

- Create Directories in HDFS

```shell
hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/input
```

- Upload Input File

```shell
hdfs dfs -put input/numbers.txt /user/input
```

- Run the Job

```shell
hadoop jar DistributedSort.jar SortDriver /user/input /user/output
```

- View Results

```shell
hdfs dfs -cat /user/output/part-r-00000
```

- The output file will contain the sorted integers in ascending order.
- Sorting is achieved through Hadoop's shuffle-and-sort phase, which orders the keys before they reach the reducers.
- With a single reducer, the output file is globally sorted; with multiple reducers, each output file is sorted individually, and a total order additionally requires a range partitioner such as Hadoop's `TotalOrderPartitioner`.
- The project demonstrates scalability, parallel processing, and fault tolerance.
This project is developed for educational purposes as part of the Distributed Systems course at Misr International University. You may reuse and modify it for learning and research purposes.