
Distributed Sorting Algorithm – Sort Large Datasets Across Threads or Nodes

Overview

This project implements a Distributed Sorting Algorithm using Apache Hadoop’s MapReduce framework.
It is designed to efficiently sort large datasets by distributing the sorting process across multiple threads or nodes, making use of Hadoop’s parallel data processing capabilities.

The project was developed as part of the Distributed Systems course to demonstrate how distributed computing principles can handle data-intensive operations such as large-scale sorting.


Project Objectives

  • Implement a scalable sorting algorithm using Hadoop.
  • Utilize the MapReduce paradigm to distribute the sorting workload across nodes.
  • Demonstrate parallelism, fault tolerance, and data distribution in a real-world scenario.
  • Understand Hadoop’s Mapper, Reducer, and Driver components and their coordination.

System Requirements

To run this project locally or on a cluster, you need:

  • Java JDK 8 or higher
  • Hadoop 3.x (set up in pseudo-distributed or fully distributed mode)
  • Linux / macOS / Windows Subsystem for Linux (WSL) environment
  • At least 8 GB of RAM and 10 GB of free disk space
  • SSH configured for Hadoop (passwordless login to localhost for pseudo-distributed mode)

Project Structure


Distributed-Sorting-Algorithm/
│
├── src/
│   ├── SortMapper.java
│   ├── SortReducer.java
│   └── SortDriver.java
│
├── input/
│   └── numbers.txt            # Input file containing unsorted integers
│
├── output/                    # Output folder generated by Hadoop after sorting
│
└── README.md


Implementation Details

1. Mapper

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Mapper;

public class SortMapper extends Mapper<LongWritable, Text, IntWritable, NullWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Emit each number as the key; Hadoop sorts map output keys during shuffle.
        int num = Integer.parseInt(value.toString().trim());
        context.write(new IntWritable(num), NullWritable.get());
    }
}
  • Reads each line (integer) from the input file.
  • Converts it into an integer key.
  • Emits the number as the key and NullWritable as the placeholder value.
  • Hadoop automatically sorts keys during the shuffle phase.

2. Reducer

import java.io.IOException;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Reducer;

public class SortReducer extends Reducer<IntWritable, NullWritable, IntWritable, NullWritable> {
    @Override
    protected void reduce(IntWritable key, Iterable<NullWritable> values, Context context)
            throws IOException, InterruptedException {
        // Write the key once per value so duplicate numbers survive the sort.
        for (NullWritable ignored : values) {
            context.write(key, NullWritable.get());
        }
    }
}
  • Receives the keys (numbers) already sorted by Hadoop during the shuffle/sort phase.
  • Writes each key once per occurrence, so duplicate numbers are preserved in the output.

3. Driver

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SortDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Distributed Sort");
        job.setJarByClass(SortDriver.class);
        job.setMapperClass(SortMapper.class);
        job.setReducerClass(SortReducer.class);
        job.setMapOutputKeyClass(IntWritable.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.addOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
  • Sets up the Hadoop job configuration.
  • Specifies the Mapper, Reducer, and output types.
  • Defines input and output paths.
  • Executes the MapReduce job.
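
Note that Hadoop guarantees sorted keys only within each reducer, so the job produces one globally sorted file only when a single reducer runs. That is the default configuration, but it can be made explicit in the driver with a one-line addition (shown here as a sketch):

// Force a single reducer: one part-r-00000 file, globally sorted.
// With several reducers, each part file is sorted only internally.
job.setNumReduceTasks(1);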

How It Works

  1. Input Stage

    • The dataset (e.g., numbers.txt) is stored in HDFS.
    • Each line represents an integer to be sorted.
  2. Map Stage

    • Each node reads part of the input and emits (number, NullWritable) pairs.
  3. Shuffle and Sort Stage

    • Hadoop automatically sorts keys (numbers) before passing them to the Reducer.
  4. Reduce Stage

    • Reducers write out the sorted keys as final output.
  5. Output Stage

    • The sorted data is stored in HDFS in the output directory you specify.
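
To keep the output globally ordered while still spreading the work across several reducers, the map output can be range-partitioned so that reducer i receives only keys smaller than those of reducer i+1; concatenating the part files then yields one sorted sequence. The class below is an illustrative sketch, not part of the original project; production jobs would typically sample the data and use Hadoop's TotalOrderPartitioner instead.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical range partitioner: assigns contiguous key ranges to reducers,
// assuming keys are roughly uniform non-negative integers.
public class RangePartitioner extends Partitioner<IntWritable, NullWritable> {
    @Override
    public int getPartition(IntWritable key, NullWritable value, int numPartitions) {
        long range = (long) Integer.MAX_VALUE + 1;
        long clamped = Math.max(key.get(), 0); // negatives all land in partition 0
        return (int) (clamped * numPartitions / range);
    }
}

It would be registered in the driver with job.setPartitionerClass(RangePartitioner.class).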

Example

Input (numbers.txt):

5
1
9
3
7
2

Command to Run:

hadoop jar DistributedSort.jar SortDriver /user/input/numbers.txt /user/output

Output (in HDFS /user/output/part-r-00000):

1
2
3
5
7
9
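
For experiments larger than this toy file, a small generator can produce input in the same one-integer-per-line format. This is a sketch; the file name, default count, and value range are arbitrary choices, not part of the original project:

import java.io.PrintWriter;
import java.util.Random;

// Writes `count` random integers, one per line, in the format SortMapper expects.
public class InputGenerator {
    public static void main(String[] args) throws Exception {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : 1_000_000;
        Random rnd = new Random();
        try (PrintWriter out = new PrintWriter("numbers.txt")) {
            for (int i = 0; i < count; i++) {
                out.println(rnd.nextInt(1_000_000));
            }
        }
    }
}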

How to Run Locally (Pseudo-Distributed Mode)

  1. Start Hadoop

start-dfs.sh
start-yarn.sh

  2. Create Directories in HDFS

hdfs dfs -mkdir /user
hdfs dfs -mkdir /user/input

  3. Upload Input File

hdfs dfs -put input/numbers.txt /user/input

  4. Run the Job

hadoop jar DistributedSort.jar SortDriver /user/input /user/output

  5. View Results

hdfs dfs -cat /user/output/part-r-00000
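
The steps above assume DistributedSort.jar has already been built. One possible way to compile and package it (a sketch, assuming the three source files sit in src/ and the hadoop command is on your PATH):

javac -classpath "$(hadoop classpath)" -d classes src/*.java
jar cf DistributedSort.jar -C classes .

Also note that Hadoop refuses to start a job whose output directory already exists; remove a stale one with hdfs dfs -rm -r /user/output before re-running.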

Expected Results

  • The output file contains the integers in ascending order.
  • Sorting is performed by Hadoop’s shuffle-and-sort phase, distributed across the nodes (or local threads) that run the map and reduce tasks; with the default single reducer, the result is one globally sorted file.
  • The project demonstrates scalability, parallel processing, and Hadoop’s built-in fault tolerance (failed tasks are automatically re-executed).

Contributors


License

This project is developed for educational purposes as part of the Distributed Systems course at Misr International University. You may reuse and modify it for learning and research purposes.
