This project is a custom implementation of Google's MapReduce architecture, combining Python for orchestration with C++ for performance-critical components.
MapReduce is a programming model designed for processing and generating large datasets on distributed systems. This implementation features:
- Python core for workflow orchestration, task distribution and fault tolerance
- C++ performance modules for data parsing, key-value pair generation and sorting
- Multiprocessing for parallel task execution and efficient resource utilization
- Fault-tolerance mechanism to handle process failures gracefully
The system consists of three main components:
- Master: Orchestrates the entire MapReduce job, manages workers and handles fault tolerance
- Mappers: Process input data and emit intermediate key-value pairs
- Reducers: Process intermediate data and generate final output
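To make the division of labor concrete, here is a minimal, simplified sketch of how a master could fan work out to mapper processes with Python's `multiprocessing`; the function names and the in-memory collection of pairs are illustrative assumptions, not this repo's exact API:

```python
import multiprocessing as mp

def run_mapper(mapper_id, chunk, map_fn):
    # A mapper processes its slice of the input and emits
    # intermediate (key, value) pairs via the emit callback.
    pairs = []
    for key, value in chunk:
        map_fn(key, value, lambda k, v: pairs.append((k, v)))
    return pairs  # the real system writes these to intermediate files instead

def run_master(chunks, map_fn, num_mappers=4):
    # The master assigns one chunk per mapper and gathers the results.
    with mp.Pool(processes=num_mappers) as pool:
        outputs = pool.starmap(
            run_mapper, [(i, chunk, map_fn) for i, chunk in enumerate(chunks)]
        )
    return [pair for out in outputs for pair in out]
```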
Performance-sensitive operations are implemented in C++ and integrated with Python via pybind11:
- Data parsing: Fast parsing of input data in mappers
- Key-value pair generation: Efficient emission and collection of intermediate data
- Sorting: High-performance sorting of keys before reduction
- JSON processing: Optimized reading and writing of intermediate and output data
Other key features:
- Fault tolerance: Automatically detects and restarts failed mapper processes
- Dynamic scaling: Configurable number of mappers and reducers
- Hybrid implementation: Combines Python's ease of use with C++'s performance
- Graceful fallback: Falls back to the pure-Python implementation if the C++ modules are unavailable (see the import sketch below)
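The graceful fallback is typically just a guarded import on the Python side. Here is a minimal sketch, assuming a compiled extension named `cpp_mapper` that exposes a `sort_pairs` function (both names are hypothetical, not necessarily what this repo builds):

```python
# Prefer the pybind11-built C++ module, fall back to pure Python if missing.
try:
    import cpp_mapper  # hypothetical extension built from src/cpp/mapper.cpp

    def sort_pairs(pairs):
        return cpp_mapper.sort_pairs(pairs)  # high-performance C++ key sort
except ImportError:
    def sort_pairs(pairs):
        # Pure-Python fallback: sort intermediate pairs by key.
        return sorted(pairs, key=lambda kv: kv[0])
```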
Building and running the project requires:
- Python 3.6+
- A C++ compiler with C++17 support
- pybind11
- The nlohmann-json C++ library
Install the Python dependencies and build the C++ extensions:

```bash
# Install Python dependencies
pip install -r requirements.txt

# Build C++ components
python setup.py build_ext --inplace

# Alternative: Use CMake
mkdir build && cd build
cmake ..
make
```

To run a MapReduce job:

- Create a configuration file (JSON format):
```json
{
    "input_file": "path/to/input.txt",
    "number_of_mapper": 4,
    "number_of_reducer": 2,
    "use_cpp": true
}
```

- Define your map and reduce functions:
```python
def my_map_function(key, value, emit):
    # Process input and emit intermediate key-value pairs
    for word in value.split():
        emit(word, 1)

def my_reduce_function(key, values, emit):
    # Process intermediate data and emit final output
    emit(key, sum(int(v) for v in values))
```

- Run the MapReduce job:
```python
from main import initialize_master

initialize_master(
    num_mappers=4,
    num_reducers=2,
    input_file="input.txt",
    user_defined_map=my_map_function,
    user_defined_reduce=my_reduce_function,
    kill_idx=-1  # Set to a mapper index to test fault tolerance
)
```
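If you prefer to drive the call from the JSON configuration created in the first step rather than hard-coding the arguments, something like the following works; the mapping from config keys to keyword arguments is an assumption based on the names shown above, and `config.json` is just an example path:

```python
import json

from main import initialize_master

with open("config.json") as f:  # the configuration file from step 1
    cfg = json.load(f)

initialize_master(
    num_mappers=cfg["number_of_mapper"],
    num_reducers=cfg["number_of_reducer"],
    input_file=cfg["input_file"],
    user_defined_map=my_map_function,
    user_defined_reduce=my_reduce_function,
)
```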
The repository is organized as follows:

```text
mapreduce/
│
├── src/
│   ├── python/
│   │   ├── __init__.py
│   │   ├── main.py
│   │   ├── map.py
│   │   └── reduce.py
│   │
│   └── cpp/
│       ├── mapper.cpp
│       ├── reducer.cpp
│       └── json_utils.hpp
│
├── setup.py
├── CMakeLists.txt
├── requirements.txt
└── README.md
```
The C++ components provide significant performance improvements over the pure Python implementation:
- Faster data parsing and processing
- More efficient memory usage
- Better sorting performance for large datasets
- Improved handling of large files
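These claims are easy to sanity-check on your own machine with a rough micro-benchmark. The sketch below reuses the hypothetical `cpp_mapper.sort_pairs` binding from earlier, so substitute whatever function the built extension actually exposes:

```python
import random
import string
import time

def bench(label, sort_fn, pairs):
    start = time.perf_counter()
    sort_fn(pairs)
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Synthetic intermediate data: one million random (word, 1) pairs.
pairs = [("".join(random.choices(string.ascii_lowercase, k=5)), 1)
         for _ in range(1_000_000)]

bench("pure Python sort", lambda p: sorted(p, key=lambda kv: kv[0]), pairs)
try:
    import cpp_mapper  # hypothetical pybind11 extension
    bench("C++ sort via pybind11", cpp_mapper.sort_pairs, pairs)
except ImportError:
    print("C++ extension not built; skipping the C++ timing")
```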
A classic MapReduce example is counting word occurrences in a text corpus:
```python
def map_word_count(_, line, emit):
    for word in line.split():
        emit(word.lower(), 1)

def reduce_word_count(word, counts, emit):
    emit(word, str(sum(int(count) for count in counts)))
```

Key concepts demonstrated by this project:

- Object-oriented design with clear separation of concerns
- Cross-language integration using pybind11
- Parallel processing with Python's multiprocessing
- Performance optimization through C++ implementations
- Fault tolerance with process monitoring and recovery (a sketch of the restart loop follows below)
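The monitoring-and-recovery idea boils down to a loop like the one below; `spawn_mapper` is a hypothetical factory that recreates a mapper `multiprocessing.Process`, and the polling interval is arbitrary, so treat this as a sketch of the approach rather than the repo's actual implementation:

```python
import time

def monitor_mappers(processes, spawn_mapper):
    # processes maps a mapper index to its multiprocessing.Process object.
    while any(p.is_alive() for p in processes.values()):
        for idx, proc in list(processes.items()):
            if not proc.is_alive() and proc.exitcode not in (0, None):
                # The mapper died with a failure code: recreate and restart it.
                print(f"mapper {idx} exited with code {proc.exitcode}, restarting")
                processes[idx] = spawn_mapper(idx)
                processes[idx].start()
        time.sleep(0.5)  # arbitrary polling interval
```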
Planned future improvements:
- Distributed execution across multiple machines
- Improved shuffling mechanism for better load balancing
- Additional built-in map and reduce functions
- Support for more complex data types and operations