This project is a custom, disk-based implementation of the MapReduce programming model, built from the ground up in Python. It processes a large collection of text documents to build the foundational data structure behind search engines: an inverted index.
The system is designed as a functional, standalone framework that showcases the core mechanics of fault-tolerant, parallel data processing. A master script (`main.py`) orchestrates the entire data pipeline, demonstrating the principles of large-scale data processing on a single machine.
The framework faithfully implements the key phases of a classic MapReduce job:
- **Master Controller (`main.py`):** Acts as the job coordinator, initiating and monitoring all worker processes and managing the data pipeline from start to finish.
- **Input Splitting:** The input data is pre-split into individual `.txt` files within the `data/` directory, with each file representing a data chunk for a Mapper task.
- **Map Phase (Parallel & Decentralized):** `main.py` leverages the `subprocess` module to launch `mapper.py` tasks in parallel, one for each input file.
  - **Decentralized Partitioning:** Each `mapper.py` worker contains its own hash-based partitioner. It writes its key-value output directly to multiple temporary files on disk, one for each conceptual Reducer. This mirrors how a distributed framework like Hadoop handles intermediate data and eliminates central bottlenecks (see the partitioner sketch after this list).
- **Shuffle & Sort Phase (the "Distributed Dance"):**
  - **Shuffle:** The `main.py` orchestrator performs the Shuffle phase. For each Reducer, it groups all corresponding temporary partition files from the Mappers (`*_part_0.out`, `*_part_1.out`, etc.).
  - **Sort:** For each group of files, `main.py` first concatenates them and then performs an in-memory sort. This prepares a perfectly ordered input stream for each Reducer, mirroring the outcome of a distributed merge-sort (see the orchestrator sketch after this list).
- **Reduce Phase (Parallel Aggregation):** `main.py` launches a separate `reducer.py` process for each partition, feeding it the corresponding sorted data via `stdin`. Each Reducer aggregates the data for its assigned keys and writes its final output to a unique part-file (e.g., `part-r-00000`) in the `output/` directory.
- **Final Aggregation:** The master script concatenates all Reducer part-files into a single, comprehensive `inverted_index.txt` file.
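To make the decentralized partitioning concrete, here is a minimal mapper sketch. The reducer count, file naming, and `tmp/` directory are assumptions for this example, not the project's exact API. Note the use of a stable digest: Python's built-in `hash()` is salted per process, so two mapper processes would route the same key to different Reducers.

```python
import hashlib
import sys
from pathlib import Path

NUM_REDUCERS = 4  # hypothetical; must match the orchestrator's reducer count

def partition(key, num_reducers):
    """Route a key to a reducer bucket with a process-stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

def run_mapper(doc_id, tokens, tmp_dir="tmp"):
    """Write (word, doc_id) pairs into one partition file per conceptual Reducer."""
    Path(tmp_dir).mkdir(exist_ok=True)
    outs = [open(Path(tmp_dir) / f"{doc_id}_part_{r}.out", "w")
            for r in range(NUM_REDUCERS)]
    try:
        for token in tokens:
            # Tab-separated key-value pair, e.g. "fox\tdoc1"
            outs[partition(token, NUM_REDUCERS)].write(f"{token}\t{doc_id}\n")
    finally:
        for f in outs:
            f.close()

if __name__ == "__main__":
    # Usage: python mapper.py data/doc1.txt
    path = Path(sys.argv[1])
    # The real mapper normalizes text with NLTK; plain split() keeps the sketch short.
    run_mapper(path.stem, path.read_text(encoding="utf-8").lower().split())
```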
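And a minimal sketch of how the orchestrator's Shuffle, Sort, and Reduce hand-off could look, again with hypothetical names and paths (`tmp/`, a root-level `reducer.py`, four reducers):

```python
import glob
import os
import subprocess

NUM_REDUCERS = 4  # hypothetical; must match the mappers' partition count

def shuffle_sort_reduce(tmp_dir="tmp", out_dir="output"):
    os.makedirs(out_dir, exist_ok=True)
    for r in range(NUM_REDUCERS):
        # Shuffle: collect every mapper's partition file destined for reducer r.
        lines = []
        for part in glob.glob(os.path.join(tmp_dir, f"*_part_{r}.out")):
            with open(part) as f:
                lines.extend(f.readlines())
        # Sort: an in-memory sort puts identical keys on adjacent lines,
        # so the reducer can aggregate in a single streaming pass.
        lines.sort()
        # Reduce: stream the sorted run to a reducer process via stdin.
        with open(os.path.join(out_dir, f"part-r-{r:05d}"), "w") as out:
            subprocess.run(["python", "reducer.py"],
                           input="".join(lines), text=True,
                           stdout=out, check=True)

if __name__ == "__main__":
    shuffle_sort_reduce()
```

Streaming sorted lines over `stdin` keeps each reducer a simple line-by-line filter, in the spirit of Hadoop Streaming.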
- **Core Language:** Python 3
- **Core Concepts:** MapReduce, Inverted Index, Distributed Systems Architecture, Data Pipelines, NLP
- **Process Management:** The `subprocess` module to orchestrate parallel worker processes.
- **File Management:** The `os`, `glob`, and `shutil` modules for managing the disk-based intermediate data flow.
- **Text Processing:** A robust text normalization pipeline using NLTK (sketched after this list), which includes:
  - Lowercasing
  - Tokenization and punctuation removal
  - Stop word filtering
  - Lemmatization to reduce words to their root form for higher-quality indexing
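A minimal sketch of such a normalization pipeline, assuming the NLTK data modules listed under the prerequisites below are installed; the project's actual `process_text.py` may differ in details:

```python
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def normalize(text):
    """Lowercase, tokenize, drop punctuation/stop words, and lemmatize."""
    tokens = word_tokenize(text.lower())            # lowercasing + tokenization
    return [
        LEMMATIZER.lemmatize(tok)                   # reduce to root form
        for tok in tokens
        if tok.isalpha() and tok not in STOP_WORDS  # drop punctuation and stop words
    ]

# Example: normalize("The quick foxes were running!")
# -> ['quick', 'fox', 'running']  (lemmatize defaults to noun POS)
```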
 
 
- Python 3.8+
- The NLTK library: `pip install nltk`
- NLTK data modules. Run this in a Python shell to download them:

```python
import nltk
nltk.download('punkt')      # For tokenization
nltk.download('stopwords')  # For stop words
nltk.download('wordnet')    # For lemmatization
```
 
Make sure your project is organized as follows for the imports to work correctly:
```
├── data/
│   ├── doc1.txt
│   └── doc2.txt
├── main.py
├── minisearch/
│   ├── __init__.py
│   └── mapreduce/
│       ├── __init__.py
│       ├── mapper.py
│       ├── reducer.py
│       └── process_text.py
└── README.md
```
1. **Clone the Repository:**

   ```bash
   git clone https://github.com/foo290/minisearch.git
   cd minisearch
   ```

2. **Add Data:** Place your `.txt` document files inside the `data/` directory.

3. **Run the Framework:** Execute the main orchestrator script from the root directory:

   ```bash
   python main.py
   ```
The script provides real-time feedback on each phase of the MapReduce job and cleans up temporary directories when it finishes. The final output is written to `output/inverted_index.txt`: each line consists of a word, a tab character, and a Python list of the unique document IDs in which that word was found.
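For illustration, with two hypothetical documents `doc1.txt` and `doc2.txt`, a few lines of the index might look like:

```
fox	['doc1', 'doc2']
quick	['doc1']
run	['doc2']
```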