This project implements a simplified MapReduce framework in Bash to count the total number of words in a large file. It demonstrates the fundamental concepts of distributed computing (splitting data, mapping, and reducing), all within the shell environment.
├── map.sh Map script: counts words in a chunk
├── reduce.sh Reduce script: sums word counts
├── run.sh Orchestrates the whole MapReduce flow
├── input/ Contains the original input file
├── chunks/ Contains split chunks of the input
├── maps/ Contains outputs from the map step
└── output/ Final word count result
This project uses the system dictionary file, `/usr/share/dict/words`, as the input source.
- It's a system dictionary file commonly found on Unix-like systems (Linux, macOS).
- Contains a list of English words, typically one word per line.
- Used by programs like spell checkers, word games, or autocomplete tools.
- It’s a large, clean, and consistent text file perfect for testing.
- Easy to access without needing to download anything.
- Great for benchmarking word count operations in this MapReduce simulation.
`run.sh` (orchestrator):
- Ensures the script exits on errors with `set -e`.
- Creates the required directories: `input`, `chunks`, `maps`, `output`.
- Copies the input text file (`/usr/share/dict/words`) to `input/`.
- Splits the input file into 4 equal-sized chunks.
- Runs `map.sh` in parallel on each chunk and stores the results in `maps/`.
- Waits for all mapping processes to finish.
- Runs `reduce.sh` to sum all word counts from the map outputs.
- Prints the final word count to the terminal.
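
The orchestration steps above can be sketched as follows. This is a minimal illustration, not the project's actual `run.sh`: the functions `map_chunk`, `reduce_counts`, and `run_flow`, the file name `words.txt`, and the `chunk_` prefix are all hypothetical, and the mapper and reducer are inlined so the sketch runs on its own. It also assumes the split is done by line count, which yields up to 4 chunks.

```shell
#!/usr/bin/env bash
set -e  # exit on the first failing command

# Hypothetical stand-ins for map.sh and reduce.sh so the sketch is self-contained.
map_chunk() { wc -w < "$1"; }
reduce_counts() { cat "$@" | awk '{ sum += $1 } END { print sum + 0 }'; }

# Hypothetical driver: run the whole flow on file $1 inside work directory $2.
run_flow() {
  local src=$1 work=$2 chunk lines per_chunk
  mkdir -p "$work/input" "$work/chunks" "$work/maps" "$work/output"
  cp "$src" "$work/input/words.txt"   # file name is an assumption

  # Split by line count, rounding up so the input yields at most 4 chunks.
  lines=$(wc -l < "$work/input/words.txt")
  per_chunk=$(( (lines + 3) / 4 ))
  split -l "$per_chunk" "$work/input/words.txt" "$work/chunks/chunk_"

  # Map every chunk in a background job, then wait for all of them.
  for chunk in "$work"/chunks/chunk_*; do
    map_chunk "$chunk" > "$work/maps/$(basename "$chunk").out" &
  done
  wait

  # Reduce: sum the per-chunk counts, save the total, and print it.
  reduce_counts "$work"/maps/*.out > "$work/output/total.txt"
  cat "$work/output/total.txt"
}

# Demo on a small generated file (10 words, one per line).
demo_dir=$(mktemp -d)
printf '%s\n' alpha beta gamma delta epsilon zeta eta theta iota kappa > "$demo_dir/sample.txt"
total=$(run_flow "$demo_dir/sample.txt" "$demo_dir/work")
echo "total words: $total"
```

Splitting on line boundaries matters for a one-word-per-line file: a byte-based split could cut a word in two and inflate the count, whereas a line-based split never does.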
`map.sh` (mapper):
- Accepts a file name as input (from `run.sh`).
- Uses `wc -w` to count the number of words in that file.
- Outputs the word count (just a number).
- Represents a single "Map" task that can be run independently on any data chunk.
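
A mapper along these lines might look like the sketch below. The function name `map_chunk` is hypothetical and the real `map.sh` may differ. The sketch assumes the file is fed to `wc -w` on stdin so that only the number is printed, without the file name.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical function form of map.sh: print the word count of file $1.
map_chunk() {
  # Reading from stdin keeps wc from echoing the file name;
  # tr strips the space padding that some wc implementations add.
  wc -w < "$1" | tr -d ' '
}

# Demo: a temporary chunk containing three words.
chunk=$(mktemp)
printf 'apple banana\ncherry\n' > "$chunk"
count=$(map_chunk "$chunk")
echo "$count"
rm -f "$chunk"
```

Because the mapper reads only its own chunk and writes only a number, any number of copies can run side by side without coordination.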
`reduce.sh` (reducer):
- Accepts multiple `.out` files (outputs from `map.sh`).
- Reads the number from each file (each is a word count).
- Adds all the numbers to get the total word count.
- Outputs the total to stdout (captured by `run.sh` into `output/total.txt`).
- Acts as a "Reducer": aggregates the results from all mappers.
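
The reduce step can be sketched like this. The function name `reduce_counts` and the demo file names are hypothetical, and the sketch assumes each `.out` file holds exactly one integer.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical function form of reduce.sh: sum the number stored in each file.
reduce_counts() {
  local total=0 file n
  for file in "$@"; do
    n=$(<"$file")            # read the single count stored in the file
    total=$(( total + n ))   # shell arithmetic tolerates surrounding whitespace
  done
  echo "$total"
}

# Demo: three fake mapper outputs.
dir=$(mktemp -d)
echo 120 > "$dir/a.out"
echo 75  > "$dir/b.out"
echo 5   > "$dir/c.out"
sum=$(reduce_counts "$dir"/*.out)
echo "$sum"
```

Since addition is commutative and associative, the order in which the mapper outputs are listed does not affect the total.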
Follow these steps to run the scripts:
chmod +x run.sh map.sh reduce.sh
./run.sh
This prints the total word count to the terminal and saves it in `output/total.txt`.
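
One way to sanity-check the result is to compare the saved total against a single `wc -w` over the whole input. The helper `verify_total` below is hypothetical and not part of the project. The demo uses temporary files so it runs anywhere.

```shell
#!/usr/bin/env bash
set -e

# Hypothetical helper: succeed if the number saved in $2 equals `wc -w` of $1.
verify_total() {
  local expected actual
  expected=$(wc -w < "$1" | tr -d ' ')
  actual=$(tr -d ' \n' < "$2")
  [ "$expected" = "$actual" ]
}

# Demo: a three-word input and a matching saved total.
dir=$(mktemp -d)
printf 'one two\nthree\n' > "$dir/input.txt"
echo 3 > "$dir/total.txt"
if verify_total "$dir/input.txt" "$dir/total.txt"; then
  status=match
else
  status=mismatch
fi
echo "$status"
```

After a real run, the equivalent check would be `verify_total /usr/share/dict/words output/total.txt`.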