Input: file@linenum\tline
(where line may include tabs or spaces, and likely has many words)
Desired output: Each word, stripped of punctuation, paired with a comma-delimited list of file@linenum locations for quick lookup.
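Because the line text may itself contain tabs, only the first tab separates the location from the line. A minimal sketch of that parsing step, assuming records arrive as plain strings (`parse_record` and the sample file name are hypothetical, not part of the script below):

```ruby
# Hypothetical helper: split one input record into [location, line].
# The limit of 2 splits on the first tab only, so tabs inside the
# line text are left intact.
def parse_record(record)
  location, line = record.chomp.split("\t", 2)
  [location, line.to_s]
end

location, line = parse_record("notes.txt@12\tfoo\tbar baz\n")
puts location   # prints notes.txt@12
p line          # the remaining text, with its embedded tab preserved
```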
inverted-index.rb
#!/usr/bin/ruby
require "rubydoop"
HADOOP_HOME = "/usr/local/hadoop"
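map do |location, line|
  # Split on whitespace; each token is one candidate word.
  line.split(/\s+/).each do |token|
    # Lowercase, then trim leading and trailing non-letters (punctuation, digits).
    word = token.downcase.gsub(/\A[^a-z]+|[^a-z]+\z/, '')
    next if word.empty?
    emit word, location
  end
end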
reduce do |key, values|
  emit key, values.join(",")
end
./inverted-index.rb start
Assuming you have your Hadoop environment all set up, this will fire up a task with the appropriate map and reduce functions.
./inverted-index.rb simulate test-file.txt
This executes a poor man's local MR:
cat test-file.txt | ./inverted-index.rb map | sort | ./inverted-index.rb reduce
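The same map, sort, reduce flow can be sketched in plain Ruby without Hadoop or rubydoop. This is only an illustration: `map_record`, `reduce_key`, `simulate`, and the sample records are hypothetical names and data, and trimming every leading and trailing non-letter is an assumption about what "stripped of punctuation" should mean.

```ruby
# Hypothetical stand-in for the map step: one [word, location] pair per word.
def map_record(location, line)
  pairs = line.split(/\s+/).map do |token|
    # Assumption: trim any leading/trailing non-letters after lowercasing.
    word = token.downcase.gsub(/\A[^a-z]+|[^a-z]+\z/, "")
    [word, location] unless word.empty?
  end
  pairs.compact
end

# Hypothetical stand-in for the reduce step: join all locations for a word.
def reduce_key(word, locations)
  [word, locations.join(",")]
end

# Map every record, sort the pairs (the "sort" stage of the pipeline),
# then group by word and reduce each group.
def simulate(records)
  pairs = records.flat_map do |record|
    location, line = record.chomp.split("\t", 2)
    map_record(location, line.to_s)
  end
  pairs.sort.group_by(&:first).map do |word, group|
    reduce_key(word, group.map(&:last))
  end
end

sample = ["a.txt@1\tHello, world!", "a.txt@2\t(hello) again"]
simulate(sample).each { |word, locations| puts "#{word}\t#{locations}" }
```

Sorting before grouping mirrors what the `sort` stage does in the shell pipeline: all pairs for one word end up adjacent, so the reduce step sees each key exactly once.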