Read a file of text, determine the n most frequently used words, and print out a sorted list of those words along with their frequencies.
I found this problem statement on this blog
A PDF of the original magazine exchange is here.
I thought I'd do it in Go to see how hard it is.
I found a guy who uses this problem as a job interview question. He offers quite a bit of what I call "interview analysis". He's got a Go version to compare this code to. Later, he wrote a performance analysis of it.
- McIlroy's shell script version
- My Go version, 68 lines, some blank, some boilerplate
Test inputs:
- Small count word frequency
- Upper/lower case word frequency
Even a few simple test cases revealed bugs in my program, so it's probably not as good as Knuth's literate, almost guaranteed-correct version.
The Go program imports 7 packages from the standard library
and defines 2 new types.
It makes heavy use of the standard library for line-oriented I/O
and for breaking each line into words.
One of the new types arises because of the way the "top N" requirement
interacts with the sort package's use of Go interfaces.
The actual unique-word counting gets done using a Go map type.
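For the curious, here's a minimal sketch of a program with that shape. It is not the code from wc.go; the identifiers and the exact splitting and tie-breaking rules are my own assumptions. It reads lines with a bufio.Scanner, counts unique words in a map, and defines a slice type implementing sort.Interface so the sort package can order the pairs by descending count.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"sort"
	"strconv"
	"strings"
)

// wordCount pairs a word with its frequency.
type wordCount struct {
	word  string
	count int
}

// byCount implements sort.Interface; it exists only because
// sort.Sort needs Len, Swap, and Less on a concrete type.
type byCount []wordCount

func (b byCount) Len() int      { return len(b) }
func (b byCount) Swap(i, j int) { b[i], b[j] = b[j], b[i] }
func (b byCount) Less(i, j int) bool {
	if b[i].count != b[j].count {
		return b[i].count > b[j].count // higher counts first
	}
	return b[i].word < b[j].word // break ties alphabetically
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintf(os.Stderr, "usage: %s N < file\n", os.Args[0])
		os.Exit(1)
	}
	n, err := strconv.Atoi(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "N must be an integer:", err)
		os.Exit(1)
	}

	// Count unique words with a map, reading input line by line.
	counts := make(map[string]int)
	scanner := bufio.NewScanner(os.Stdin)
	for scanner.Scan() {
		for _, w := range strings.Fields(scanner.Text()) {
			counts[strings.ToLower(w)]++
		}
	}

	// Copy the map into a slice so it can be sorted.
	pairs := make(byCount, 0, len(counts))
	for w, c := range counts {
		pairs = append(pairs, wordCount{w, c})
	}
	sort.Sort(pairs)

	if n > len(pairs) {
		n = len(pairs)
	}
	for _, p := range pairs[:n] {
		fmt.Printf("%d %s\n", p.count, p.word)
	}
}
```

The byCount type is the one the "top N" requirement forces into existence: maps aren't ordered, so the counts have to be copied into something the sort package can work with.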
The actual work took far less than 3 hours. The git logs show almost exactly 3 hours between the first and last check-in of wc.go, but I ate a meal in there and had a longish discussion with a family member.
This compares very favorably with the 9.5 hours spent implementing a C-language version published contemporaneously with Knuth's solution.
The real problem is deciding what's the correct output
for something like this very README.md file.
```
1499 % ./mcilroy 5 < README.md
13 the
10 a
9 and
9 go
9 of
1500 % ./wc 5 < README.md
13 the
10 a
9 of
8 and
6 go
```

McIlroy's shell script turns all punctuation marks into newlines, exposing "-and-", which appears in a URL, as the word "and". My program does not truck with punctuation, so it misses that "and". Should the string "revisiting-knuth-and-mcilroys-word-count-programs" count as 7 "words", or just one?
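To make that concrete, here's a small hypothetical comparison of the two splitting rules, written in Go rather than shell. The strings.FieldsFunc call with a letters-only predicate roughly mimics McIlroy's punctuation-to-newlines step; plain strings.Fields splits on whitespace alone, closer to what a punctuation-ignoring program sees. This is an illustration written for this README, not code from wc.go.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

func main() {
	s := "revisiting-knuth-and-mcilroys-word-count-programs"

	// Whitespace-only splitting: the whole string is one "word".
	fmt.Println(len(strings.Fields(s))) // prints 1

	// Splitting on anything that isn't a letter, roughly what
	// McIlroy's `tr -cs A-Za-z '\n'` step does: seven "words".
	words := strings.FieldsFunc(s, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	fmt.Println(len(words)) // prints 7
}
```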
This small difference illustrates a major problem in software engineering: what's obvious to the requirements writer may not be obvious to the programmer. Programming entails attention to "minor" details like that, and whether a program is "correct" or "incorrect" hinges on them.
I'm going to both agree and disagree with the blogger, Franklin Chen. I'm skeptical of literate programming for the same reasons he is. But I don't think that "general purpose" programming languages are a universal replacement for shell scripts. It's far harder to get a C, Go, or Haskell program correct for some weird variant on word frequency than it is to get a shell script correct. Knowing how to do the kind of text processing that McIlroy's script illustrates is a valuable skill when dealing with more than a page or two of information. Modern sensibilities are distorted by Microsoft's vision of what constitutes a reasonable amount of information to process: Word and Excel have lowered that amount to about a page. Simple, one-off text processing shell scripts have a lot of value.