
Calculate word frequency

Philipp Zumstein edited this page Feb 24, 2017 · 2 revisions


The easiest way now is to use the hocr-wordfreq tool (see hocr-wordfreq).

Alternative way (may still work)

It is possible to calculate the word frequencies of a hocr file with just a few standard command-line programs.

1. Text extraction

To extract just the text of a hocr file, one can use a sed command that replaces every tag around the actual text with a space:

sed 's/<[^>]*>/ /g' sample.hocr
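To see what this substitution does, here is a minimal sketch that runs it on a single line of markup. The fragment and the helper name strip_tags are made up for illustration and are not taken from a real hocr file.

```shell
# Hypothetical one-line hocr fragment, used only for illustration.
fragment="<span class='ocrx_word' title='bbox 10 20 50 40'>Alice</span> <span class='ocrx_word' title='bbox 60 20 90 40'>was</span>"

# strip_tags (hypothetical helper name): replace each tag with a space,
# so that words from adjacent elements do not run together.
strip_tags() {
    sed 's/<[^>]*>/ /g'
}

# Prints the two words surrounded by extra spaces, which awk ignores later.
printf '%s\n' "$fragment" | strip_tags
```

Note that sed works line by line, so a tag that is split across two lines would not be matched by this pattern. Character entities such as &amp; are also left in place, and the punctuation removal in the next step would reduce that one to the word "amp".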

(Please see also for alternatives of this step.)

2. Calculate word frequencies

Calculate the word frequencies with an awk program as described in the GNU awk's User Guide, section 14.3.5:

# wordfreq.awk --- print list of word frequencies

{
    $0 = tolower($0)    # remove case distinctions
    # remove punctuation
    gsub(/[^[:alnum:]_[:blank:]]/, "", $0)
    for (i = 1; i <= NF; i++)
        freq[$i]++      # count each word
}

END {
    for (word in freq)
        printf "%s\t%d\n", word, freq[word]
}

3. Sort and output

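Sort the list numerically by the count in the second field, in descending order, with sort -k 2nr, and output the top 10 words with head (which prints the first 10 lines by default).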
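The sorting step can be tried on its own with a small hand-written frequency list. This is only a sketch: the helper name top_words and its optional line-count argument are assumptions added for illustration, and the sample counts are borrowed from the example output on this page.

```shell
# top_words (hypothetical helper): sort "word<TAB>count" lines by the
# count in field 2, numerically and in reverse, then keep the first N
# lines. N defaults to 10, which is what plain `head` prints.
top_words() {
    sort -k 2nr | head -n "${1:-10}"
}

# Four unsorted sample lines; keep only the three most frequent words.
printf 'down\t9\nthe\t24\nto\t21\nshe\t20\n' | top_words 3
```

Passing a different number to head -n is a simple way to get a longer or shorter list than the default top 10.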

Putting it together

After saving the awk program to a file wordfreq.awk, one can run the whole pipeline with

 sed 's/<[^>]*>/ /g' sample.hocr | awk -f wordfreq.awk | sort -k 2nr | head

The output will then look like this example:

the     24
to      21
she     20
it      18
and     15
of      15
a       13
was     12
her     10
down    9
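If a separate wordfreq.awk file is not wanted, the same pipeline can be wrapped in a shell function with the awk program passed inline. The following is a sketch under that assumption; the function name hocr_top_words is hypothetical and is not part of any existing tool.

```shell
# hocr_top_words (hypothetical helper): print the most frequent words
# of the hocr file given as the first argument.
hocr_top_words() {
    sed 's/<[^>]*>/ /g' "$1" |
    awk '
        {
            $0 = tolower($0)                        # remove case distinctions
            gsub(/[^[:alnum:]_[:blank:]]/, "", $0)  # remove punctuation
            for (i = 1; i <= NF; i++)
                freq[$i]++                          # count each word
        }
        END {
            for (word in freq)
                printf "%s\t%d\n", word, freq[word]
        }
    ' |
    sort -k 2nr |
    head
}

# Usage, assuming sample.hocr exists in the current directory:
#   hocr_top_words sample.hocr
```

The bracket expressions [:alnum:] and [:blank:] are POSIX character classes. They work in GNU awk, but some older awk implementations may not support them.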