This is a collection of command-line corpus tools.
For convenience it also includes submodules from Ken's preprocess and Rico's BPE repos. To include these submodules when cloning, add the `--recursive` flag:

```bash
git clone --recursive https://github.com/jonsafari/habeas-corpus
```
Many of the scripts accept a `--help` command-line argument, so you can often get more specific usage information by typing:

```bash
./myscript.sh --help
```

Most of these scripts read their input from stdin and write text to stdout, so the typical Unix command-line usage is:

```bash
./myscript.sh < input.txt > output.txt
```

You can also pipe these commands together with other commands.
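For example, the preprocessing and vocabulary scripts described below can be chained into a single pipeline (the file names here are just placeholders):

```bash
# Lowercase a corpus, then list its word types by frequency
./lowercase.pl < corpus.txt | ./vocab_top.sh > vocab.txt
```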
- allcat - Works like `cat` regardless of whether the file is plaintext, .gz, .bz2, .lzma, or .xz (best); see the example after this list
- char_freq.sh - Tabulates the frequency of characters in a text
- corpus_get.sh - Builds a corpus from a remote website, recursively downloading all webpages
- generate_language.sh - Randomly generates text, given a language model and vocabulary
- mediawiki_dict.sh - Converts MediaWiki dumps to a bilingual dictionary. Use the wiki dump from the smaller language
- par_map.sh - Parallel-maps a command over a single file or multiple files (i.e. parallelizes a command)
- rev_words.pl - Reverses word order in each line. For example "how are you?" becomes "you? are how"
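For instance, allcat can feed a compressed corpus into any of the other tools without decompressing it on disk first (assuming it is invoked like `cat`; the file name is hypothetical):

```bash
# Tabulate character frequencies of an xz-compressed corpus
./allcat corpus.txt.xz | ./char_freq.sh > char_freqs.txt
```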
- Preprocessing:
- digit_conflate.pl - Conflates all numerical digits to a single digit. For example, 48,250.75 -> 55,555.55 (see the example after this list)
- lowercase.pl - Lowercases all texts. Works on almost all bicameral orthographies
- Tok-tok - General tokenizer, suitable for many languages
- uppercase.pl - Uppercases all texts. Works on almost all bicameral orthographies
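A typical preprocessing pipeline might chain these scripts together, for example conflating digits and then lowercasing (the file names are placeholders):

```bash
# Normalize a corpus: conflate digits to 5, then lowercase everything
./digit_conflate.pl < corpus.txt | ./lowercase.pl > corpus.norm.txt
```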
- Vocabulary extraction:
- vocab.sh - Lists the vocabulary (set of unique words) from a text corpus
- vocab_top.sh - Lists a frequency-sorted vocabulary (set of unique words) from a text corpus; see the example after this list
- vocab_filter.py - Replaces infrequent tokens with `<unk>`
- word2int.py - Converts words to integers in an online (streaming) fashion
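For example, to inspect the most frequent word types or to count the vocabulary size of a corpus (the file name is a placeholder):

```bash
# Show the 20 most frequent words
./vocab_top.sh < corpus.txt | head -n 20

# Count how many unique words the corpus contains
./vocab.sh < corpus.txt | wc -l
```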
- Experiment management:
- generate_splits.pl - Generates train/dev/test splits from a whole corpus. Every n lines go to the training set, then one line to the development set, then one line to the test set
- subcorpora.pl - Builds subcorpora from a whole corpus, increasing in size exponentially
- Penn Treebank formatting:
- penn2conll.sh - Converts Penn treebank format to POS-tagged CoNLL-X format
- penn2plain.pl - Converts Penn treebank format to plaintext; see the example after this list
- penn2qtree.sh - Converts Penn treebank format to Qtree format for use in LaTeX documents
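For example, to pull plain text out of a treebank file (the file names are hypothetical):

```bash
# Strip Penn treebank bracketing, leaving plain sentences
./penn2plain.pl < wsj_0001.mrg > wsj_0001.txt
```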
- Character set encoding:
- buckwalter2unicode.pl - Converts from Buckwalter transliteration to UTF-8 native Arabic script
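A quick sketch of usage, assuming the script follows the stdin/stdout convention above and the standard Buckwalter mapping:

```bash
# "mrHbA" in Buckwalter transliteration should come out as native Arabic script
echo "mrHbA" | ./buckwalter2unicode.pl
# expected output: مرحبا
```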
- Classical cryptanalysis:
- pivot.pl - Rotates text by 90 degrees
- playfair_digraph_freq.sh - Tabulates Playfair-style digraph character frequencies
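Assuming these scripts also read from stdin like the rest of the collection, they might be used on a ciphertext like this (the file name is a placeholder):

```bash
# Tabulate digraph frequencies in a ciphertext, Playfair-style
./playfair_digraph_freq.sh < ciphertext.txt

# View the same ciphertext rotated by 90 degrees
./pivot.pl < ciphertext.txt
```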
- grep - Search for a pattern in text. All of the command-line arguments are useful, especially `-r`, `-i`, `-c`, `-e`, `-v`, `-o`, and `-w` (spells "ricevow")
- shuf - Randomizes the lines of the input. For reproducible pseudo-randomization, try `--random-source=input.txt`
- sort - Sorts the lines of the input. For large corpora, use `LANG=C sort --buffer-size=4000M --temporary-directory=./`
- tac - Reverses line-order of the input
- wc - Counts the number of lines, words (tokens), and characters. The `--max-line-length` argument is also useful; see the example below
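Putting a few of these together (corpus.txt is a placeholder):

```bash
# Count lines, words, and the length of the longest line
wc --lines --words --max-line-length corpus.txt

# Reproducibly shuffle the corpus, using the corpus itself as the randomness source
shuf --random-source=corpus.txt corpus.txt > corpus.shuf.txt
```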