This repository contains Python scripts to preprocess and sentence-align parallel (or monolingual) corpora. It originally relied heavily on Uplug and, to a lesser extent, TreeTagger. Nowadays, NLTK, Spacy, and Stanza are also supported.
First, make sure Uplug and TreeTagger are installed. Users who intend to use NLTK, Spacy, or Stanza instead can skip this step.
Then, create a virtual environment and activate it:
$ python -m venv venv
$ source venv/bin/activate
Then, install the requirements in this virtual environment via:
$ pip install -r requirements.txt
Finally, create the executables preprocess and align via:
$ pip install --editable .
If you intend to use NLTK for tokenization, be sure to download the Punkt models:
$ python
>>> import nltk
>>> nltk.download('punkt')
If you intend to use Spacy for tokenization and tagging, be sure to download the language models (small models suffice; replace en_core_web_sm with the model for your language):
$ python -m spacy download en_core_web_sm
With Stanza, models will be downloaded on the fly.
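As a quick sanity check after installation, the sketch below reports which of the optional backends are importable in the active virtual environment. This snippet is illustrative and not part of the repository:

```python
# Report which optional NLP backends are importable in the current
# environment; "missing" simply means the corresponding install/download
# step above still has to be run.
import importlib.util

for backend in ("nltk", "spacy", "stanza"):
    status = "installed" if importlib.util.find_spec(backend) else "missing"
    print(f"{backend}: {status}")
```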
The preprocess script preprocesses raw text and then tokenizes and tags it in the XML format used in OPUS.
Run preprocess to process all unformatted .txt-files in a folder.
Usage:
$ preprocess [OPTIONS] FOLDER_IN FOLDER_OUT {de|en|nl|sv|ca|es|fr|it|pt|ro|bg|pl|ru|br|hi|ar|mx}
Options:
- `--from_word` to use .docx-files as input, rather than .txt-files.
- `--tokenizer` to tokenize the files; choose either:
  - `uplug` (requires installation of Uplug (and language support in Uplug))
  - `nltk` (requires installation of the Punkt models (and language support in Punkt))
  - `spacy` (requires installation of the Spacy models (and language support in Spacy))
  - `stanza` (requires installation of the Stanza models (and language support in Stanza))
  - `treetagger` (uses the very naive tokenization in the treetagger-xml package; not recommended!)
- `--tag` to tag the files (requires installation of Spacy or TreeTagger (and language support))
- `--dialog` to detect dialogs in the generated .xml-files.
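The options above combine into a single command line. The helper below is a hypothetical illustration (not part of the package) of how the documented flags fit together:

```python
# Hypothetical helper that assembles a preprocess command line from the
# documented options; the function name and defaults are illustrative.
def build_preprocess_cmd(folder_in, folder_out, language,
                         tokenizer=None, tag=False, dialog=False,
                         from_word=False):
    cmd = ["preprocess"]
    if from_word:
        cmd.append("--from_word")
    if tokenizer:
        cmd += ["--tokenizer", tokenizer]
    if tag:
        cmd.append("--tag")
    if dialog:
        cmd.append("--dialog")
    return cmd + [folder_in, folder_out, language]

# e.g. tokenize and tag English .txt-files with Spacy:
print(" ".join(build_preprocess_cmd("books/raw", "books/xml", "en",
                                    tokenizer="spacy", tag=True)))
# → preprocess --tokenizer spacy --tag books/raw books/xml en
```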
Run align to sentence-align .xml-files in a working directory. Requires installation of Uplug.
Usage:
$ align [OPTIONS] WORKING_DIR [[de|en|nl|sv|ca|es|fr|it|pt|ro|bg|pl|ru|br|hi|ar|mx]]...
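A complete run could chain both executables from Python. In this sketch, the folder names and the English–German language pair are illustrative, and each command only executes when the corresponding executable is actually on the PATH:

```python
import shutil
import subprocess

# Chain preprocess and align for an assumed English–German corpus;
# the directories here are placeholders.
cmds = [
    ["preprocess", "--tokenizer", "stanza", "--tag",
     "books/raw", "books/xml", "en"],
    ["align", "books/xml", "en", "de"],
]
for cmd in cmds:
    if shutil.which(cmd[0]):  # only available after `pip install --editable .`
        subprocess.run(cmd, check=True)
    else:
        print("skipped (not installed):", " ".join(cmd))
```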
| Genus | Language | ISO | Preprocessing | Tokenization | Tagging |
|---|---|---|---|---|---|
| Germanic | German | de | ✔ | ✔ | ✔ |
| Germanic | English | en | ✔ | ✔ | ✔ |
| Germanic | Dutch | nl | ✔ | ✔ | ✔ |
| Germanic | Swedish | sv | ✔ | ✔ (NLTK) | ✗ |
| Romance | Catalan | ca | ✔ | ✔ (Stanza) | ✔ |
| Romance | Spanish | es | ✔ | ✔ | ✔ |
| Romance | French | fr | ✔ | ✔ | ✔ |
| Romance | Italian | it | ✔ | ✔ | ✔ |
| Romance | Portuguese | pt | ✔ | ✔ (Uplug) | ✔ |
| Romance | Romanian | ro | ✔ | ✔ (Stanza) | ✔ |
| Hellenic | Greek | el | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Belarusian | be | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Bulgarian | bg | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Czech | cs | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Croatian | hr | ✔ | ✔ (Spacy) | ✔ |
| Baltic | Lithuanian | lt | ✔ | ✔ (Spacy) | ✔ |
| Baltic | Latvian | lv | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Macedonian | mk | ✔ | ✔ (Spacy) | ✔ |
| Slavic | Polish | pl | ✔ | ✔ (Spacy) | ✔ |
| Slavic | Russian | ru | ✔ | ✔ (Uplug) | ✔ |
| Slavic | Slovak | sk | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Slovenian | sl | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Serbian | sr | ✔ | ✔ (Stanza) | ✗ |
| Slavic | Ukrainian | uk | ✔ | ✔ (Spacy) | ✔ |
| Celtic | Breton | br | ✔ | ✗ | ✗ |
| Indo-Aryan | Hindi | hi | ✔ | ✔ (Stanza) | ✔ |
Some comments:
- For Dutch tokenization, Uplug can optionally use Alpino (recommended).
- For Swedish, consider using Stagger for part-of-speech tagging.
- Spanish varieties (Mexican Spanish (mx) and Rioplatense Spanish (ar)) are supported by referring to the Spanish parameters.
- Note that the Portuguese NLTK Punkt parameters are based upon Brazilian Portuguese.
- For Hindi, we use RNNTagger instead of TreeTagger.
Run the tests via
$ python -m unittest discover
In preprocess_corpora/tests/data/alice, you can find the example corpus used in the tests.
This corpus was compiled from Lewis Carroll's Alice in Wonderland and its translations into German, French, and Italian.
The source files are available through Project Gutenberg.