time-in-translation/preprocess-corpora

preprocess-corpora

This repository contains Python scripts to preprocess and sentence-align parallel (or monolingual) corpora. The scripts originally relied heavily on Uplug and, to a lesser extent, TreeTagger. They now also support NLTK, Spacy, and Stanza.

Installation

First, make sure Uplug and TreeTagger are installed. If you only intend to use NLTK, Spacy, or Stanza, you can skip this step.

Then, create a virtual environment and activate it:

$ python -m venv venv
$ source venv/bin/activate

Then, install the requirements in this virtual environment via:

$ pip install -r requirements.txt

Finally, create the executables preprocess and align via:

$ pip install --editable .

If you intend to use NLTK for tokenization, be sure to download the Punkt models:

$ python
>>> import nltk
>>> nltk.download('punkt')

If you intend to use Spacy for tokenization and tagging, be sure to download the language models. The small models suffice; replace en_core_web_sm with the model for your language:

$ python -m spacy download en_core_web_sm
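Models fetched with the command above are installed as ordinary Python packages, so their presence can be checked before starting a long preprocessing run. The snippet below is a minimal sketch of such a check. The helper model_is_installed is hypothetical and not part of this repository.

```python
import importlib.util


def model_is_installed(model_name: str) -> bool:
    """Return True if a package named `model_name` can be imported.

    Hypothetical helper: Spacy models downloaded with
    `python -m spacy download` are regular Python packages, so looking
    up the import spec is enough to see whether one is present.
    """
    return importlib.util.find_spec(model_name) is not None


if __name__ == "__main__":
    # en_core_web_sm is the example model from the text; replace as needed.
    name = "en_core_web_sm"
    status = "installed" if model_is_installed(name) else "missing"
    print(f"{name}: {status}")
```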

With Stanza, models will be downloaded on-the-fly.

Usage

Preprocessing

The preprocess script preprocesses raw text and then tokenizes and tags it, writing the result in the XML format used in OPUS.
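To give an impression of the target format, the sketch below builds a small OPUS-like tree from text that has already been tokenized. The element names (text, p, s, w) and the id scheme are assumptions based on common OPUS/Uplug conventions; the attributes actually written by preprocess may differ. The function build_document is hypothetical and serves only as an illustration.

```python
import xml.etree.ElementTree as ET


def build_document(paragraphs):
    """Build an OPUS-like XML tree from pre-tokenized text.

    `paragraphs` is a list of paragraphs; each paragraph is a list of
    sentences; each sentence is a list of token strings.
    The element names and id scheme are assumptions, not taken from
    the repository.
    """
    text = ET.Element("text")
    for p_num, sentences in enumerate(paragraphs, start=1):
        p = ET.SubElement(text, "p", id=str(p_num))
        for s_num, tokens in enumerate(sentences, start=1):
            s_id = f"{p_num}.{s_num}"
            s = ET.SubElement(p, "s", id=f"s{s_id}")
            for w_num, token in enumerate(tokens, start=1):
                w = ET.SubElement(s, "w", id=f"w{s_id}.{w_num}")
                w.text = token
    return text


# One paragraph containing one tokenized sentence.
doc = build_document([[["Alice", "was", "tired", "."]]])
print(ET.tostring(doc, encoding="unicode"))
```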

Run preprocess to process all unformatted .txt-files in a folder.

Usage:

$ preprocess [OPTIONS] FOLDER_IN FOLDER_OUT {de|en|nl|sv|ca|es|fr|it|pt|ro|bg|pl|ru|br|hi|ar|mx}

Options:

  • --from_word to use .docx-files as input, rather than .txt-files.
  • --tokenizer to tokenize the files; choose one of:
    • uplug (requires Uplug to be installed, with support for the language)
    • nltk (requires the Punkt models to be installed, with support for the language)
    • spacy (requires the Spacy models to be installed, with support for the language)
    • stanza (requires the Stanza models to be installed, with support for the language)
    • treetagger (uses the very naive tokenization in the treetagger-xml package; not recommended!)
  • --tag to tag the files (requires Spacy or TreeTagger to be installed, with support for the language)
  • --dialog to detect dialogs in the generated .xml-files.
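When preprocess is called from another Python script, for example to loop over several languages, it can help to assemble the argument list programmatically. The sketch below uses only the options listed above. It assumes that --tokenizer takes a value and that --from_word, --tag, and --dialog are plain flags. The function preprocess_command and the folder names are hypothetical.

```python
TOKENIZERS = {"uplug", "nltk", "spacy", "stanza", "treetagger"}


def preprocess_command(folder_in, folder_out, language, tokenizer=None,
                       from_word=False, tag=False, dialog=False):
    """Assemble the argument list for a `preprocess` call.

    Hypothetical helper; assumes --tokenizer takes a value and the
    other options are boolean flags.
    """
    if tokenizer is not None and tokenizer not in TOKENIZERS:
        raise ValueError(f"unknown tokenizer: {tokenizer}")
    command = ["preprocess"]
    if from_word:
        command.append("--from_word")
    if tokenizer is not None:
        command += ["--tokenizer", tokenizer]
    if tag:
        command.append("--tag")
    if dialog:
        command.append("--dialog")
    # Positional arguments come last: input folder, output folder, language.
    command += [folder_in, folder_out, language]
    return command


# Example folder names are placeholders.
command = preprocess_command("raw/en", "xml/en", "en",
                             tokenizer="nltk", tag=True)
print(" ".join(command))
```

The resulting list can be passed to subprocess.run(command, check=True), provided the preprocess executable is on the PATH.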

Alignment

Run align to sentence-align .xml-files in a working directory. Requires installation of Uplug.

Usage:

$ align [OPTIONS] WORKING_DIR [[de|en|nl|sv|ca|es|fr|it|pt|ro|bg|pl|ru|br|hi|ar|mx]]...

Supported languages

Genus Language ISO Preprocessing Tokenization Tagging
Germanic German de
Germanic English en
Germanic Dutch nl
Germanic Swedish sv ✔ (NLTK)
Romance Catalan ca ✔ (Stanza)
Romance Spanish es
Romance French fr
Romance Italian it
Romance Portuguese pt ✔ (Uplug)
Romance Romanian ro ✔ (Stanza)
Hellenic Greek el ✔ (Stanza)
Slavic Belarusian be ✔ (Stanza)
Slavic Bulgarian bg ✔ (Stanza)
Slavic Czech cs ✔ (Stanza)
Slavic Croatian hr ✔ (Spacy)
Baltic Lithuanian lt ✔ (Spacy)
Baltic Latvian lv ✔ (Stanza)
Slavic Macedonian mk ✔ (Spacy)
Slavic Polish pl ✔ (Spacy)
Slavic Russian ru ✔ (Uplug)
Slavic Slovak sk ✔ (Stanza)
Slavic Slovenian sl ✔ (Stanza)
Slavic Serbian sr ✔ (Stanza)
Slavic Ukrainian uk ✔ (Spacy)
Celtic Breton br
Indo-Aryan Hindi hi ✔ (Stanza)

Some comments:

  • For Dutch tokenization, Uplug can optionally use Alpino (recommended).
  • For Swedish, consider using Stagger for part-of-speech tagging.
  • Spanish varieties (Mexican Spanish (mx) and Rioplatense Spanish (ar)) are supported by referring to the Spanish parameters.
  • Note that the Portuguese NLTK Punkt parameters are based upon Brazilian Portuguese.
  • For Hindi, we use RNNTagger instead of TreeTagger.
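The handling of Spanish varieties can be pictured as a simple lookup from a variety code to the language whose parameters are used. The sketch below is an illustration only. The dictionary VARIETIES and the function base_language are hypothetical and do not describe how the repository implements this internally.

```python
# Hypothetical mapping from variety codes to the language whose
# parameters are used (based on the comment about mx and ar above).
VARIETIES = {
    "mx": "es",  # Mexican Spanish
    "ar": "es",  # Rioplatense Spanish
}


def base_language(code: str) -> str:
    """Return the language code whose parameters apply to `code`.

    Codes that are not listed as a variety are returned unchanged.
    """
    return VARIETIES.get(code, code)


for code in ("mx", "ar", "nl"):
    print(code, "->", base_language(code))
```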

Tests

Run the tests via:

$ python -m unittest discover

In preprocess_corpora/tests/data/alice, you can find the example corpus used in the tests. This corpus was compiled from Lewis Carroll's Alice in Wonderland and its translations into German, French, and Italian. The source files are available through Project Gutenberg.

About

Creating (parallel) corpora from scratch using Uplug tooling
