This repository contains Python scripts to preprocess and sentence-align parallel (or monolingual) corpora. It originally relied heavily on Uplug and, to a lesser extent, TreeTagger. Nowadays, NLTK, Spacy, and Stanza are also supported.
First, make sure Uplug and TreeTagger are installed. Users who intend to use NLTK, Spacy, or Stanza instead can skip this step.
Then, create a virtual environment and activate it:
$ python -m venv venv
$ source venv/bin/activate
Then, install the requirements in this virtual environment via:
$ pip install -r requirements.txt
Finally, create the executables preprocess and align via:
$ pip install --editable .
If you intend to use NLTK for tokenization, be sure to download the Punkt models:
$ python
>>> import nltk
>>> nltk.download('punkt')
If you intend to use Spacy for tokenization and tagging, be sure to download the language models (small models suffice; replace en_core_web_sm with the model for your language):
$ python -m spacy download en_core_web_sm
With Stanza, models will be downloaded on the fly.
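As a quick sanity check after installation, the sketch below reports which of the optional backends are importable in the active virtual environment. This snippet is illustrative and not part of the repository:

```python
# Report which optional NLP backends are importable in the current
# environment; "missing" simply means the corresponding install/download
# step above still has to be run.
import importlib.util

for backend in ("nltk", "spacy", "stanza"):
    status = "installed" if importlib.util.find_spec(backend) else "missing"
    print(f"{backend}: {status}")
```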
The preprocess script preprocesses raw text and then tokenizes and tags it in the XML format used in OPUS.
Run preprocess to process all unformatted .txt-files in a folder.
Usage:
$ preprocess [OPTIONS] FOLDER_IN FOLDER_OUT {de|en|nl|sv|ca|es|fr|it|pt|ro|bg|pl|ru|br|hi|ar|mx}
Options:
- `--from_word` to use .docx-files as input, rather than .txt-files.
- `--tokenizer` to tokenize the files; choose either:
  - `uplug` (requires installation of Uplug (and language support in Uplug))
  - `nltk` (requires installation of the Punkt models (and language support in Punkt))
  - `spacy` (requires installation of the Spacy models (and language support in Spacy))
  - `stanza` (requires installation of the Stanza models (and language support in Stanza))
  - `treetagger` (uses the very naive tokenization in the treetagger-xml package; not recommended!)
- `--tag` to tag the files (requires installation of Spacy or TreeTagger (and language support))
- `--dialog` to detect dialogs in the generated .xml-files.
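The options above combine into a single command line. The helper below is a hypothetical illustration (not part of the package) of how the documented flags fit together:

```python
# Hypothetical helper that assembles a preprocess command line from the
# documented options; the function name and defaults are illustrative.
def build_preprocess_cmd(folder_in, folder_out, language,
                         tokenizer=None, tag=False, dialog=False,
                         from_word=False):
    cmd = ["preprocess"]
    if from_word:
        cmd.append("--from_word")
    if tokenizer:
        cmd += ["--tokenizer", tokenizer]
    if tag:
        cmd.append("--tag")
    if dialog:
        cmd.append("--dialog")
    return cmd + [folder_in, folder_out, language]

# e.g. tokenize and tag English .txt-files with Spacy:
print(" ".join(build_preprocess_cmd("books/raw", "books/xml", "en",
                                    tokenizer="spacy", tag=True)))
# → preprocess --tokenizer spacy --tag books/raw books/xml en
```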
Run align to sentence-align .xml-files in a working directory. Requires installation of Uplug.
Usage:
$ align [OPTIONS] WORKING_DIR [[de|en|nl|sv|ca|es|fr|it|pt|ro|bg|pl|ru|br|hi|ar|mx]]...
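A complete run could chain both executables from Python. In this sketch, the folder names and the English–German language pair are illustrative, and each command only executes when the corresponding executable is actually on the PATH:

```python
import shutil
import subprocess

# Chain preprocess and align for an assumed English–German corpus;
# the directories here are placeholders.
cmds = [
    ["preprocess", "--tokenizer", "stanza", "--tag",
     "books/raw", "books/xml", "en"],
    ["align", "books/xml", "en", "de"],
]
for cmd in cmds:
    if shutil.which(cmd[0]):  # only available after `pip install --editable .`
        subprocess.run(cmd, check=True)
    else:
        print("skipped (not installed):", " ".join(cmd))
```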
| Genus | Language | ISO | Preprocessing | Tokenization | Tagging |
|---|---|---|---|---|---|
| Germanic | German | de | ✔ | ✔ | ✔ |
| Germanic | English | en | ✔ | ✔ | ✔ |
| Germanic | Dutch | nl | ✔ | ✔ | ✔ |
| Germanic | Swedish | sv | ✔ | ✔ (NLTK) | ✗ |
| Romance | Catalan | ca | ✔ | ✔ (Stanza) | ✔ |
| Romance | Spanish | es | ✔ | ✔ | ✔ |
| Romance | French | fr | ✔ | ✔ | ✔ |
| Romance | Italian | it | ✔ | ✔ | ✔ |
| Romance | Portuguese | pt | ✔ | ✔ (Uplug) | ✔ |
| Romance | Romanian | ro | ✔ | ✔ (Stanza) | ✔ |
| Hellenic | Greek | el | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Belarusian | be | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Bulgarian | bg | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Czech | cs | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Croatian | hr | ✔ | ✔ (Spacy) | ✔ |
| Baltic | Lithuanian | lt | ✔ | ✔ (Spacy) | ✔ |
| Baltic | Latvian | lv | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Macedonian | mk | ✔ | ✔ (Spacy) | ✔ |
| Slavic | Polish | pl | ✔ | ✔ (Spacy) | ✔ |
| Slavic | Russian | ru | ✔ | ✔ (Uplug) | ✔ |
| Slavic | Slovak | sk | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Slovenian | sl | ✔ | ✔ (Stanza) | ✔ |
| Slavic | Serbian | sr | ✔ | ✔ (Stanza) | ✗ |
| Slavic | Ukrainian | uk | ✔ | ✔ (Spacy) | ✔ |
| Celtic | Breton | br | ✔ | ✗ | ✗ |
| Indo-Aryan | Hindi | hi | ✔ | ✔ (Stanza) | ✔ |
Some comments:
- For Dutch tokenization, Uplug can optionally use Alpino (recommended).
- For Swedish, consider using Stagger for part-of-speech tagging.
- Spanish varieties (Mexican Spanish (mx) and Rioplatense Spanish (ar)) are supported by referring to the Spanish parameters.
- Note that the Portuguese NLTK Punkt parameters are based upon Brazilian Portuguese.
- For Hindi, we use RNNTagger instead of TreeTagger.
Run the tests via
$ python -m unittest discover
In preprocess_corpora/tests/data/alice, you can find the example corpus used in the tests.
This corpus was compiled from Lewis Carroll's Alice in Wonderland and its translations into German, French, and Italian.
The source files are available through Project Gutenberg.