Required software:
- OpusTools: library and tools for accessing OPUS data
- LanguageCodes: tools for converting language codes
Optional software:
Data:
- local copy of all OPUS data (set `OPUS_HOME` in the Makefile)
- make sure that the scripts in `scripts/` work as they should and that all software is properly installed
- run `make all` to compile the entire corpus and readme files (or, better, run it in parallel, for example with four parallel jobs: `make -j 4 all`)
- upload the data to ObjectStorage using a-tools at CSC:
module load allas
allas-conf
make upload
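The first step above (checking that scripts and software are in place) can be partly automated before starting a long build. The following is a minimal sketch, not part of this repository: the default `OPUS_HOME` path is a placeholder, `REQUIRED_TOOLS` and the function names are hypothetical, and the tool list only contains `make` because the command names installed by OpusTools and LanguageCodes are not listed here.

```shell
#!/bin/sh
# Preflight sketch: report missing prerequisites before running `make all`.
# ASSUMPTIONS: the default OPUS_HOME path is a placeholder, and REQUIRED_TOOLS
# only lists `make`; add the command names that your OpusTools and
# LanguageCodes installations provide.
OPUS_HOME="${OPUS_HOME:-/path/to/OPUS}"
REQUIRED_TOOLS="${REQUIRED_TOOLS:-make}"

# Succeeds if $1 is an existing directory.
check_dir() { [ -d "$1" ]; }

# Succeeds if the shell can resolve $1 as a command.
check_cmd() { command -v "$1" >/dev/null 2>&1; }

# Print one status line per prerequisite; return non-zero if any is missing.
preflight() {
    rc=0
    for dir in "$OPUS_HOME" scripts; do
        if check_dir "$dir"; then
            echo "ok: directory $dir"
        else
            echo "MISSING: directory $dir"
            rc=1
        fi
    done
    for tool in $REQUIRED_TOOLS; do
        if check_cmd "$tool"; then
            echo "ok: command $tool"
        else
            echo "MISSING: command $tool"
            rc=1
        fi
    done
    return "$rc"
}
```

Run from the repository root, `preflight && make -j 4 all` would start the build only when every check passes. The check only confirms that things exist; it does not verify that the scripts behave correctly.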
The data set can also be compiled in separate steps, for example building the test/dev sets and the training data sets separately:
make -j testdata
make -j traindata
make subsets
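The staged build above can be wrapped so that a failing stage stops the sequence. This is a sketch under assumptions: `run_stages` and `MAKE_CMD` are hypothetical names, not part of the Makefile, and the order of the targets is taken from the listing above without knowing their actual dependencies. Note that GNU make's `-j` without a number places no limit on the number of simultaneous jobs, so the sketch passes an explicit job count.

```shell
#!/bin/sh
# Staged build sketch: run the given make targets in order and stop at the
# first failure.
# ASSUMPTION: MAKE_CMD is a hypothetical override for illustration,
# e.g. MAKE_CMD="make -n" prints the commands without executing them.
MAKE_CMD="${MAKE_CMD:-make}"

# Usage: run_stages <jobs> <target>...
run_stages() {
    njobs="$1"
    shift
    for target in "$@"; do
        echo "== $MAKE_CMD -j $njobs $target"
        # $MAKE_CMD is left unquoted on purpose so "make -n" splits into words.
        $MAKE_CMD -j "$njobs" "$target" || return 1
    done
}
```

For example, `run_stages 4 testdata traindata subsets` would build the three targets one after another with four parallel jobs each.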
- don't require local access to OPUS data (make everything accessible via https://github.com/Helsinki-NLP/OPUS-API)
- get languages available in OPUS from OPUS-API (see OPUS_LANGS in Makefile)
- include baseline systems and recipes for filtering data