Scripts for extracting verbal and nominal inflectional paradigms from Apertium transducers for Turkic languages and converting them to the UniMorph schema. This code was used to generate the UniMorph data for Sakha and Tuvan, which was included in the SIGMORPHON 2021 Shared Task 0. Note: the shared task data was generated using the transducer versions from March 2021.
The scripts currently work only for Tuvan and Sakha but should be relatively straightforward to extend to other Turkic languages represented in Apertium.
Please contact mryskina@cs.cmu.edu with any questions.
The corresponding Apertium analyzers must be installed. You can find the installation instructions at the respective repositories:
- Tuvan: apertium-tyv
- Sakha: apertium-sah
Other requirements:
- Python >= 3.6
To run the extraction and conversion pipeline end-to-end, use:
./run.sh {tyv|sah} path/to/apertium/
where path/to/apertium/ is the path to the directory one level above the transducer directory (path/to/apertium/apertium-{tyv|sah}).
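The relationship between the two arguments and the transducer directory can be sketched in Python as follows. This is only an illustration of the layout described above, not part of the pipeline: the names `SUPPORTED_LANGS` and `transducer_dir` are hypothetical, and the helper does not check that the directory actually exists.

```python
from pathlib import Path

# Language codes accepted by run.sh according to the usage line above.
# (Hypothetical constant name, introduced here for illustration only.)
SUPPORTED_LANGS = ("tyv", "sah")


def transducer_dir(lang, apertium_root):
    """Return the expected transducer directory for `lang`.

    `apertium_root` is the directory one level above the transducer
    directory, i.e. the second argument passed to run.sh.
    """
    if lang not in SUPPORTED_LANGS:
        raise ValueError(
            "unsupported language code {!r}; expected one of {}".format(
                lang, ", ".join(SUPPORTED_LANGS)
            )
        )
    # The transducer directory is named apertium-<lang> and sits
    # directly under the given root.
    return Path(apertium_root) / "apertium-{}".format(lang)


if __name__ == "__main__":
    # Show where the pipeline would expect each transducer to live.
    for code in SUPPORTED_LANGS:
        print(code, "->", transducer_dir(code, "path/to/apertium/"))
```

A check like this can be useful before invoking run.sh, since a root path that points at the transducer directory itself (rather than its parent) would resolve to a non-existent location.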