- poetry (see how to install it)
- scikit-learn
- jupyter
- jupyterlab
- matplotlib
- pandas
- python-crfsuite
$ poetry install

Training pipelines are available inside the notebooks/ folder. Each notebook
can be executed and reproduced cell by cell.
- linearCRF: This setting uses all the available information. The features are listed in the first cell of each notebook.
- POSLess: This setting excludes the POS tags.
- HMMLike: This setting uses the minimum information, i.e. the current letter and the immediately preceding one. We use this name because the configuration carries information similar to that of an HMM, but the model is built with CRFs.
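As a rough illustration of the HMMLike setting, the sketch below builds one feature dictionary per letter using only the current letter and the one before it. The function names and feature keys (`letter`, `prev_letter`, `BOS`) are hypothetical and chosen for this example; the actual features are the ones listed in the first cell of each notebook.

```python
def hmm_like_features(word, i):
    """Features for the letter at position i of word.

    Only the current letter and the immediately preceding letter are used.
    The key names are illustrative, not taken from the notebooks.
    """
    features = {"letter": word[i]}
    if i > 0:
        features["prev_letter"] = word[i - 1]
    else:
        # First letter of the word: there is no preceding letter.
        features["BOS"] = True
    return features


def word_to_features(word):
    """Return the list of per-letter feature dictionaries for a word."""
    return [hmm_like_features(word, i) for i in range(len(word))]
```

A richer setting such as linearCRF would add more keys to the same dictionaries (for example the POS tag), while POSLess would leave the POS key out.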
Inside the notebooks/ folder there are notebooks with the suffix
_ejemplos.ipynb that serve as an experimental environment. Those notebooks are
useful to see the pre-trained models in action.
- L1 = 0.0
- L2 = 0.0
- Max iterations = 50
- Model name: HMMLike_baseline_k_[1-3].crfsuite
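A minimal sketch of how these baseline values could be passed to python-crfsuite, assuming the default L-BFGS trainer, where `c1` and `c2` are the L1 and L2 regularization coefficients. The helper names and the reading of `k` as a run index from 1 to 3 are assumptions made for this example, not taken from the notebooks.

```python
# Baseline hyperparameters, written with python-crfsuite parameter names
# (c1 = L1 coefficient, c2 = L2 coefficient).
BASELINE_PARAMS = {"c1": 0.0, "c2": 0.0, "max_iterations": 50}


def model_name(k):
    """File name for run k (assumed to range over 1..3)."""
    return f"HMMLike_baseline_k_{k}.crfsuite"


def train_baseline(X_train, y_train, k):
    """Train one baseline model and return the path of the saved file.

    X_train: list of sequences, each a list of per-letter feature dicts.
    y_train: list of label sequences aligned with X_train.
    """
    # Imported here so the helpers above work without the package installed.
    import pycrfsuite

    trainer = pycrfsuite.Trainer(verbose=False)
    for xseq, yseq in zip(X_train, y_train):
        trainer.append(xseq, yseq)
    trainer.set_params(BASELINE_PARAMS)
    path = model_name(k)
    trainer.train(path)
    return path
```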
- Delete duplicated lines
$ sort -u corpus > corpus_uniq
- Show duplicated lines
$ diff --color corpus_sort corpus_uniq
To solve encoding/decoding problems with python-crfsuite, we
substitute the following Otomí characters:
- u̱ -> μ
- a̱̱ -> α
- e̱ -> ε
- i̱ -> ι
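The substitution above could be applied along these lines. This is a sketch that assumes each Otomí vowel is stored as a base letter followed by a single combining macron below (U+0331); because the source is two code points, `str.replace` is used rather than a one-character translation table. The names `OTOMI_SUBSTITUTIONS` and `substitute_otomi` are hypothetical.

```python
# Base vowel + U+0331 (combining macron below) -> single Greek letter.
# Assumes the corpus stores these vowels in decomposed form.
OTOMI_SUBSTITUTIONS = {
    "u\u0331": "μ",
    "a\u0331": "α",
    "e\u0331": "ε",
    "i\u0331": "ι",
}


def substitute_otomi(text):
    """Replace each two-code-point Otomí vowel with its one-character stand-in."""
    for source, target in OTOMI_SUBSTITUTIONS.items():
        text = text.replace(source, target)
    return text
```

Inverting the dictionary and applying the same loop restores the original characters when presenting model output.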
- Get the glossed corpus
- Text preprocessing
- Make the feature lists for each letter in sentences
- Split test and train sets
- Train and build the models
- Generate tags and test performance
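Two of the steps above, splitting the data and testing performance, can be sketched with the standard library alone. The split ratio, the seed, the function names, and the use of plain per-label accuracy are assumptions for illustration; the notebooks may split and score differently.

```python
import random


def split_train_test(sentences, test_ratio=0.2, seed=42):
    """Shuffle the sentences reproducibly and return (train, test)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def label_accuracy(gold, predicted):
    """Fraction of labels that match across aligned label sequences."""
    pairs = [
        (g, p)
        for gold_seq, pred_seq in zip(gold, predicted)
        for g, p in zip(gold_seq, pred_seq)
    ]
    if not pairs:
        return 0.0
    return sum(g == p for g, p in pairs) / len(pairs)
```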
