This repository contains state-of-the-art language models and a classifier for Malayalam, which is spoken by the Malayali people in the Indian state of Kerala and the union territories of Lakshadweep and Puducherry.
The models trained here have been used in the Natural Language Toolkit for Indic Languages (iNLTK).
- iNLTK Headlines Corpus - Malayalam : Uses the Malayalam News Dataset prepared above
Architecture/Dataset | Malayalam Wikipedia Articles (perplexity) |
---|---|
ULMFiT | 26.39 |
TransformerXL | 25.79 |
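
Assuming the figures in the table above are perplexities, the sketch below shows how such a number relates to a language model's average cross-entropy loss: perplexity is the exponential of the mean negative log-likelihood per token. The function names are illustrative and are not taken from this repository's notebooks.

```python
import math


def perplexity_from_loss(mean_loss):
    """Convert a mean cross-entropy loss (in nats per token) to perplexity."""
    return math.exp(mean_loss)


def perplexity(token_log_probs):
    """Perplexity from the natural-log probability assigned to each token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every token has perplexity of about 4.
uniform_ppl = perplexity([math.log(0.25)] * 3)

# A mean loss of 3.25 nats per token corresponds to a perplexity of about 25.79.
example_ppl = perplexity_from_loss(3.25)
```

Lower is better: a perplexity of about 26 means the model is, on average, as uncertain as if it were choosing uniformly among about 26 tokens at each step.
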
Dataset | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|
iNLTK Headlines Corpus - Malayalam | 95.56 | 93.29 | Link |
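
For readers unfamiliar with the two metrics in the table above, here is a minimal sketch of how accuracy and the multiclass Matthews correlation coefficient (MCC) can be computed from true and predicted labels. It is a plain-Python illustration, not the code used in the linked notebook; the labels in the example are made up. The table values appear to be scaled to percentages, whereas these functions return accuracy in [0, 1] and MCC in [-1, 1].

```python
import math
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def multiclass_mcc(y_true, y_pred):
    """Multiclass MCC (the R_K statistic), computed from label counts."""
    s = len(y_true)                                        # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))        # correct predictions
    true_counts = Counter(y_true)                          # t_k per class
    pred_counts = Counter(y_pred)                          # p_k per class
    labels = set(true_counts) | set(pred_counts)
    sum_pt = sum(pred_counts[k] * true_counts[k] for k in labels)
    sum_p2 = sum(v * v for v in pred_counts.values())
    sum_t2 = sum(v * v for v in true_counts.values())
    denom = math.sqrt((s * s - sum_p2) * (s * s - sum_t2))
    if denom == 0:
        return 0.0  # convention when all samples fall in a single class
    return (c * s - sum_pt) / denom


# Hypothetical labels, for illustration only.
y_true = ["sports", "sports", "business", "business"]
y_pred = ["sports", "business", "business", "business"]
acc = accuracy(y_true, y_pred)        # 0.75
mcc = multiclass_mcc(y_true, y_pred)  # about 0.577
```

MCC is reported alongside accuracy because it accounts for how predictions are distributed across classes, so it stays informative when classes are imbalanced.
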
Architecture | Visualization |
---|---|
ULMFiT | Embeddings projection |
TransformerXL | Embeddings projection |
Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
iNLTK Headlines Corpus - Malayalam | (5036, 630, 630) | 95.56 | 93.29 | Link |
Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
iNLTK Headlines Corpus - Malayalam | (503, 630, 630) | 82.38 | 73.47 | Link |
Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
---|---|---|---|---|
iNLTK Headlines Corpus - Malayalam | (503, 630, 630) | 84.29 | 76.36 | Link |
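
The training sizes in the last two tables (503) appear to be about a tenth of the full training set (5036). The sketch below shows one way such a reduced training set could be drawn reproducibly. It is an assumption for illustration; the function name, the seed, and the use of uniform random sampling are not taken from the notebooks.

```python
import random


def subsample(rows, fraction, seed=0):
    """Return a reproducible random subset containing `fraction` of `rows`."""
    rng = random.Random(seed)          # local RNG so global state is untouched
    k = int(len(rows) * fraction)      # round down, e.g. 5036 * 0.1 -> 503
    return rng.sample(rows, k)


# Stand-in for the 5036 training examples; real rows would be (text, label) pairs.
full_train = list(range(5036))
small_train = subsample(full_train, 0.1)
```

The validation and test sets stay at their full size (630 each), so results remain comparable with the run on the complete training set.
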
Download the pretrained language model from here.
The tokenizer was trained using Google's SentencePiece.
Download the trained model and vocabulary from here.
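
Once the tokenizer model is downloaded, it can be loaded with the `sentencepiece` Python package. The sketch below is a minimal example under that assumption; the helper names and the file name `malayalam_tokenizer.model` are hypothetical, so substitute the path of the file you actually downloaded.

```python
def load_tokenizer(model_path):
    """Load a trained SentencePiece model from `model_path`."""
    # Imported here so the module can be read without the package installed.
    # Third-party dependency: pip install sentencepiece
    import sentencepiece as spm

    sp = spm.SentencePieceProcessor()
    sp.Load(model_path)
    return sp


def tokenize(sp, text):
    """Split `text` into subword pieces using a loaded processor."""
    return sp.EncodeAsPieces(text)


# Example usage (hypothetical file name):
#   sp = load_tokenizer("malayalam_tokenizer.model")
#   pieces = tokenize(sp, "മലയാളം")
```

The vocabulary file that ships with the model lists each subword piece with its score; only the `.model` file is needed to tokenize text.
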