Bahasa Melayu Natural Language Processing (MelayuNLP) Resource

A collection of Bahasa Malaysia (Malay) Natural Language Processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.

Bahasa Melayu NLP Libraries/Services

Natural Language Toolkit

| Library | Description | Programming Languages | License | Author & Link |
| --- | --- | --- | --- | --- |
| Malaya | Natural-Language-Toolkit for Bahasa Malaysia | iPython | MIT License (MIT) | DevconX |

Natural Language Pipeline

| Library | Description | Programming Languages | License | Author & Link |
| --- | --- | --- | --- | --- |
| polyglot | Polyglot is a natural language pipeline that supports massive multilingual applications such as Transliteration, NER, Sentiment Analysis, Morphological Analysis | Python | GPLv3 | aboSamoor |

Part of Speech Tagging (POS Tagging)

| API | Description | Programming Languages | License | Guide & Link |
| --- | --- | --- | --- | --- |
| Malay NLP | Frequency Based and Max-ent POS Taggers | | | Malay NLP Blog |

Morphology Analysis

| Library | Description | Programming Languages | License | Author & Link |
| --- | --- | --- | --- | --- |
| hltdi-morphology | Mirror Repository for ParaMorfo, HornMorpho, AntiMorfo, and MorfoMelayu | | | LowResourceLanguages |

Dictionaries / Translation Pairs / Parallel Corpus

| Library | Description | Size | Features | License | Link |
| --- | --- | --- | --- | --- | --- |
| MALINDO_Morph | Morphological dictionary for Malay / Indonesian | | English-Malay, English-Indonesian | CC BY-NC-SA 4.0 | TH english |
| TALPCo | The TUFS Asian Language Parallel Corpus | | Japanese -> Malay | Creative Commons Attribution 4.0 International (CC BY 4.0) license | matbahasa |
| Open Parallel Corpus | OPUS is a growing collection of translated texts from the web. | | Malay <-> Many languages | Modified BSD License | OPUS |

Pre-trained Word Vectors

| Pre-trained Model | Description | Size | Dimensions | License | Link |
| --- | --- | --- | --- | --- | --- |
| fastText | Skip-Gram model trained on Wikipedia using fastText | | 300 | CC BY-SA 3.0 | Facebook + Bin & Text + Text Only |
| wordvectors | Pre-trained word vectors of 30+ languages | 173MB | 100 | MIT License | Kyubyong |
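
The fastText entry above offers a text-only download. As a rough sketch of how such a file could be read without extra dependencies, the snippet below parses the fastText text layout (a header line with the vocabulary size and dimension, then one word and its values per line) and compares words by cosine similarity. The helper names `load_vec` and `cosine`, and the tiny 3-dimensional sample vectors, are made up for illustration; the real Malay vectors have 300 dimensions, and the wordvectors files may use a different layout.

```python
import io
import math
from typing import Dict, Iterable, List


def load_vec(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse fastText-style text vectors: 'count dim' header, then 'word v1 ... vdim'."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # second header field is the vector dimension
    vectors: Dict[str, List[float]] = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed or truncated lines
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy data standing in for a real .vec file.
SAMPLE = (
    "3 3\n"
    "makan 1.0 0.0 0.0\n"
    "minum 0.9 0.1 0.0\n"
    "kereta 0.0 0.0 1.0\n"
)

vectors = load_vec(io.StringIO(SAMPLE))
sim_close = cosine(vectors["makan"], vectors["minum"])
sim_far = cosine(vectors["makan"], vectors["kereta"])
print(f"makan~minum: {sim_close:.3f}, makan~kereta: {sim_far:.3f}")
```

For a downloaded file, the same function should accept an open file handle (opened with `encoding="utf-8"`) in place of the `io.StringIO` sample. Loading the full vocabulary into plain Python lists is memory-hungry, so a dedicated library is likely the better choice beyond quick experiments.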

Not found? Try this.

Malay is currently a low-resource language with few NLP resources available. Because of its close resemblance to Bahasa Indonesia, resources built for Bahasa Indonesia may also be useful. If you're looking for a place to start, here is a great resource: https://github.com/keyreply/Bahasa-Indo-NLP-Dataset