Bahasa Melayu Natural Language Processing (MelayuNLP) Resource

A collection of Bahasa Malaysia (Malay) natural language processing (NLP) software libraries, dictionaries, and corpora. Pull requests are always welcome.

Bahasa Melayu NLP Libraries/Services

Natural Language Toolkit

Library	Description	Programming Languages	License	Author & Link
Malaya	Natural-Language-Toolkit for Bahasa Malaysia	Python	MIT License (MIT)	DevconX

Natural Language Pipeline

Library	Description	Programming Languages	License	Author & Link
polyglot	Polyglot is a natural language pipeline that supports massive multilingual applications such as Transliteration, NER, Sentiment Analysis, Morphological Analysis	Python	GPLv3	aboSamoor

Part of Speech Tagging (POS Tagging)

API	Description	Programming Languages	License	Guide & Link
Malay NLP	Frequency Based and Max-ent POS Taggers			Malay NLP Blog
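
To make the "frequency based" idea concrete, here is a minimal sketch of a most-frequent-tag baseline tagger. It is not the Malay NLP implementation: the training sentences, the tag names, and the function names `train_frequency_tagger` and `tag` are all invented for illustration, and unknown words simply fall back to an assumed default tag.

```python
from collections import Counter, defaultdict

# Toy hand-tagged sentences, invented for illustration only (not a real corpus).
TRAIN = [
    [("saya", "PRON"), ("makan", "VERB"), ("nasi", "NOUN")],
    [("dia", "PRON"), ("minum", "VERB"), ("air", "NOUN")],
    [("saya", "PRON"), ("minum", "VERB"), ("kopi", "NOUN")],
]


def train_frequency_tagger(sentences):
    """Map each lowercased word to the tag it carries most often in training."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag_label in sentence:
            counts[word.lower()][tag_label] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(tokens, model, default_tag="NOUN"):
    """Tag tokens by lookup; unseen words get default_tag (an assumption)."""
    return [(token, model.get(token.lower(), default_tag)) for token in tokens]


model = train_frequency_tagger(TRAIN)
tagged = tag("dia makan roti".split(), model)
print(tagged)
```

A baseline like this ignores context entirely, which is the gap a maximum-entropy tagger is meant to close by conditioning on neighbouring words and affixes.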

Morphology Analysis

Library	Description	Programming Languages	License	Author & Link
hltdi-morphology	Mirror Repository for ParaMorfo, HornMorpho, AntiMorfo, and MorfoMelayu			LowResourceLanguages

Dictionaries / Translation Pairs / Parallel Corpus

Library	Description	Features	License	Link
MALINDO_Morph	Morphological dictionary for Malay / Indonesian	English-Malay, English-Indonesian	CC BY-NC-SA 4.0 TH	english
TALPCo	The TUFS Asian Language Parallel Corpus	Japanese -> Malay	Creative Commons Attribution 4.0 International (CC BY 4.0) license	matbahasa
Open Parallel Corpus	OPUS is a growing collection of translated texts from the web.	Malay <-> Many languages	Modified BSD License	OPUS
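
Parallel corpora are often distributed as two plain-text files whose lines are aligned one to one. The sketch below assumes that layout and pairs the lines up; the helper name `read_parallel` and the sample sentences are hypothetical, and each corpus listed above should be checked for its actual file format before reusing this.

```python
import io


def read_parallel(src_handle, tgt_handle):
    """Pair line-aligned source and target sentences, skipping empty pairs."""
    pairs = []
    for src, tgt in zip(src_handle, tgt_handle):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# In-memory stand-ins for two aligned files (illustrative sentences only).
malay_file = io.StringIO("Selamat pagi.\nTerima kasih.\n\n")
english_file = io.StringIO("Good morning.\nThank you.\n\n")

pairs = read_parallel(malay_file, english_file)
for malay, english in pairs:
    print(malay, "->", english)
```

With real files, the two `io.StringIO` objects would be replaced by `open(path, encoding="utf-8")` handles.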

Pre-trained Word Vectors

Pre-trained Model	Description	Size	Dimensions	License	Link
fastText	Skip-Gram model trained on Wikipedia using fastText		300	CC BY-SA 3.0	Facebook + Bin & Text + Text Only
wordvectors	Pre-trained word vectors of 30+ languages	173MB	100	MIT License	Kyubyong
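
As a rough illustration of how text-format vectors can be used, the sketch below parses the plain-text layout used by fastText `.vec` files (a header line with the vocabulary size and dimension, then one word and its values per line) and compares words by cosine similarity. The function names `load_vec` and `cosine` and the three-dimensional sample vectors are made up for the example; other downloads, including binary models, need a different loader.

```python
import io
import math


def load_vec(handle, limit=None):
    """Read text-format word vectors: header 'count dim', then 'word v1 ... vN'."""
    header = handle.readline().split()
    dim = int(header[1])
    vectors = {}
    for i, line in enumerate(handle):
        if limit is not None and i >= limit:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines rather than guess
        vectors[word] = [float(v) for v in values]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Tiny made-up sample standing in for a real .vec file.
SAMPLE = "3 3\nkucing 1.0 0.0 0.0\nanjing 0.9 0.1 0.0\nkereta 0.0 0.0 1.0\n"
vectors = load_vec(io.StringIO(SAMPLE))
print(cosine(vectors["kucing"], vectors["anjing"]))
print(cosine(vectors["kucing"], vectors["kereta"]))
```

For a real file, pass `open(path, encoding="utf-8")` and consider setting `limit`, since the full vocabulary can be large.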

Not found? Try this.

Malay is currently a low-resource language with few NLP resources available. Because it closely resembles Bahasa Indonesia, resources built for Bahasa Indonesia may also be useful. A good place to start is https://github.com/keyreply/Bahasa-Indo-NLP-Dataset
