Language

Newly formulated rules: https://docs.google.com/document/d/1yWi67PO7EUHgak7lNI8_U6JF-Typo9qGFQ4Op2t4vGs/edit#
Cross-check these rules with Keralapanineeyam and make sure the test coverage matches its examples.

Lexicon

Nouns with അ are taken from the ml.wiktionary.org category വർഗ്ഗം:മലയാളം നാമങ്ങൾ (Category: Malayalam nouns). The rest still need to be added.
Abbreviations are not well integrated yet.

TODO

Expand lexicon
More proper nouns (places, people). Where can these be sourced from?
Verbs
Documentation
Expand number handling to cover all edge cases.
Abbreviations
Standardize POS tags and document
Exception dictionary
Use clear, unambiguous variable names
Include #infl# in the vowel to vowel-sign context, as a possible tag at the word-joining position.
Prevent proper nouns such as place names and person names from agglutinating with other proper nouns.

Ideas

A grapheme-to-phoneme transcription utility; try extending it with stress markers
A date parser that accepts strings like നവമ്പർ മാസം പത്താം തീയതി and returns the date (Nov 10) in a machine-readable format.
The same for time expressions: വൈകീട്ട് മൂന്നരമണി (3:30 in the evening), രാവിലെ ആറേ മുക്കാൽ (6:45 in the morning), നാളെ രാത്രി പന്ത്രണ്ടരമണി (12:30 tomorrow night)...
IPA transliteration/transcription
Spell checker
Hyphenation
Syllabifier
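The date parser idea above could start from simple lookup tables. The following is only a minimal sketch under stated assumptions: the names MONTHS, ORDINALS, parse_malayalam_date and to_iso are hypothetical, the tables hold just a few illustrative entries, and the year has to be supplied by the caller because the example string carries none.

```ruby
require "date"

# Hypothetical lookup tables; only a few entries are filled in for illustration.
MONTHS = {
  "ജനുവരി" => 1,
  "നവമ്പർ" => 11,
  "നവംബർ" => 11,
  "ഡിസംബർ" => 12
}.freeze

ORDINALS = {
  "ഒന്നാം" => 1,
  "രണ്ടാം" => 2,
  "മൂന്നാം" => 3,
  "പത്താം" => 10
}.freeze

# Returns [month, day], or nil when the month or the day is not recognised.
def parse_malayalam_date(text)
  words = text.split
  month = words.map { |w| MONTHS[w] }.compact.first
  day = words.map { |w| ORDINALS[w] }.compact.first
  return nil unless month && day

  [month, day]
end

# Formats the parsed date as ISO 8601; the year is an assumption of the caller.
def to_iso(text, year)
  month, day = parse_malayalam_date(text)
  return nil unless month

  Date.new(year, month, day).iso8601
end

puts to_iso("നവമ്പർ മാസം പത്താം തീയതി", 2024)
```

A real implementation would probably reuse the analyser's number handling instead of a hand-written ordinal table, and would need to cope with inflected month and day words.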

Using with HFST

It can also be compiled with Helsinki Finite-State Transducer Technology (HFST). To compile using the HFST tools, set the FSTC environment variable to 'hfst'; a command like FSTC=hfst make should do the trick. The resulting transducer will be usable with the HFST tools.

Ruby

There is a Ruby wrapper for SFST: https://github.com/mlj/ruby-sfst. To install it, run gem install ruby-sfst. Sample code:

require "sfst"

# Load the compiled transducer and analyse a single word.
fst = SFST::RegularTransducer.new("malayalam.a")
analyses = fst.analyse("നീലത്താമര")
puts analyses

Running it with

$ ruby test.rb

gives

`നീല<n>താമര<n>`

Nodejs

No known bindings. Is it worth writing one from scratch?

Debugging

The SFST tools are useful for debugging. For example, to debug accusative.fst, it is better to test that file on its own, because recompiling the whole system after every modification is very time-consuming.

Add the following lines towards the end of the file.

$tests$ = മഴ<n><RB><accusative> | മുറ്റം<n><RB><accusative> |  കിളി<n><RB><accusative> | താൻ<prn><RB><accusative>
$tests$ || $accusative$ >> "accusative-test.a"

Compile the file using sfst: fst-compiler-utf8 accusative.fst accusative.a

Then list every string the FST can generate using fst-generate accusative.a. Make sure this list is correct and contains no spurious entries. Unwanted items in the output make compositions in other parts of the system slower.

After debugging, you can keep the test lines above in the FST file as comments.
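Checking the generated list by eye gets tedious as the number of test words grows. The sketch below shows one way to compare it with a list of expected strings. The helper name check_generated is hypothetical, the strings are placeholders, and the exact line format printed by fst-generate is not assumed here: each line is treated as an opaque string.

```ruby
# Compare generated strings with the forms we expect to see.
# Both lists are plain strings; blank lines and surrounding whitespace are ignored.
def check_generated(generated, expected)
  seen = generated.map(&:strip).reject(&:empty?)
  { missing: expected - seen, unexpected: seen - expected }
end

# Placeholder data; replace with real fst-generate output and expected forms.
generated = ["form-a\n", "form-b\n", "form-x\n"]
expected = ["form-a", "form-b", "form-c"]

report = check_generated(generated, expected)
puts "missing: #{report[:missing].inspect}"
puts "unexpected: #{report[:unexpected].inspect}"
```

In practice the generated list could be read with File.readlines from a file that the fst-generate output was redirected into.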

ദ്വിത്വസന്ധി (gemination sandhi)

Duplication is more complicated than what is implemented now. The current implementation is purely phonological, but duplication is actually conditional on the characteristics of the participating words.

The duplication of കചടതപശ happens when the first word is an adjective, so it should happen only at word + word boundaries.
Duplication also happens after demonstratives: അക്കാര്യം, ഇത്തല, ഇവ്വണ്ണം. In this case even weak consonants get duplicated: ഇ + വണ്ണം = ഇവ്വണ്ണം. Alternatively, elongation of the demonstrative is suggested for weak consonants: ഈവിധം.
In the case of ദ്വന്ദസമാസം (dvandva compounds), no duplication happens: ആനകുതിര.
Duplication happens after മുൻവിനയെച്ചം (past adverbial participles), ആധാരികാഭാസം, and the suffixes ഇൽ and കൽ. But is there no rule that explains why പ is not doubled in തന്നുപോയി?
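The first three conditions above can be sketched as a small decision function. This is only an illustration, not the project's implementation: the kind labels (:adjective, :demonstrative, :dvandva) and the function names are hypothetical, treating നീല as an adjectival first member is an assumption, and the elongation alternative (ഈവിധം) and the participle and suffix cases are not modelled.

```ruby
VIRAMA = "\u0D4D"

# Consonants the note lists as doubling after an adjective.
STRONG = %w[ക ച ട ത പ ശ].freeze

# True for Malayalam consonant letters (U+0D15 to U+0D39).
def consonant?(char)
  !char.nil? && char.ord.between?(0x0D15, 0x0D39)
end

# Doubles the initial consonant: consonant + virama + original word.
def geminate(word)
  word[0] + VIRAMA + word
end

# kind is a hypothetical label describing the first word or the compound type.
def join_words(first, second, kind)
  initial = second[0]
  case kind
  when :adjective
    STRONG.include?(initial) ? first + geminate(second) : first + second
  when :demonstrative
    consonant?(initial) ? first + geminate(second) : first + second
  else
    # :dvandva and anything unknown: plain concatenation, no doubling.
    first + second
  end
end

puts join_words("നീല", "താമര", :adjective)
puts join_words("ഇ", "വണ്ണം", :demonstrative)
puts join_words("ആന", "കുതിര", :dvandva)
```

In the FST itself, the same distinction would need a tag or boundary marker that tells the gemination rule what kind of word precedes the joining position.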
