Skip to content

Latest commit

 

History

History
19 lines (16 loc) · 1.48 KB

project.org

File metadata and controls

19 lines (16 loc) · 1.48 KB

Arabic Diacritization Project

Dataset:

The dataset used is this dataset. The dataset consists of 50000 lines of diacritized Arabic text of mainly scholarly nature.

Data Preprocessing:

The preprocessing goes through steps:

  1. Split the text into an undiacritized text and its diacritics. The diacritics of each word are mapped into an 8-bit Integer. Each sentence and its diacritics(Vector of 8-bit Integers) are then stored in a data structure called “DiacritizedText”. During this process, numbers and punctuation are removed.
  2. Split the sentence into words.
  3. Analysis of the words; We run CAMel’s morphological analysis tool on each word to obtain the morphological features. Along with morphological features, we also construct an array of encountered word stems, so each stem is mapped to an integer. The morphological features along with the index of the stem and the length of the word are put into a word_arr, an array of 64-bit Integers of length 19.

Improve analysis selection

The morphological analysis tool outputs several potential analyses. The current criteria is to pick the one with the highest prob of occuring in CAMel’s dataset. This is obviously false, especially considering context is not taken into account. A solution for this may mean creating a part-of-speech tagger.

Current model:

The current model looks for each word and its encountered diacritizations and chooses the most common one.