- Project Overview
- Installation
- Data Exploration and Visualization
- Implementation
- Results and Metrics
- Acknowledgements
Punctuation marks are symbols used to organize written text and make it clear and easy to read. A lack of punctuation can cause confusion and misunderstanding for the reader. This problem often arises with Automatic Speech Recognition (ASR), since ASR systems usually do not predict punctuation. Punctuation prediction is a Natural Language Processing (NLP) problem. The aim of this project is to build a sequence-to-sequence model to predict punctuation in Arabic texts.
- Python 3.x
- Python libraries (the third-party ones can be installed with pip, as shown below):
- matplotlib
- tensorflow
- sklearn
- pandas
- keras
- numpy
- string
- nltk
- tqdm
- time
- re
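Assuming a standard Python setup, the third-party libraries above can be installed with pip (string, time, and re are part of the Python standard library and need no installation; note that sklearn is installed as scikit-learn):

```bash
pip install matplotlib tensorflow scikit-learn pandas keras numpy nltk tqdm
```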
We use the Tashkeela dataset, which contains over 75 million Arabic words obtained from 97 books and includes different types of punctuation. We focus on five punctuation marks: the comma, the dot, the semicolon, the question mark, and the colon. Below is a sample text from the Al-Bahr Al-Muhit book.
وَقَالَ الْأُسْتَاذُ أَبُو مَنْصُورٍ : الْغَرَضُ مِنْ أُصُولِ الْفِقْهِ مَعْرِفَةُ أَدِلَّةِ أَحْكَامِ الْفِقْهِ ، وَمَعْرِفَةُ طُرُقِ الْأَدِلَّةِ ، لِأَنَّ مَنْ اسْتَقْرَأَ أَبْوَابَهُ وَجَدَهَا إمَّا دَلِيلًا عَلَى حُكْمٍ أَوْ طَرِيقًا يُتَوَصَّلُ بِهِ إلَى مَعْرِفَةِ الدَّلِيلِ ، وَذَلِكَ كَمَعْرِفَةِ النَّصِّ ، وَالْإِجْمَاعِ ، وَالْقِيَاسِ ، وَالْعِلَلِ ، وَالرُّجْحَانِ . وَهَذِهِ كُلُّهَا مَعْرِفَةٌ مُحِيطَةٌ بِالْأَدِلَّةِ الْمَنْصُوصَةِ عَلَى الْأَحْكَامِ . وَمَعْرِفَةُ الْأَخْبَارِ وَطُرُقِهَا مَعْرِفَةٌ بِالطُّرُقِ الْمُوَصِّلَةِ إلَى الدَّلَائِلِ الْمَنْصُوصَةِ عَلَى الْأَحْكَامِ .وَهَاهُنَا أُمُورٌ : أَحَدُهَا : أَنَّ الْأَسْمَاءَ الْمُسْتَعْمَلَةَ فِي هَذِهِ الْعُلُومِ . كَأُصُولِ الْفِقْهِ ، وَالْفِقْهِ ، وَالنَّحْوِ ، وَاللُّغَةِ ، وَالطِّبِّ . هَلْ هِيَ مَنْقُولَةٌ أَوْ لَا ؟
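As a minimal sketch of how the five target marks could be counted in the raw text during exploration (the file name `tashkeela_sample.txt` is a hypothetical placeholder, not part of the project):

```python
from collections import Counter

# The five target punctuation marks: Arabic comma, dot,
# Arabic semicolon, Arabic question mark, and colon.
MARKS = ['،', '.', '؛', '؟', ':']

def count_marks(text):
    """Count how often each target punctuation mark occurs in the text."""
    counts = Counter(ch for ch in text if ch in MARKS)
    return {mark: counts[mark] for mark in MARKS}

# Hypothetical sample file extracted from the Tashkeela dataset.
with open('tashkeela_sample.txt', encoding='utf-8') as f:
    print(count_marks(f.read()))
```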
In this project, we used a sequence-to-sequence (seq2seq) model, the same technique used for Neural Machine Translation (NMT) in the TensorFlow tutorial. The inputs pass through the encoder model, which gives us the encoder output and the encoder hidden state. Bahdanau attention is then applied over the encoder outputs at each decoding step.
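Below is a condensed sketch of the encoder and the attention layer, following the TensorFlow NMT tutorial that the model is based on; `vocab_size`, `embedding_dim`, and `units` are hypothetical hyperparameters, and the decoder (which consumes the context vector) is omitted for brevity:

```python
import tensorflow as tf

class Encoder(tf.keras.Model):
    """Encodes the input word sequence into outputs and a hidden state."""
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.gru = tf.keras.layers.GRU(units,
                                       return_sequences=True,
                                       return_state=True)

    def call(self, x, hidden):
        x = self.embedding(x)
        output, state = self.gru(x, initial_state=hidden)
        return output, state

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention over the encoder outputs (Bahdanau et al., 2015)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, query, values):
        # query: decoder hidden state, shape (batch, units)
        # values: encoder outputs, shape (batch, seq_len, units)
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(
            self.W1(query_with_time_axis) + self.W2(values)))
        attention_weights = tf.nn.softmax(score, axis=1)
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights
```

At each decoding step the decoder's hidden state serves as the query, so the model learns which parts of the input sentence to focus on when deciding each punctuation mark.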
We trained our model for 10 epochs with a batch size of 128, using the categorical cross-entropy loss function and the Adam optimizer; the figure below shows the prediction results for two examples. The model is evaluated with the Bilingual Evaluation Understudy (BLEU) score.
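As a sketch of how a BLEU score can be computed with nltk (already in the dependency list); the reference and prediction token lists here are hypothetical examples, not actual model output:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical reference and prediction, tokenized into words and marks.
reference = ['وَقَالَ', 'الْأُسْتَاذُ', 'أَبُو', 'مَنْصُورٍ', ':']
prediction = ['وَقَالَ', 'الْأُسْتَاذُ', 'أَبُو', 'مَنْصُورٍ', '،']

# sentence_bleu expects a list of reference token lists; smoothing
# avoids zero scores when short sentences lack higher-order n-gram matches.
smoothie = SmoothingFunction().method1
score = sentence_bleu([reference], prediction, smoothing_function=smoothie)
print(f'BLEU: {score:.3f}')
```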
I wish to thank the Tashkeela team for the dataset, and Udacity for their advice. For more details about this project, you can read this article.