In this repository you will find:
- a dataset (and the code to build it) pairing names written in Arabic characters with the same names in Latin characters (English),
- a Google Colab notebook to train a Neural Machine Translation (NMT) model based on seq2seq. The objective of this model is to transliterate names from the Arabic alphabet to the Latin alphabet. This task is also called romanization.
The model is trained on Google Colab, which provides a free GPU.
The model is based on the TensorFlow tutorial NMT with attention.
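A seq2seq model for transliteration typically works on characters rather than words. The sketch below shows one way to turn names into padded integer sequences before feeding them to an encoder or decoder. It assumes character-level tokenization; the token names `<pad>`, `<s>`, `</s>` and the helpers `build_vocab` and `encode` are hypothetical and are not taken from the notebook.

```python
# Hypothetical special tokens (assumption, not taken from the notebook).
PAD, START, END = "<pad>", "<s>", "</s>"


def build_vocab(names):
    """Map every character seen in `names` to an integer id."""
    chars = sorted({ch for name in names for ch in name})
    tokens = [PAD, START, END] + chars
    return {token: idx for idx, token in enumerate(tokens)}


def encode(name, vocab, max_len):
    """Wrap a name in start/end markers and pad it to `max_len` ids."""
    ids = [vocab[START]] + [vocab[ch] for ch in name] + [vocab[END]]
    if len(ids) > max_len:
        raise ValueError("name is longer than max_len allows")
    return ids + [vocab[PAD]] * (max_len - len(ids))


latin_vocab = build_vocab(["adel"])
encoded = encode("adel", latin_vocab, max_len=8)
```

In practice two vocabularies would be built, one for Arabic input characters and one for Latin output characters.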
We use 3 datasets:
- Google transliteration data. Example: عادل; adel
- ANETAC dataset. Example: PERSON; Adel; اديل. For this file we keep only the PERSON entries,
- NETransliteration COLING 2018.
These 3 datasets are combined into a clean dataset containing names in Arabic and the corresponding names in the Latin alphabet (English).
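The merging step can be sketched as follows for the two sources whose examples are shown above. This is a minimal illustration, not the repository's actual code: it assumes the semicolon-separated layout of the examples, lowercases the Latin side so duplicates collapse, and uses hypothetical helper names (`parse_google_line`, `parse_anetac_line`, `build_pairs`).

```python
def parse_google_line(line):
    """Parse 'arabic; latin' into an (arabic, latin) pair."""
    arabic, latin = (part.strip() for part in line.split(";"))
    return arabic, latin.lower()  # lowercasing is an assumption


def parse_anetac_line(line):
    """Parse 'TYPE; latin; arabic' and keep PERSON entries only."""
    entity_type, latin, arabic = (part.strip() for part in line.split(";"))
    if entity_type != "PERSON":
        return None
    return arabic, latin.lower()


def build_pairs(google_lines, anetac_lines):
    """Merge both sources into a de-duplicated list of (arabic, latin) pairs."""
    pairs = [parse_google_line(line) for line in google_lines]
    for line in anetac_lines:
        pair = parse_anetac_line(line)
        if pair is not None:
            pairs.append(pair)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(pairs))


pairs = build_pairs(
    ["عادل; adel"],
    ["PERSON; Adel; اديل", "LOCATION; Cairo; القاهرة"],
)
```

Here the LOCATION row is dropped and the two PERSON spellings are kept as separate training pairs.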
A pre-trained model (Arabic to Latin characters) is stored on Dropbox.
A Jupyter notebook is provided to train the model used for transliteration.
A Streamlit app is provided. You can find a deployed version here.
Install the library:
python setup.py install
Available commands:
- get-data: Get data from the 3 sources to build a training dataset.
- get-pretrained-model: Download the pre-trained model for the task.
- train-nmt-model: Train an NMT model.
- transliterate-name: Transliterate a name from Arabic to Latin characters.
Please refer to the environment.yml file for the conda environment.
To create the environment with conda:
conda env create -f environment.yml