A Natural Language Processing (NLP) pipeline for large Arabic text corpora, covering tokenization, stop word removal, dictionary verification, stemming, lemmatization, named entity recognition (NER), and the creation of word/sequence matrices for predictive text applications.
- Merges thousands of text files into a single corpus.
- Tokenizes Arabic text and removes stop words (see the sketch after this list).
- Verifies tokens against a custom Arabic dictionary.
- Applies stemming and lemmatization.
- Performs Named Entity Recognition (NER) using Farasa.
- Saves vocabulary and generates word and sequence matrices for next-word/sequence prediction.
- Provides a Flask API endpoint for sequence prediction.
- Includes a simple web interface for user interaction.
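A minimal sketch of the tokenization and stop-word step referenced above, assuming stop_arabic.txt holds one stop word per line; the clean_tokens helper is illustrative, not the exact function in pp.py:

```python
# Hedged sketch: tokenize Arabic text with NLTK and drop stop words read
# from stop_arabic.txt (one-word-per-line format is an assumption).
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer models used by word_tokenize

def clean_tokens(text, stopword_path="stop_arabic.txt"):
    with open(stopword_path, encoding="utf-8") as f:
        stop_words = {line.strip() for line in f if line.strip()}
    tokens = word_tokenize(text)
    return [t for t in tokens if t not in stop_words]
```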
- Python 3.8+
- NLTK
- Farasa Segmenter & NER
- Flask
- flask-cors
Install dependencies:
```bash
pip install nltk flask flask-cors farasa
```
Place your raw .txt files in the Sports directory.
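The pipeline then merges these files into a single corpus (see Features). A minimal sketch, where the merge_corpus helper and the corpus.txt output name are illustrative assumptions:

```python
# Hedged sketch: concatenate every .txt file under Sports/ into one corpus file.
from pathlib import Path

def merge_corpus(src_dir="Sports", dest="corpus.txt"):
    with open(dest, "w", encoding="utf-8") as out:
        for path in sorted(Path(src_dir).glob("*.txt")):
            out.write(path.read_text(encoding="utf-8"))
            out.write("\n")  # keep file boundaries on separate lines
```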
Process the corpus and generate vocabulary and matrices:
```bash
python pp.py
```
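The exact matrix layout pp.py writes is not specified in this README; a hedged sketch of one common approach, counting bigram frequencies into a {word: {next_word: count}} mapping:

```python
# Hedged sketch: build a next-word frequency matrix from a token stream.
# The nested-dict JSON layout is an assumption, not necessarily what
# pp.py writes to word_matrix.json.
import json
from collections import defaultdict

def build_word_matrix(tokens, out_path="word_matrix.json"):
    matrix = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(tokens, tokens[1:]):
        matrix[current][nxt] += 1  # count how often nxt follows current
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump({w: dict(c) for w, c in matrix.items()}, f, ensure_ascii=False)
```

A sequence matrix can be built the same way by counting longer n-grams instead of single next words.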
Serve the prediction API and web interface:
```bash
python appp.py
```
Visit http://localhost:5000 in your browser.
Use the web interface or send a POST request to /predict with a JSON body:
{ "word": "YOUR_ARABIC_WORD" }-
Output files:
- vocabulaire.txt: Filtered vocabulary (named entities).
- word_matrix.json: Next-word prediction matrix (see the lookup sketch after this list).
- sequence_matrix.json: Next-sequence prediction matrix.
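A minimal sketch of consuming word_matrix.json for prediction, assuming the {word: {next_word: count}} layout sketched earlier:

```python
# Hedged sketch: load the word matrix and return the k most frequent
# next words for a given word.
import json

with open("word_matrix.json", encoding="utf-8") as f:
    word_matrix = json.load(f)

def predict_next(word, k=3):
    follows = word_matrix.get(word, {})  # assumed {next_word: count}
    return sorted(follows, key=follows.get, reverse=True)[:k]
```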
Notes:
- Ensure dictt.txt and stop_arabic.txt are present and properly formatted.
- Farasa requires Java; see the Farasa documentation for setup (a hedged NER sketch follows this list).
- For large corpora, adjust the number of files processed in pp.py as needed.
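For reference, a sketch of NER using the farasapy wrapper (installed via pip install farasapy); this wrapper is an assumption, and the exact Farasa interface used by pp.py may differ:

```python
# Hedged sketch: named entity recognition via the farasapy wrapper
# (an assumption; this README does not show pp.py's Farasa calls).
from farasa.ner import FarasaNamedEntityRecognizer

ner = FarasaNamedEntityRecognizer()  # requires Java on the PATH
tagged = ner.recognize("سافر محمد إلى الدوحة")  # returns the text with NER tags attached
print(tagged)
```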
This project is for educational and research purposes.