Before running notebooks, we first need to download all the data we will be using.
As always, the first step is to clone the repository:
>> git clone https://github.com/JackShen1/sensus.git
The training datasets currently include 1,000 positive and 1,000 negative book reviews. The data was originally taken from a large dataset of Amazon reviews, which you can download here. The book reviews were then translated into Ukrainian with Google Translate and lightly edited by me. The raw reviews can be found in the data/ folder.
Since the NLTK library has no support for the Ukrainian language, we take a different path. The most complete list of Ukrainian stop words we found is available here, and it is the one used in this project.
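As a rough illustration of how such a list can be applied, here is a minimal sketch of stop-word filtering. It assumes the list is stored as a UTF-8 text file with one word per line. The helper names `load_stop_words` and `remove_stop_words` are hypothetical and are not taken from the notebooks.

```python
from pathlib import Path


def load_stop_words(path):
    """Read one stop word per line from a UTF-8 text file (assumed format)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens whose lowercase form is not a stop word."""
    return [token for token in tokens if token.lower() not in stop_words]


# Tiny in-memory example; the real set would come from load_stop_words(...).
sample_stop_words = {"це", "і", "на", "не"}
tokens = ["Це", "дуже", "цікава", "книга", "і", "не", "нудна"]
filtered = remove_stop_words(tokens, sample_stop_words)
print(filtered)
```

Keeping the stop words in a set makes each membership check constant time, which matters once the filter runs over every token of every review.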
At the processing stage (part 1), a stemmer was also used for comparison. Ideally we would use PorterStemmer from nltk.stem, but it does not support Ukrainian. This is not a problem, because writing your own Porter stemmer implementation is not that difficult, so we wrote one in Python based on this PHP code.
The last thing we need to download is a Word2Vec model. For simplicity, we use a pretrained Word2Vec model of Ukrainian word vectors, each with 300 dimensions. We chose the lemmatized version of this model because it fits well with the sample we already processed in part 1. The model can be found on this website. After downloading, unpack the bz2 archive (~1 GB), for example using this application.
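If you prefer not to install a separate archiver, the archive can also be unpacked from Python. The sketch below uses only the standard library for the decompression step. The helper names `decompress_bz2` and `load_vectors` are hypothetical, and whether the downloaded file is in the text or binary word2vec format is an assumption you should check, which is why `binary` is left as a parameter.

```python
import bz2
import shutil
from pathlib import Path


def decompress_bz2(archive_path, output_path=None):
    """Stream-decompress a .bz2 file to disk and return the output path."""
    archive_path = Path(archive_path)
    if output_path is None:
        if archive_path.suffix != ".bz2":
            raise ValueError("pass output_path for files not ending in .bz2")
        # Drop only the trailing ".bz2", e.g. "model.txt.bz2" -> "model.txt".
        output_path = archive_path.with_suffix("")
    output_path = Path(output_path)
    with bz2.open(archive_path, "rb") as src, open(output_path, "wb") as dst:
        # Copies in chunks, so the whole file is never held in memory.
        shutil.copyfileobj(src, dst)
    return output_path


def load_vectors(model_path, binary=False):
    """Load vectors stored in the word2vec format (format is an assumption)."""
    from gensim.models import KeyedVectors  # lazy import; requires Gensim

    return KeyedVectors.load_word2vec_format(str(model_path), binary=binary)
```

A typical call would be `load_vectors(decompress_bz2("path/to/model.bz2"))`, where the path is a placeholder for wherever you saved the download.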
In order to run the iPython notebook, you'll need Python (v3.6+) and the following libraries:
- Keras (v2.4+)
- Gensim (v3.8+)
- Pandas (v1.2+)
- NumPy (v1.19.5+)
- NLTK (v3.5+)
- python-decouple (v3.4+)
- pymorphy2-dicts-uk (v2.4.1+)
- pymorphy2 (v0.9+)
- scikit-learn (v0.24.1)
- SciPy (v0.19.1+)
- Matplotlib (v2.1.1+)
- Jupyter
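Once the libraries are installed, a quick way to confirm that the environment matches the list above is to compare the installed versions against these minimums. The following is only a sketch: it relies on `importlib.metadata`, which is in the standard library from Python 3.8 onward, it uses a simplified numeric version comparison, and the helper names are hypothetical.

```python
import re
from importlib import metadata  # in the standard library since Python 3.8

# Minimum versions copied from the list above (names as published on PyPI).
# scikit-learn is listed without a "+", so treating it as a minimum is an
# assumption.
MINIMUM_VERSIONS = {
    "keras": "2.4",
    "gensim": "3.8",
    "pandas": "1.2",
    "numpy": "1.19.5",
    "nltk": "3.5",
    "python-decouple": "3.4",
    "pymorphy2-dicts-uk": "2.4.1",
    "pymorphy2": "0.9",
    "scikit-learn": "0.24.1",
    "scipy": "0.19.1",
    "matplotlib": "2.1.1",
}


def version_tuple(version):
    """Turn '1.19.5' into (1, 19, 5); stops at the first non-numeric part."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two version strings numerically, padding with zeros."""
    have, need = version_tuple(installed), version_tuple(minimum)
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need


def check_requirements(minimums):
    """Return a {package: status} report for every required package."""
    report = {}
    for name, minimum in minimums.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        status = "ok" if meets_minimum(installed, minimum) else "older than " + minimum
        report[name] = installed + " (" + status + ")"
    return report


if __name__ == "__main__":
    for name, status in check_requirements(MINIMUM_VERSIONS).items():
        print(name + ": " + status)
```

The comparison deliberately ignores pre-release suffixes such as `rc1`, which is enough for a sanity check but not a full implementation of Python's version ordering rules.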
The commands for installing these libraries will follow. First, let's create a virtual environment.
The easiest way to install Keras, Gensim, NumPy, Jupyter, matplotlib and our other libraries is to start with the Anaconda Python distribution.
- Select your OS and follow the installation instructions for Anaconda Python. We recommend using Python 3.6+ (64-bit).
- Install the Python development environment on your system:
>> pip install -U pip virtualenv
- If you haven't done so already, download and unzip this entire repository from GitHub:
>> git clone https://github.com/JackShen1/sensus.git
- Use cd to navigate into the top directory of the repo on your machine.
- Open Anaconda Prompt, install JupyterLab, then create and activate the virtual environment:
>> conda install -c conda-forge jupyterlab # install JupyterLab
>> conda create -n sensus pip python=3.7 # choose the Python version
>> conda activate sensus # activate the virtual environment
Alternatively, you can install Jupyter with pip:
>> pip install jupyterlab
- Now we can install all the libraries we need:
>> pip install Keras gensim pandas numpy nltk python-decouple scikit-learn scipy matplotlib pymorphy2
>> pip install -U pymorphy2-dicts-uk # dictionary for the Ukrainian language
- Launch Jupyter by entering:
>> jupyter notebook
Once everything is installed, do the following to start working again the next time:
- Open Anaconda Prompt and enter the project folder with the cd command. Now enter the following commands:
>> conda activate sensus
>> jupyter notebook
This project describes, in 3 parts, the whole process of data preparation and model training, and carries out a comparative analysis of classifiers and various models. Each stage is accompanied by data visualization. The results are good for such small datasets with a not very accurate translation. In the future, I will expand the datasets and correct the translation. In everything else, the project works perfectly and can be easily adapted to English or Russian. Read the detailed description in the notebooks.