Predicts a product's primary category from its description.
This is a multi-class classification problem.
First, the primary product category is extracted from the product category tree.
Extra characters such as '.' are then removed, along with any numbers.
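The extraction and cleaning step can be sketched as follows. This is a minimal illustration, not the project's code: the `["Top >> Middle >> Leaf"]` layout of the category tree, the function name `primary_category`, and the choice to drop every non-letter character (not only '.' and digits) are all assumptions.

```python
import re

def primary_category(category_tree: str) -> str:
    """Return the cleaned top-level category from a category-tree string."""
    # Assumed layout: '["Top >> Middle >> Leaf"]'; the first segment is the primary category.
    first = category_tree.split(">>")[0]
    # Keep only letters and spaces, which drops brackets, quotes, '.', and digits.
    cleaned = re.sub(r"[^A-Za-z ]+", " ", first)
    # Collapse the whitespace runs left behind by the removals.
    return " ".join(cleaned.split())

print(primary_category('["Home Furnishing >> Curtains & Accessories >> Curtains"]'))
```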
According to the plot, the data is highly imbalanced.
The top 5 categories are considered for prediction.
The rest of the categories are dropped for this analysis.
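Keeping only the most frequent categories could look like the sketch below. It uses plain Python on `(description, category)` pairs; the helper name `keep_top_categories` and the toy rows are hypothetical and stand in for whatever data structure the project actually uses.

```python
from collections import Counter

def keep_top_categories(rows, k=5):
    """Keep only the rows whose category is among the k most frequent."""
    counts = Counter(category for _, category in rows)
    top = {category for category, _ in counts.most_common(k)}
    return [(text, category) for text, category in rows if category in top]

# Toy rows for illustration only (not the project's data).
rows = [
    ("cotton shirt", "Clothing"),
    ("denim jeans", "Clothing"),
    ("wool scarf", "Clothing"),
    ("gold ring", "Jewellery"),
    ("silver chain", "Jewellery"),
    ("usb cable", "Computers"),
]
filtered = keep_top_categories(rows, k=2)
print(len(filtered), sorted({category for _, category in filtered}))
```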
A sentence cannot be used directly for classification, so it has to be tokenized first.
Each sentence (here, a description) is converted to an integer matrix of tokens.
This is done for the training and testing descriptions.
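The source does not name the tokenizer, so the following is only a plain-Python sketch of one common reading of "integer matrix of tokens": a bag-of-words count matrix. The function names are hypothetical, and the vocabulary is built from the training descriptions only, which is an assumption. scikit-learn's `CountVectorizer` produces the same kind of matrix.

```python
import re

def tokenize(text):
    # Lowercase the text and pull out runs of letters as tokens.
    return re.findall(r"[a-z]+", text.lower())

def fit_vocabulary(texts):
    """Map each distinct training token to a column index."""
    vocab = {}
    for text in texts:
        for token in tokenize(text):
            vocab.setdefault(token, len(vocab))
    return vocab

def to_count_matrix(texts, vocab):
    """One row per text, one column per vocabulary token; cells hold counts."""
    matrix = []
    for text in texts:
        row = [0] * len(vocab)
        for token in tokenize(text):
            if token in vocab:  # tokens unseen during training are ignored
                row[vocab[token]] += 1
        matrix.append(row)
    return matrix

train_texts = ["blue cotton shirt", "cotton cotton towel"]
test_texts = ["blue silk towel"]
vocab = fit_vocabulary(train_texts)
X_train = to_count_matrix(train_texts, vocab)
X_test = to_count_matrix(test_texts, vocab)
print(X_train, X_test)
```

The test descriptions reuse the training vocabulary so that both matrices have the same columns.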
The dataset is split: 80% is used for training and 20% for testing.
Multinomial Naive Bayes and Linear Support Vector Machines are used.
They reach accuracies of 99.27% and 99.84%, respectively.
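A minimal scikit-learn sketch of the split and the two classifiers is shown below. The toy descriptions and labels are made up, the `random_state` and `stratify` settings are assumptions, and splitting before fitting the vectorizer is one reasonable order rather than necessarily the one used here. The numbers it prints have nothing to do with the accuracies reported above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-in data (not the project's dataset).
descriptions = [
    "blue cotton shirt", "slim denim jeans", "wool winter scarf",
    "printed cotton kurta", "casual linen shirt",
    "gold plated ring", "silver chain necklace", "diamond stud earrings",
    "pearl bead bracelet", "gold pendant necklace",
]
labels = ["Clothing"] * 5 + ["Jewellery"] * 5

# 80/20 split; stratify keeps both classes in the small test set.
train_texts, test_texts, y_train, y_test = train_test_split(
    descriptions, labels, test_size=0.2, random_state=0, stratify=labels
)

# Fit the vocabulary on the training descriptions, then reuse it for the test set.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

scores = {}
for name, model in [("MultinomialNB", MultinomialNB()), ("LinearSVC", LinearSVC())]:
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2%}")
```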
Use LSTMs and GRUs.
Make use of more features.
$ git clone "https://github.com/Yukti-09/Predictions-From-Descriptions.git"