Student project for the Machine Learning course, Master's degree studies, University of Belgrade, Faculty of Mathematics.
The project is inspired by CS229 Student Final Project:
Tag Prediction from Stack Overflow Questions - Jalal Buckley, Kevin Fuhs, Reid M. Whitaker (poster, report)
The task and the data come from the Kaggle competition Facebook Recruiting III - Keyword Extraction.
Task description (Description from competition overview):
The task is to predict the tags (a.k.a. keywords, topics, summaries), given only the question text and its title. The dataset contains content from disparate Stack Exchange sites, with a mix of both technical and non-technical questions.
Data description (Description from competition overview):
All of the data is in 2 files: Train and Test. Train.csv contains 4 columns: Id, Title, Body, Tags.
- Id - Unique identifier for each question
- Title - The question's title
- Body - The body of the question
- Tags - The tags associated with the question (all lowercase, should not contain tabs '\t' or ampersands '&')
Test.csv contains the same columns but without the Tags, which you are to predict.
The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (such as closed questions).
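As an illustration of the file layout described above, here is a minimal sketch of reading a Train.csv-style file with Python's standard `csv` module. The helper name `read_questions` and the sample rows are hypothetical, and the sketch assumes that tags within the Tags column are separated by spaces; it is not the loading code used in the project.

```python
import csv
import io


def read_questions(handle):
    """Yield one dict per question from a Train.csv-style file handle."""
    reader = csv.DictReader(handle)
    for row in reader:
        yield {
            "id": int(row["Id"]),
            "title": row["Title"],
            "body": row["Body"],
            # Assumption: tags are separated by spaces.
            # Test.csv has no Tags column, so this falls back to an empty list.
            "tags": (row.get("Tags") or "").split(),
        }


# Hypothetical sample rows in the same column layout as Train.csv.
SAMPLE = io.StringIO(
    'Id,Title,Body,Tags\n'
    '1,"How to sort a list?","<p>I have a list, and I want it sorted.</p>","python sorting"\n'
    '2,"What is a prime number?","<p>Definition please.</p>","number-theory"\n'
)

questions = list(read_questions(SAMPLE))
for question in questions:
    print(question["id"], question["title"], question["tags"])
```

For the real files, the same function could be called on `open("Train.csv", newline="", encoding="utf-8")`; the encoding is an assumption.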
- Implementation is in Python 3.7.
- Code was written using Google Colab, Anaconda, and Jupyter Notebook.
- The scikit-multilearn library was used for multi-label classification.
- Other packages and libraries used: Matplotlib, NumPy, pandas, Beautiful Soup, NLTK, SciPy, scikit-learn, Keras.
- Markdown editor for README: StackEdit
- We used an unacceptably small subset of the data because of limited hardware. In the future, the models should be trained on a larger dataset and the results compared.
- In the future, we can try more complex NLP word embeddings and models.
- The Heuristic turned out to perform very well, so we will keep using it. The tags of a single question are related, so once some tags are known it is easier to predict the rest. The Heuristic could therefore serve as a starting point, to be extended with ML.
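The remark about relations between tags can be illustrated with a small co-occurrence sketch. This is not the project's Heuristic, whose details are not given here; it is only one simple way to use tag relations, with hypothetical function names and made-up training tags. It counts how often pairs of tags appear on the same question and uses those counts to suggest additional tags once some are known.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(tag_lists):
    """Count, for every tag, how often each other tag appears on the same question."""
    counts = defaultdict(Counter)
    for tags in tag_lists:
        for a, b in permutations(set(tags), 2):
            counts[a][b] += 1
    return counts


def suggest_more_tags(known_tags, counts, top_n=2):
    """Rank tags not yet assigned by how often they co-occur with the known tags."""
    scores = Counter()
    for tag in known_tags:
        for other, n in counts.get(tag, Counter()).items():
            if other not in known_tags:
                scores[other] += n
    return [tag for tag, _ in scores.most_common(top_n)]


# Made-up tag lists standing in for the Tags column of the training data.
TRAIN_TAGS = [
    ["python", "pandas", "dataframe"],
    ["python", "pandas"],
    ["python", "numpy"],
    ["java", "spring"],
]

counts = build_cooccurrence(TRAIN_TAGS)
suggestions = suggest_more_tags(["pandas"], counts)
print(suggestions)
```

In a combined setup, the known tags could come from a classifier's most confident predictions, with the co-occurrence counts used to propose further candidates.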
Our trained models and vectorizers are available here: Google Drive Shared Folder
- Inspiration for the project: "Tag Prediction from Stack Overflow Questions" - Jalal Buckley, Kevin Fuhs, Reid M. Whitaker (poster, report)
- Dataset: Facebook Recruiting III - Keyword Extraction