Tag Prediction from Stack Overflow Questions

Student project for the Machine Learning course, Master's degree studies, Faculty of Mathematics, University of Belgrade

About The Project

The project is inspired by a CS229 student final project:

Tag Prediction from Stack Overflow Questions - Jalal Buckley, Kevin Fuhs, Reid M. Whitaker (poster, report)

The task and the data come from the Kaggle competition Facebook Recruiting III - Keyword Extraction.

Task description (Description from competition overview):

The task is to predict the tags (a.k.a. keywords, topics, summaries), given only the question text and its title. The dataset contains content from disparate stack exchange sites, containing a mix of both technical and non-technical questions.

Data description (Description from competition overview):

All of the data is in 2 files: Train and Test. Train.csv contains 4 columns: Id, Title, Body, Tags

Id - Unique identifier for each question
Title - The question's title
Body - The body of the question
Tags - The tags associated with the question (all lowercase, should not contain tabs '\t' or ampersands '&')
Test.csv contains the same columns but without the Tags, which you are to predict.

The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (such as closed questions).
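
The column layout described above can be read with Python's standard csv module. The following is a minimal sketch, not code from the project notebooks: the sample rows are hypothetical, the helper name read_questions is made up for illustration, and it assumes that the tags of one question are separated by spaces.

```python
import csv
import io

# Hypothetical sample rows in the Train.csv layout (Id, Title, Body, Tags).
SAMPLE_CSV = '''Id,Title,Body,Tags
1,"How to sort a list?","<p>I have a list of numbers.</p>","python list sorting"
2,"Center a div","<p>How do I center a div, horizontally?</p>","css html"
'''


def read_questions(handle):
    """Yield one dict per question, with the Tags column split into a list."""
    for row in csv.DictReader(handle):
        yield {
            "id": int(row["Id"]),
            "title": row["Title"],
            "body": row["Body"],
            # Assumption: tags within one question are space-separated.
            "tags": row["Tags"].split(),
        }


questions = list(read_questions(io.StringIO(SAMPLE_CSV)))
for question in questions:
    print(question["id"], question["title"], question["tags"])
```

To read the real file, the same function can be given a file object opened with open("Train.csv", newline="", encoding="utf-8"); passing newline="" lets the csv module handle line breaks inside quoted Body fields.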

Languages and technologies used

Implementation is in Python 3.7.
Code was written using Google Colab, Anaconda, and Jupyter Notebook.
The scikit-multilearn library was used for multi-label classification.
Other packages and libraries used: Matplotlib, NumPy, pandas, Beautiful Soup, NLTK, SciPy, scikit-learn, Keras.
Markdown editor for README: StackEdit

Notes

Because of limited hardware, we used a subset of the data that is too small to be acceptable. In the future, the models should be trained on a larger dataset and the results compared.
In the future, we can also try more complex NLP word embeddings and models.
The Heuristic turned out to perform very well, so we will continue to use it. Tags on the same question are related, so knowing some of a question's tags makes it easier to predict the rest. The Heuristic could therefore serve as a starting point, extended with ML.
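
The observation that tags on the same question are related can be illustrated with simple co-occurrence counts. The sketch below is only an illustration of that idea and is not The Heuristic itself, which this README does not describe; the tag sets and the function names build_cooccurrence and suggest_related are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

# Hypothetical tag sets, one list per question.
TAG_SETS = [
    ["python", "pandas", "dataframe"],
    ["python", "pandas"],
    ["python", "numpy"],
    ["javascript", "jquery"],
    ["javascript", "jquery", "html"],
]


def build_cooccurrence(tag_sets):
    """Count, for each tag, how often every other tag appears alongside it."""
    cooc = defaultdict(Counter)
    for tags in tag_sets:
        # Ordered pairs, so both cooc[a][b] and cooc[b][a] are incremented.
        for a, b in permutations(set(tags), 2):
            cooc[a][b] += 1
    return cooc


def suggest_related(known_tags, cooc, top_n=2):
    """Rank the tags that co-occur most often with the already known tags."""
    scores = Counter()
    for tag in known_tags:
        scores.update(cooc.get(tag, Counter()))
    # Do not suggest tags that are already known.
    for tag in known_tags:
        scores.pop(tag, None)
    # Sort by count (descending), then by name so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_n]]


cooc = build_cooccurrence(TAG_SETS)
print(suggest_related(["pandas"], cooc))
```

Counts like these could complement a text-based model: once some tags are predicted for a question, frequently co-occurring tags become candidates for the remaining ones.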

Trained models

Our trained models and vectorizers are available here: Google Drive Shared Folder

References

Inspiration for the project: "Tag Prediction from Stack Overflow Questions" - Jalal Buckley, Kevin Fuhs, Reid M. Whitaker (poster, report)
Dataset: Facebook Recruiting III - Keyword Extraction

Contact

Marija Katić | katic.marija.97@gmail.com, mr16032@alas.matf.bg.ac.rs | https://github.com/marijakatic

Folders and files

assets
.gitignore
01 Problem and Data Analysis.ipynb
02 Text Preprocessing.ipynb
03 Feature selection, Heuristic implemented, Train-Test split stratification problem.ipynb
04 Models - Transformation into binary classification.ipynb
05 Models - Adapted algorithms.ipynb
06 Final model.ipynb
README.md
utility.py
