Machine learning project. Tag prediction from Stack Overflow questions.

marijakatic/Stack-Overflow-Tag-Predictor

Tag Prediction from Stack Overflow Questions

Student project for the Machine Learning course, Master's degree studies, Faculty of Mathematics, University of Belgrade.

About The Project

The project is inspired by a CS229 student final project:

Tag Prediction from Stack Overflow Questions - Jalal Buckley, Kevin Fuhs, Reid M. Whitaker (poster, report)

The task and the data come from the Kaggle competition Facebook Recruiting III - Keyword Extraction.

Task description (Description from competition overview):

The task is to predict the tags (a.k.a. keywords, topics, summaries), given only the question text and its title. The dataset contains content from disparate stack exchange sites, containing a mix of both technical and non-technical questions.

Data description (Description from competition overview):

All of the data is in 2 files: Train and Test. Train.csv contains 4 columns: Id, Title, Body, Tags

  • Id - Unique identifier for each question
  • Title - The question's title
  • Body - The body of the question
  • Tags - The tags associated with the question (all lowercase, should not contain tabs '\t' or ampersands '&')

Test.csv contains the same columns but without the Tags, which you are to predict.

The questions are randomized and contain a mix of verbose text sites as well as sites related to math and programming. The number of questions from each site may vary, and no filtering has been performed on the questions (such as closed questions).
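To make the file layout concrete, here is a minimal sketch of reading rows in the Train.csv format with Python's standard `csv` module. The sample rows are made up for illustration and are not competition data. The helper name `read_questions` is hypothetical, and splitting the Tags column on whitespace is an assumption that the description above does not state.

```python
import csv
import io

# Made-up sample in the Train.csv layout (Id,Title,Body,Tags).
# These rows are illustrative only, not real competition data.
SAMPLE_TRAIN_CSV = '''Id,Title,Body,Tags
1,"How to reverse a list?","<p>I have a list in Python, and want it reversed.</p>","python list"
2,"Solving a quadratic","<p>How do I find the roots?</p>","algebra"
'''


def read_questions(handle):
    """Yield one dict per question, with the Tags column split into a list."""
    for row in csv.DictReader(handle):
        yield {
            "id": int(row["Id"]),
            # The task uses only the title and the body as input text.
            "text": row["Title"] + " " + row["Body"],
            # Assumption: tags within one cell are separated by whitespace.
            "tags": row["Tags"].split(),
        }


questions = list(read_questions(io.StringIO(SAMPLE_TRAIN_CSV)))
```

For a file on disk, the same function could be given a handle opened with `open(path, newline="", encoding="utf-8")`, so that quoted multi-line bodies are parsed correctly. For Test.csv the Tags lookup would have to be dropped, since that column is absent.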

Languages and technologies used

Notes

  • Because of limited hardware, we trained on a subset of the data that is too small to draw reliable conclusions. In the future, the models should be trained on a larger dataset and the results compared.
  • In the future, we could try more complex NLP word embeddings and models.
  • The Heuristic turned out to work very well, so we will keep using it. Tags on the same question are related, so once some tags are known it is easier to predict the others. The Heuristic could therefore serve as a starting point, to be extended with ML.
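The observation that tags on one question are related can be illustrated with a simple co-occurrence count. This is a generic sketch and not the project's Heuristic, which is not described here. The function names `build_cooccurrence` and `suggest_related` and the example tag lists are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_cooccurrence(tag_lists):
    """Count how often each ordered pair of tags appears on the same question."""
    cooc = defaultdict(Counter)
    for tags in tag_lists:
        for a, b in permutations(set(tags), 2):
            cooc[a][b] += 1
    return cooc


def suggest_related(known_tags, cooc, top_n=3):
    """Rank tags by how often they co-occur with the tags already known."""
    scores = Counter()
    for tag in known_tags:
        scores.update(cooc.get(tag, {}))
    # Do not suggest tags the question already has.
    for tag in known_tags:
        scores.pop(tag, None)
    return [tag for tag, _ in scores.most_common(top_n)]


# Made-up tag lists, for illustration only.
example_tags = [
    ["python", "pandas", "dataframe"],
    ["python", "pandas"],
    ["python", "numpy"],
    ["javascript", "jquery"],
]
cooc = build_cooccurrence(example_tags)
related = suggest_related(["pandas"], cooc)
```

With these example lists, `related` ranks "python" ahead of "dataframe", because "python" appears alongside "pandas" twice and "dataframe" only once. A learned model could use such counts as extra features instead of using them directly as predictions.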

Trained models

Our trained models and vectorizers are available here: Google Drive Shared Folder

References

Contact
