A classifier of Stackoverflow posts into relevant topics/tags.

tvasil/stackoverflow-topic-classifier

stackoverflow-topic-classifier

This Python project contains the code to train, evaluate and predict from a machine learning model that attempts to identify the appropriate tags that a StackOverflow post should receive. The idea of the project is the following:

Given the concatenated title + body text of a StackOverflow post, such as the following:

What are metaclasses in Python?

In Python, what are metaclasses and what do we use them for?

can we predict the tags that the user would have given to the post? Specifically, ['python', 'oop', 'metaclass', 'python-datamodel']
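
To make the problem framing concrete, here is a minimal sketch of how a post and its tags could be turned into a model input and a multi-label target. The names `KNOWN_TAGS`, `build_input` and `encode_tags`, as well as the blank-line separator between title and body, are assumptions for illustration and not the project's actual code.

```python
# Hypothetical subset of the tag vocabulary (assumption for illustration);
# the project keeps its real tag list in so_tag_classifier_core.
KNOWN_TAGS = ["python", "oop", "metaclass", "python-datamodel", "mysql"]


def build_input(title, body):
    """Concatenate title and body into a single string (separator is an assumption)."""
    return f"{title}\n\n{body}"


def encode_tags(tags, vocabulary=KNOWN_TAGS):
    """Return a multi-hot vector: 1 where the post carries the tag, else 0."""
    present = set(tags)
    return [1 if tag in present else 0 for tag in vocabulary]


text = build_input(
    "What are metaclasses in Python?",
    "In Python, what are metaclasses and what do we use them for?",
)
target = encode_tags(["python", "oop", "metaclass", "python-datamodel"])
print(target)  # [1, 1, 1, 1, 0]
```

Each position in the target corresponds to one tag, so a single post can carry several tags at once, which is what makes this a multi-label problem rather than a multi-class one.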

Requirements

The minimal requirements are:

- python 3.8
- scikit-learn 0.23.2
- joblib 0.17
- nltk 3.5

For convenience, I've exported the full environment.yml file from my conda environment. Mind you, that's just for reproducibility and is not used by the sub-modules. Each of those specifies its own dependencies in an individual setup.py file.

Project structure

The project includes 3 pip installable packages, namely:

  • so_tag_classifier_core
  • so_tag_classifier_prediction
  • so_tag_classifier_training

so_tag_classifier_core includes data transformation methods that are used to transform input data for both training and prediction. It also includes the set of tags we are interested in predicting. For simplicity, the classifier is limited to the top 100 labels/tags as extracted on Nov 18, 2020 from a random sample of 100,000 StackOverflow posts. The rest are not predicted. The list can be modified directly in that package.
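
As an illustration of the "top N tags" idea, the sketch below counts tags over a sample of posts and drops everything outside the most frequent ones. The function names and the toy data are hypothetical; the package ships a fixed list rather than necessarily computing it this way.

```python
from collections import Counter


def top_tags(posts_tags, n):
    """Return the n most frequent tags across a sample of posts."""
    counts = Counter(tag for tags in posts_tags for tag in tags)
    return [tag for tag, _ in counts.most_common(n)]


def keep_known(tags, allowed):
    """Drop any tag that is not in the allowed list, preserving order."""
    allowed_set = set(allowed)
    return [tag for tag in tags if tag in allowed_set]


# Toy sample standing in for the 100,000-post sample.
sample = [["python", "pandas"], ["python", "oop"], ["mysql"], ["python", "mysql"]]
allowed = top_tags(sample, 2)
print(allowed)  # ['python', 'mysql']
print(keep_known(["python", "oop", "mysql"], allowed))  # ['python', 'mysql']
```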

so_tag_classifier_prediction is the package that can be used directly by a service to make predictions. It can be used for other types of problems as well, since the only limitation is that the input is a string. You can import predict.predict to make predictions from elsewhere, or you can run it as a script in the terminal, such as:

python3 predict.py -t "I would like to know how I can aggregate in MySQL efficiently" -mp ~/model.pkl

Note that the model path needs to be provided externally. Currently it needs to be a local path, but I will later add functionality to make it possible to load from S3.
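
Since the model has to come from a local path, a caller may want to validate that path before loading. The following is a minimal sketch, assuming the model file was written with joblib (which is listed in the requirements); `resolve_model_path` and `load_model` are hypothetical helper names, not functions from the package.

```python
from pathlib import Path


def resolve_model_path(model_path):
    """Expand '~' and check that the path points to an existing file."""
    path = Path(model_path).expanduser()
    if not path.is_file():
        raise FileNotFoundError(f"No model file found at {path}")
    return path


def load_model(model_path):
    """Load a serialized model; assumes it was saved with joblib.dump."""
    import joblib  # imported lazily so the path check works without joblib

    return joblib.load(resolve_model_path(model_path))
```

Expanding `~` explicitly matters when the path is passed in quotes or from a config file, since the shell only expands it for unquoted arguments.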

so_tag_classifier_training is a package that can be used to run a training pipeline on new input data. It builds a scikit-learn Pipeline and executes a RandomizedSearchCV for hyperparameter tuning. The search space to explore can be configured in params/parameter_tuning.py.
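
For readers unfamiliar with this pattern, here is a small self-contained sketch of a scikit-learn Pipeline tuned with RandomizedSearchCV on a multi-label text task. The choice of steps (TF-IDF plus one-vs-rest logistic regression), the parameter ranges and the toy data are all assumptions for illustration; the actual pipeline and search space are defined in the training package.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up toy posts and tags, only to make the example runnable.
texts = [
    "How do I aggregate rows in MySQL efficiently",
    "What are metaclasses in Python",
    "Group by and sum in a MySQL query",
    "Python class inheritance and metaclass usage",
    "Read a MySQL table into Python",
    "Python list comprehension syntax",
    "MySQL index on multiple columns",
    "Connect to MySQL from a Python script",
]
tags = [
    ["mysql"], ["python"], ["mysql"], ["python"],
    ["mysql", "python"], ["python"], ["mysql"], ["mysql", "python"],
]

# Turn the tag lists into a binary indicator matrix (one column per tag).
y = MultiLabelBinarizer().fit_transform(tags)

# Assumed pipeline: vectorize the text, then fit one binary classifier per tag.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression(max_iter=1000))),
])

# Assumed search space; keys follow the "<step>__<param>" naming convention.
param_distributions = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__C": [0.1, 1.0, 10.0],
}

# Sample 3 of the 6 possible combinations and evaluate each with 2-fold CV.
search = RandomizedSearchCV(
    pipeline, param_distributions, n_iter=3, cv=2, random_state=0
)
search.fit(texts, y)
print(search.best_params_)
```

Unlike an exhaustive grid search, the randomized search only evaluates `n_iter` sampled combinations, which keeps the cost bounded as the search space grows.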

You can install any of these packages by running:

cd stackoverflow-topic-classifier
pip install stackoverflow-topic-classifier/core
# pip install stackoverflow-topic-classifier/prediction
# pip install stackoverflow-topic-classifier/training

Prediction

Running the gRPC service

We've called the gRPC service Nostradamus, because it predicts the future 😄, hopefully with better results than the famous astrologer himself. The basic way to run the service is the following:

  1. Ensure all the local packages you need are installed:
pip install stackoverflow-topic-classifier/core
pip install stackoverflow-topic-classifier/prediction
  2. Ensure all gRPC dependencies are covered:
pip install grpcio
pip install grpcio-reflection
  3. Run the server in one terminal (note that it runs on port 50051 by default):
python nostradamus.service/nostradamus_server.py
  4. In another terminal, run the client:
python nostradamus.service/nostradamus_client.py --host localhost --port 50051

Note that the client implementation is just a demo to see how you would ask for a StackOverflow label prediction using the Python stub.

Running with Docker

Simply build the image from the top directory and then run it as follows:

docker build . -t nostradamus # from the stackoverflow-topic-classifier dir
docker run --rm -d -v /`pwd`/models:/models/ -p 50051:50051 --name nostradamus-container nostradamus

This will run your container in detached mode. The service is now up. You can check by running docker ps or making a request via the client, by running python nostradamus.service/nostradamus_client.py --host localhost --port 50051.
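
If you just want a quick check that something is listening on the published port, a plain TCP connection attempt is enough. This is a minimal sketch using only the standard library; `is_port_open` is a hypothetical helper, and a successful connection only shows that the port accepts connections, not that the gRPC service behind it is healthy.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    # 50051 is the default port mentioned above.
    print(is_port_open("localhost", 50051))
```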

Training

To use the training module, you can directly execute the train.py script with some command-line arguments, or import the functions to run a manual training (for example in a notebook).

To run the training, you have two options:

  1. Base model pipeline (the basic parameters are currently hardcoded into the _get_training_pipeline function)
  2. Cross-validated hyperparameter search (the RandomizedSearchCV mentioned in the project structure section), starting from the base pipeline in #1 and searching through the space defined in so_tag_classifier_training/params/parameter_tuning.py.

You can choose which way you want to run the training by adding (or omitting) the --gs flag (which stands for GridSearch).

To run training with the hyperparameter search:

cd so_tag_classifier_training
mkdir tmp # make sure this folder exists as this is where the models will be saved before being uploaded to S3
python3 so_tag_classifier_training/train.py --train_data /path/to/data.csv --logger_file logs.log --gs
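
The flags used in the command above could be parsed along these lines. This is a sketch with argparse, not the contents of train.py: the flag names come from the command, while the help texts, the default for --logger_file and the `choose_mode` helper are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the three flags shown in the training command."""
    parser = argparse.ArgumentParser(description="Train the tag classifier (sketch).")
    parser.add_argument("--train_data", required=True, help="Path to the training CSV.")
    # The default value here is an assumption, not taken from the project.
    parser.add_argument("--logger_file", default="logs.log", help="Where to write logs.")
    # store_true makes --gs a switch: present means search, absent means base pipeline.
    parser.add_argument("--gs", action="store_true", help="Run the hyperparameter search.")
    return parser


def choose_mode(args):
    """Map the --gs switch to one of the two training options."""
    return "search" if args.gs else "base"


args = build_parser().parse_args(["--train_data", "/path/to/data.csv", "--gs"])
print(choose_mode(args))  # search
```

Leaving out --gs yields the base pipeline, which corresponds to option 1 above.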
