# Text classification: Predicting whether a GitHub issue is about docs

This directory contains the datasets and scripts for an example project using Prodigy to train a binary text classifier with exclusive classes to predict whether a GitHub issue title is about documentation.

We've limited our experiments to spaCy, but you can use the annotations in any other text classification system instead. If you run the experiments, please let us know! Feel free to submit a pull request with your scripts.

## 🧮 Results

| Model | F-Score | # Examples |
| --- | --- | --- |
| spaCy (blank) | 88.8 | 661 |
| spaCy (`en_vectors_web_lg`) | 91.9 | 661 |

## 📚 Data

Labelling the data with Prodigy took about two hours and was done manually using the binary classification interface. The raw text was sourced from the GitHub API using the search queries "docs", "documentation", "readme" and "instructions".

| File | Count | Description |
| --- | --- | --- |
| `docs_issues_training.jsonl` | 661 | Training data annotated with `DOCUMENTATION` label. |
| `docs_issues_eval.jsonl` | 500 | Evaluation data annotated with `DOCUMENTATION` label. |

### Training and evaluation data format

The training and evaluation datasets are distributed in Prodigy's simple JSONL (newline-delimited JSON) format. Each entry contains a `"text"`, a `"label"` and an `"answer"` (`"accept"` if the label applies, `"reject"` if it doesn't). Here are two simplified example entries:

```json
{
  "text": "Add FAQ's to the documentation",
  "label": "DOCUMENTATION",
  "answer": "accept"
}
{
  "text": "Proposal: deprecate SQTagUtil.java",
  "label": "DOCUMENTATION",
  "answer": "reject"
}
```

### Data creation workflow

```bash
prodigy mark docs_issues_data ./raw_text.jsonl --label DOCUMENTATION --view-id classification
```

## 🚘🐱 Live demo and model download

We also trained a model using Allen AI's Autocat app (a web-based tool for training, visualizing and showcasing spaCy text classification models). You can try out the classifier in real-time and see the updated predictions as you type. You can also evaluate it on your own data, download the model Python package or just pip install it with one command to try it locally. View model here.

To use the JSONL data in Autocat, we added `"labels": ["DOCUMENTATION"]` to all examples with `"answer": "accept"` and `"labels": ["N/A"]` to all examples with `"answer": "reject"`.
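The conversion described above is mechanical, so it's easy to script. A minimal sketch (the `to_autocat` helper name is ours, and we assume the multi-label `"labels"` key shown in the paragraph above):

```python
def to_autocat(example):
    """Convert a Prodigy binary annotation to a multi-label Autocat entry.

    Accepted examples keep their label; rejected examples are assigned
    the placeholder "N/A" label, as described above.
    """
    labels = [example["label"]] if example["answer"] == "accept" else ["N/A"]
    return {"text": example["text"], "labels": labels}
```

Mapping this function over each line of the JSONL files produces data in the shape Autocat expects.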

## 🎛 Scripts

The `scripts_spacy.py` file includes command line scripts for training and evaluating spaCy models using the data in Prodigy's format. This should let you reproduce our results. We tried to keep the scripts as straightforward as possible. To see the available arguments, you can run `python scripts_spacy.py [command] --help`.

| Command | Description |
| --- | --- |
| `train` | Train a model from Prodigy annotations. Will optionally save the best model to disk. |
| `evaluate` | Evaluate a trained model on Prodigy annotations and print the accuracy. |