Document Classification in Python
A tutorial showing how to use two libraries, gensim and scikit-learn, to perform document similarity queries as well as document classification.
corpus -- A directory of 4 tiny text files
.gitignore -- Files in repo for Git to ignore
classifier.py -- The main file that does everything
requirements.txt -- File used by pip to download dependencies
All you need to do is clone the repo:
git clone https://github.com/Scripted/NLP-Tutorial
In a perfect world, running "pip install -r requirements.txt" would download all the dependencies needed to run this code. Unfortunately, NumPy and SciPy don't always install cleanly with pip. Try "pip install -r requirements.txt" first; if that fails, follow the installation instructions on each module's site: NumPy, SciPy, Gensim, scikit-learn.
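If it is unclear whether the install succeeded, a quick check along these lines can help. This is only a sketch and not part of the repo: the helper name missing_modules is made up for illustration, and it assumes the four libraries are imported under their usual module names (note that scikit-learn is imported as sklearn).

```python
import importlib.util

# Usual import names mapped to the library names used in this README.
# Assumption: requirements.txt installs these four packages.
REQUIRED = {
    "numpy": "NumPy",
    "scipy": "SciPy",
    "gensim": "Gensim",
    "sklearn": "scikit-learn",
}


def missing_modules(required=REQUIRED):
    """Return the display names of modules that cannot be found (hypothetical helper)."""
    return [
        label
        for module_name, label in required.items()
        if importlib.util.find_spec(module_name) is None
    ]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All dependencies found.")
```

Anything reported as missing is a candidate for a manual install from that library's site.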
Easy enough:
python classifier.py
The output shows the various steps of the algorithm as it works.
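For a rough idea of the two tasks named at the top (similarity queries and classification), here is a minimal sketch using scikit-learn only. It is not the code in classifier.py: the four toy documents, their labels, and the choice of TF-IDF features with a naive Bayes classifier are all assumptions made for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the corpus directory (made-up documents and labels).
docs = [
    "the team won the football match",
    "the player scored a goal in the match",
    "bake the cake in the oven",
    "stir the soup and add salt",
]
labels = ["sports", "sports", "cooking", "cooking"]

# Similarity: represent each document as a TF-IDF vector,
# then compare every pair of documents with cosine similarity.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)
sims = cosine_similarity(tfidf)  # 4 x 4 matrix, one row per document

# Classification: the same TF-IDF features feeding a naive Bayes model.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
predicted = model.predict(["the goal won the match"])[0]

print(sims.round(2))
print(predicted)
```

Each row of sims ranks the other documents by how similar they are to that row's document, which is the idea behind a similarity query; the pipeline then reuses the same kind of features to assign a label to unseen text.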