This repository was used in the Text Classification using Machine Learning session at the Lancaster Summer Schools in Corpus Linguistics and other Digital Methods (#LancsSS16 and #LancsSS17), held at Lancaster University, UK, on 12th to 15th July 2016 and 27th to 30th June 2017. http://ucrel.lancs.ac.uk/summerschool/nlp.php
Instructor: Dr. Mahmoud El-Haj http://www.lancaster.ac.uk/staff/elhaj
Slides are available online here:
Course: https://lancaster.box.com/s/fi15evvbtcs4ab0tx5zo8nxmy2yylztx
Workspace Setup: https://lancaster.box.com/s/j78l0b4197il98oze2gfqlidlsvg7jlt
The code trains classifiers for chairman's statements, governance and remuneration sections taken from 1,000 annual financial reports. Using the WEKA Java library, the code does the following:
- Creates an ARFF file
- Trains models using different algorithms
- Extracts n-gram features using WEKA's StringToWordVector filter
- Reduces the number of features
- Classifies unseen documents using the created models
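
To make the first step above more concrete, here is a minimal sketch of writing labelled text as an ARFF file. It is not the repository's code: it uses plain Java with no WEKA dependency, and the class name `ArffSketch`, the relation name `annual_reports`, the class labels and the two sample sentences are all made up for illustration. It assumes one string attribute for the text and one nominal attribute for the class, which is the usual layout for data passed to StringToWordVector.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the repository or of WEKA.
final class ArffSketch {

    // Quote a value as an ARFF string: flatten line breaks and tabs so each
    // document stays on one data row, then escape backslashes and single quotes.
    static String quote(String value) {
        String flattened = value.replaceAll("[\\r\\n\\t]+", " ").trim();
        String escaped = flattened.replace("\\", "\\\\").replace("'", "\\'");
        return "'" + escaped + "'";
    }

    // Build the ARFF text: a header declaring a string attribute and a nominal
    // class attribute, followed by one data row per (document text, label) pair.
    static String buildArff(String relation, List<String> classLabels,
                            Map<String, String> labelledDocs) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {")
          .append(String.join(",", classLabels))
          .append("}\n\n");
        sb.append("@data\n");
        for (Map.Entry<String, String> doc : labelledDocs.entrySet()) {
            sb.append(quote(doc.getKey())).append(',')
              .append(doc.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample sentences; real input would be the report sections.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("The Chairman's review of the year", "chairman");
        docs.put("The board met eight times during the year", "governance");
        List<String> labels = Arrays.asList("chairman", "governance", "remuneration");
        System.out.print(buildArff("annual_reports", labels, docs));
    }
}
```

The printed text can be saved with an `.arff` extension and opened in WEKA. The remaining steps (filtering, feature reduction, training and classification) depend on the WEKA API and are left to the repository's own code.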