Project aim:
Train a document classification model. Deploy the model to a public cloud platform.
Data:
Contains two columns: document label and hashed document content.
Original Document Distribution
Total Documents: 62,204
Since the document distribution is uneven, I balance the dataset by undersampling, which reduces the number of documents to be processed and the computational power required.
Undersampled Document Distribution
Total Documents: 3,206
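The undersampling step above can be sketched with pandas: cap every class at the size of the smallest class. The column names `label` and `content` are assumptions for illustration, not confirmed by this write-up.

```python
import pandas as pd

# Toy frame standing in for the real data; the column names 'label'
# and 'content' are assumptions made for this sketch.
df = pd.DataFrame({
    "label": ["invoice"] * 50 + ["receipt"] * 10 + ["contract"] * 5,
    "content": [f"hashed_doc_{i}" for i in range(65)],
})

# Undersample: every class is capped at the smallest class size.
min_count = df["label"].value_counts().min()
balanced = df.groupby("label").sample(n=min_count, random_state=42)

print(balanced["label"].value_counts())
```

With the real data this is what takes the corpus from 62,204 documents down to a balanced set.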
Model:
I used the (Multinomial) Naive Bayes model available in scikit-learn, trained it on the undersampled document set, and tested it with test sets of various sizes.
When splitting the undersampled document set into 80% train / 20% test, the model predicts the test data with 75% accuracy. For larger test set sizes the model consistently achieves around 75% accuracy.
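The train/evaluate loop described above can be sketched as follows. The corpus here is a tiny synthetic stand-in for the hashed document content, but the pipeline shape (TF-IDF features, MultinomialNB, 80/20 stratified split) matches the one described.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny synthetic corpus standing in for the real hashed documents.
docs = ["invoice total amount due", "invoice payment due date",
        "contract party agreement terms", "contract terms signature party"] * 10
labels = ["invoice", "invoice", "contract", "contract"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels)

# Fit the TF-IDF vocabulary on the training split only.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_vec, y_train)
acc = accuracy_score(y_test, clf.predict(X_test_vec))
print(f"test accuracy: {acc:.2f}")
```

Fitting the vectorizer on the training split only avoids leaking test-set vocabulary statistics into the features.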
It is important to test different ML models (such as neural nets, random forests, etc.) when devising an ML solution to a problem. I did not evaluate other models due to time constraints.
Deployment:
The prediction function is deployed on AWS Lambda, and the model and the packages the function requires are stored on AWS S3.
Steps Taken:
- Undersampling Data
- Creating feature vectors using a TF-IDF vectorizer
- Training the model on undersampled data
- Saving the model and feature vocabulary to be used in the prediction function
- Deploying the function to AWS Lambda (deploying a function to Lambda involves building the libraries the prediction function requires from source in the Amazon Linux environment. A Docker image of the Amazon Linux environment can be downloaded here. The built libraries and the lambda_function need to be uploaded to S3 as a .zip file)
- Creating an API on AWS API Gateway
- Building a UI for submitting requests to the API (uses Django)
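The "saving the model" and "deploying the function" steps can be sketched together. This is a hypothetical handler, not the deployed code: the event shape, file names, and loading from local disk (rather than fetching from S3 inside Lambda) are assumptions for illustration.

```python
import json

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train a minimal model so the sketch is self-contained; in the real
# flow the fitted model and vocabulary come from the training step.
docs = ["invoice total due", "contract terms party"]
labels = ["invoice", "contract"]
vec = TfidfVectorizer().fit(docs)
clf = MultinomialNB().fit(vec.transform(docs), labels)

# Persist the model and the fitted vectorizer (vocabulary) for the
# prediction function, as in the 'Saving the model' step.
joblib.dump(clf, "model.joblib")
joblib.dump(vec, "vectorizer.joblib")

def lambda_handler(event, context):
    """Hypothetical Lambda entry point; the event body shape is assumed."""
    model = joblib.load("model.joblib")            # in practice: pulled from S3
    vectorizer = joblib.load("vectorizer.joblib")  # in practice: pulled from S3
    text = json.loads(event["body"])["document"]
    label = model.predict(vectorizer.transform([text]))[0]
    return {"statusCode": 200, "body": json.dumps({"label": label})}

# Simulate the API Gateway -> Lambda invocation locally.
resp = lambda_handler({"body": json.dumps({"document": "invoice total due"})}, None)
print(resp)
```

API Gateway would forward the HTTP request body to this handler in the `event` argument and return the dict above as the HTTP response.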
Improvements
- Classification can be improved by removing stop-words before hashing the raw text output from the OCR layer.
- A stop-word list can be built for the domain in which this model will be used; financial documents likely contain many domain-specific stop-words that barely help distinguish between document classes.
- Trying multiple models and parameter tuning will help us to use the best suitable model.
scikit-learn package built for AWS Lambda (amazonlinux2018.03): download
Contact me for links to the UI and API