Project aim:
Train a document classification model. Deploy the model to a public cloud platform.
Data:
Contains two columns: document label and hashed document content.
Original Document Distribution
Total Documents: 62,204
Since the document distribution is uneven, I balance the dataset by undersampling, which reduces the number of documents to be processed and the computational power required.
Undersampled Document Distribution
Total Documents: 3,206
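The undersampling step above can be sketched with pandas: cap every class at the size of the smallest class. The column names `label` and `content` are assumptions for illustration, not confirmed by this write-up.

```python
import pandas as pd

# Toy frame standing in for the real data; the column names 'label'
# and 'content' are assumptions made for this sketch.
df = pd.DataFrame({
    "label": ["invoice"] * 50 + ["receipt"] * 10 + ["contract"] * 5,
    "content": [f"hashed_doc_{i}" for i in range(65)],
})

# Undersample: every class is capped at the smallest class size.
min_count = df["label"].value_counts().min()
balanced = df.groupby("label").sample(n=min_count, random_state=42)

print(balanced["label"].value_counts())
```

With the real data this is what takes the corpus from 62,204 documents down to a balanced set.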
Model:
I used the (Multinomial) Naive Bayes model available in scikit-learn, trained it on the undersampled document set, and tested it with test sets of various sizes.
When splitting the undersampled document set into 80% train / 20% test, the model predicts the test data with 75% accuracy. For larger test set sizes the model consistently achieves around 75% accuracy.
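The train/evaluate loop described above can be sketched as follows. The corpus here is a tiny synthetic stand-in for the hashed document content, but the pipeline shape (TF-IDF features, MultinomialNB, 80/20 stratified split) matches the one described.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Tiny synthetic corpus standing in for the real hashed documents.
docs = ["invoice total amount due", "invoice payment due date",
        "contract party agreement terms", "contract terms signature party"] * 10
labels = ["invoice", "invoice", "contract", "contract"] * 10

X_train, X_test, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2, random_state=42, stratify=labels)

# Fit the TF-IDF vocabulary on the training split only.
vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

clf = MultinomialNB()
clf.fit(X_train_vec, y_train)
acc = accuracy_score(y_test, clf.predict(X_test_vec))
print(f"test accuracy: {acc:.2f}")
```

Fitting the vectorizer on the training split only avoids leaking test-set vocabulary statistics into the features.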
It is important to test different ML models (such as neural nets, random forests, etc.) when devising an ML solution to a problem. I did not evaluate other models due to time constraints.
Deployment:
The prediction function is deployed on AWS Lambda, and the model and the packages the function requires are stored on AWS S3.
Steps Taken:
- Undersampling Data
- Creating feature vectors using a TF-IDF vectorizer
- Training the model on undersampled data
- Saving the model and feature vocabulary to be used in the prediction function
- Deploying the function to AWS Lambda (deploying a function to Lambda involves building the libraries the prediction function requires from source in the Amazon Linux environment. A Docker image of the Amazon Linux environment can be downloaded here. The built libraries and the lambda_function need to be uploaded to S3 as a .zip file)
- Creating an API on AWS API Gateway
- Building a UI for submitting requests to the API (uses Django)
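The "saving the model" and "deploying the function" steps can be sketched together. This is a hypothetical handler, not the deployed code: the event shape, file names, and loading from local disk (rather than fetching from S3 inside Lambda) are assumptions for illustration.

```python
import json

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Train a minimal model so the sketch is self-contained; in the real
# flow the fitted model and vocabulary come from the training step.
docs = ["invoice total due", "contract terms party"]
labels = ["invoice", "contract"]
vec = TfidfVectorizer().fit(docs)
clf = MultinomialNB().fit(vec.transform(docs), labels)

# Persist the model and the fitted vectorizer (vocabulary) for the
# prediction function, as in the 'Saving the model' step.
joblib.dump(clf, "model.joblib")
joblib.dump(vec, "vectorizer.joblib")

def lambda_handler(event, context):
    """Hypothetical Lambda entry point; the event body shape is assumed."""
    model = joblib.load("model.joblib")            # in practice: pulled from S3
    vectorizer = joblib.load("vectorizer.joblib")  # in practice: pulled from S3
    text = json.loads(event["body"])["document"]
    label = model.predict(vectorizer.transform([text]))[0]
    return {"statusCode": 200, "body": json.dumps({"label": label})}

# Simulate the API Gateway -> Lambda invocation locally.
resp = lambda_handler({"body": json.dumps({"document": "invoice total due"})}, None)
print(resp)
```

API Gateway would forward the HTTP request body to this handler in the `event` argument and return the dict above as the HTTP response.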
Improvements
- Classification can be improved by removing stop-words before hashing the raw text output from the OCR layer.
- A stop-word list can be built for the domain in which this model will be used; financial documents likely contain many domain-specific stop-words that barely help distinguish between document classes.
- Trying multiple models and parameter tuning will help us to use the best suitable model.
scikit-learn package built for AWS Lambda (amazonlinux2018.03): download
Contact me for links to the UI and API