Multilabel Task Classifier from Paper Abstract

From Paper With Tasks

An end-to-end text classification project covering data collection, model training, and deployment.
The model can classify 260 different types of paper tasks.
The keys of json_files/task_types_encoded.json list the paper tasks.
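
As a rough illustration of how a task-to-index mapping like this is typically used in multilabel classification, the sketch below turns a paper's task list into a multi-hot target vector. The small JSON string is a made-up stand-in for json_files/task_types_encoded.json; that the real file maps each task name to a contiguous integer index is an assumption, and `multi_hot` is a hypothetical helper name.

```python
import json

# Hypothetical stand-in for json_files/task_types_encoded.json.
# Assumption: keys are task names, values are label indices 0..n-1.
EXAMPLE_JSON = (
    '{"Image Classification": 0, "Object Detection": 1, "Language Modelling": 2}'
)


def multi_hot(tasks, task_to_index):
    """Return a 0/1 vector with a 1 at the index of every known task."""
    vector = [0] * len(task_to_index)
    for task in tasks:
        index = task_to_index.get(task)
        if index is not None:  # tasks outside the mapping are ignored
            vector[index] = 1
    return vector


task_to_index = json.loads(EXAMPLE_JSON)
print(sorted(task_to_index))  # the task names are the keys
print(multi_hot(["Object Detection", "Language Modelling"], task_to_index))
```

With the real file, `task_to_index` would be loaded with `json.load` from the file instead of the inline string.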

Data Collection

Data was collected from paperswithcode, covering the categories below:

Computer Vision
- Convolutional Neural Networks
- Generative Models
- Image Model Blocks
- Object Detection Models
- Image Feature Extractors
Natural Language Processing
- Language Models
- Transformers
- Word Embeddings
- Attention Patterns
- Sentence Embeddings
Reinforcement Learning
- Policy Gradient Methods
- Off-Policy TD Control
- Reinforcement Learning Frameworks
- Q-Learning Networks
- Value Function Estimation
Audio
- Generative Audio Models
- Audio Model Blocks
- Text-to-Speech Models
- Speech Separation Models
- Speech Recognition
Sequential
- Recurrent Neural Networks
- Sequence to Sequence Models
- Time Series Analysis
- Temporal Convolutions
- Bidirectional Recurrent Neural Networks
Graphs
- Graph Models
- Graph Embeddings
- Graph Representation Learning
- Graph Data Augmentation

The scripts I've used to scrape the data can be found in the scrapers directory.

In total, I scraped 34k+ paper abstracts along with other information.

Data Processing

Initially there were 2186 different tasks in the dataset. After some analysis, I found that 1926 of them were rare (they appeared fewer than 30 times in the dataset), so I removed those tasks, bringing the task count down to 260. After that, I removed the abstracts left without any tasks, duplicate rows, and rows where no tasks had been provided in the first place. The resulting dataset contains a total of 16304 samples.
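
The filtering steps described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the code from the notebooks: the `(abstract, tasks)` pair layout, the function name `filter_rare_tasks`, and the exact order of the steps are assumptions. Only the threshold of 30 occurrences comes from the text.

```python
from collections import Counter


def filter_rare_tasks(samples, min_count=30):
    """Drop rare tasks, then rows with no tasks left, then duplicate rows.

    samples: list of (abstract, [task, ...]) pairs (assumed layout).
    Returns (kept_samples, kept_tasks).
    """
    # Count in how many rows each task appears.
    counts = Counter(task for _, tasks in samples for task in set(tasks))
    kept_tasks = {task for task, n in counts.items() if n >= min_count}

    kept_samples = []
    seen = set()
    for abstract, tasks in samples:
        remaining = [task for task in tasks if task in kept_tasks]
        if not remaining:
            continue  # nothing left to predict for this abstract
        key = (abstract, tuple(sorted(remaining)))
        if key in seen:
            continue  # duplicate row
        seen.add(key)
        kept_samples.append((abstract, remaining))
    return kept_samples, kept_tasks


# Tiny made-up example; a low threshold keeps the demo small.
demo = [
    ("a", ["x", "rare"]),
    ("b", ["x"]),
    ("c", ["rare2"]),
    ("b", ["x"]),
]
rows, tasks = filter_rare_tasks(demo, min_count=2)
print(rows, tasks)
```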

papersWithCode_data.csv is the dataset generated by the scraping. It can be found inside the csv_files directory.

Modeling

I fine-tuned a distilroberta-base model from HuggingFace Transformers using fastai and blurr. The model training notebook can be viewed here.

Also, check out the other notebooks in the notebooks directory.

Model Compression & ONNX Inference

The trained model is over 400 MB in size. I compressed it using ONNX quantization and brought it under 85 MB.
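
To give a feel for why quantization shrinks a model, the sketch below applies 8-bit affine (scale and zero point) quantization to a short list of weights in plain Python. It only illustrates the general idea: each weight is stored in one byte instead of the four used by float32. It is not the procedure from the notebooks, and the function names are hypothetical. In practice a whole ONNX model is usually quantized with a library call such as `onnxruntime.quantization.quantize_dynamic`; whether this project uses that exact call is not stated here.

```python
def quantize_uint8(weights):
    """Map floats to integers in [0, 255] using a scale and a zero point."""
    lo = min(weights + [0.0])  # the range must include 0 so that
    hi = max(weights + [0.0])  # 0.0 is represented exactly
    scale = (hi - lo) / 255 or 1.0  # fall back to 1.0 if all weights are 0
    zero_point = round(-lo / scale)
    quantized = [
        min(255, max(0, round(w / scale) + zero_point)) for w in weights
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the 8-bit integers."""
    return [(q - zero_point) * scale for q in quantized]


weights = [-1.0, 0.0, 0.5, 2.0]
q, scale, zero_point = quantize_uint8(weights)
restored = dequantize(q, scale, zero_point)
print(q)
print(restored)  # close to the original weights, within about one scale step
```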

Deployment

The compressed model is deployed as a Gradio app on HuggingFace Spaces. The implementation can be found in the deployment folder or here.
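
For a multilabel model, an app like this typically has to turn one raw score (logit) per task into a list of predicted tasks. The sketch below shows one common way to do that: apply a sigmoid to each logit and keep the tasks above a threshold. The threshold of 0.5, the label names, and the helper name `decode_predictions` are illustrative assumptions, not taken from the deployment code.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1 / (1 + math.exp(-z))
    e = math.exp(z)
    return e / (1 + e)


def decode_predictions(logits, labels, threshold=0.5):
    """Return (label, probability) pairs at or above the threshold,
    sorted from most to least likely."""
    scored = [(label, sigmoid(z)) for label, z in zip(labels, logits)]
    kept = [(label, p) for label, p in scored if p >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up labels and logits for illustration.
labels = ["Object Detection", "Language Modelling", "Speech Recognition"]
predictions = decode_predictions([2.0, -1.0, 0.0], labels)
print(predictions)
```

Because each task gets its own independent probability, an abstract can be assigned several tasks at once, or none if every score falls below the threshold.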

Web Deployment

I also deployed a Flask app that takes an abstract and shows the paper's tasks as output. Check the flask branch. The website is live here.

*Background Image Credit: The image used as the background is not mine. It was taken from here

Tasfiq-K/from-paper-with-tasks
