A comprehensive machine learning pipeline for classifying research papers into scientific disciplines using a fine-tuned BERT model. This system includes web scraping capabilities, data preprocessing, and a high-accuracy classification model.
The system classifies research papers into 9 major scientific categories with detailed subfields:
| Category | Subfields |
|---|---|
| Computer Science | Artificial Intelligence, Cloud Computing, Cyber Security, Quantum Computing, Software Engineering |
| Medicine | Cardiology, Surgery, Neurology, Oncology, Pediatrics, Psychiatry |
| Business | Marketing, Finance, Management, Entrepreneurship, E-commerce |
| Biology | Genetics, Ecology, Zoology, Biochemistry, Physiology |
| Physics | Quantum Mechanics, Astrophysics, Particle Physics, Cosmology |
| Chemistry | Organic Chemistry, Analytical Chemistry, Biochemistry, Medicinal Chemistry |
| Mathematics | Algebra, Calculus, Statistics, Probability, Optimization |
| Psychology | Clinical Psychology, Cognitive Psychology, Social Psychology, Neuropsychology |
| Environmental Science | Climate Change, Conservation, Sustainability, Renewable Energy |
- Original Records: 155,882 research papers
- Clean Non-duplicate Records: 140,004 papers
- Retention After Cleaning: 89.81% (140,004 of 155,882 records)
- Psychology: 16,821 records (12.0%)
- Chemistry: 16,675 records (11.9%)
- Physics: 15,941 records (11.4%)
- Business: 15,929 records (11.4%)
- Mathematics: 15,464 records (11.0%)
- Medicine: 15,361 records (11.0%)
- Computer Science: 14,776 records (10.6%)
- Biology: 14,729 records (10.5%)
- Environmental Science: 14,308 records (10.2%)
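The cleaning figures above can be reproduced with a small sketch like the one below. It is only an illustration: the record fields `title`, `abstract`, and `category` are hypothetical names, and the actual cleaning rules used in the notebooks may differ.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def drop_duplicates(records: list[dict]) -> list[dict]:
    """Keep the first record for each normalized (title, abstract) pair.

    The field names are assumptions made for this sketch.
    """
    seen = set()
    unique = []
    for rec in records:
        key = (normalize(rec["title"]), normalize(rec["abstract"]))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def category_shares(records: list[dict]) -> dict[str, float]:
    """Percentage of records per category, rounded to one decimal place."""
    counts = Counter(rec["category"] for rec in records)
    total = sum(counts.values())
    return {cat: round(100 * n / total, 1) for cat, n in counts.most_common()}


# Retention computed from the record counts reported above.
retention = round(100 * 140_004 / 155_882, 2)
```

Running `category_shares` on the de-duplicated records gives a per-category breakdown in the same form as the list above.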
- Google Colab (recommended) or Jupyter Notebook
- Google Drive account for data storage
- Clone this repository to your Colab environment:
    !git clone https://github.com/Emran025/Research_Paper_Classification_model.git
    %cd Research_Paper_Classification_model
- Download all project assets using our downloader notebook. Open and run `gdrive_folder_downloader.ipynb` to download:
  - Pre-trained BERT model (`bert_text_classifier`)
  - Training checkpoints and logs
  - Cleaned dataset (`cleaned_dataset.csv`)
  - Raw scraped research papers
- You're ready to go! All project components are now available locally.
Research_Paper_Classification_model/
├── gdrive_folder_downloader.ipynb # Download project assets from Google Drive
├── mendeley_scraper.ipynb # Scrape research papers from Mendeley
├── research_papers_bert_finetuning.ipynb # BERT model training pipeline
├── bert_text_classifier/ # Pre-trained model (downloaded)
├── bert_classification_output/ # Training artifacts (downloaded)
├── cleaned_dataset.csv # Processed dataset (downloaded)
└── Mendeley_Research/ # Raw scraped data (downloaded)
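To confirm that the downloaded assets match the layout above, a quick check along these lines may help. The helper name `missing_assets` is hypothetical and not part of the repository. The asset names are taken from the tree.

```python
from pathlib import Path

# Downloaded assets listed in the project tree above.
EXPECTED_ASSETS = [
    "bert_text_classifier",
    "bert_classification_output",
    "cleaned_dataset.csv",
    "Mendeley_Research",
]


def missing_assets(root: Path) -> list[str]:
    """Return the expected assets that do not exist under `root`."""
    return [name for name in EXPECTED_ASSETS if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_assets(Path("."))
    if missing:
        print("Missing assets:", ", ".join(missing))
    else:
        print("All downloaded assets are present.")
```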
After running the downloader notebook, you'll have immediate access to:
- Fine-tuned BERT model (95.39% accuracy)
- Pre-processed dataset (140,004 research papers)
- Training logs and checkpoints
For complete reproducibility:
- Data Collection: Run `mendeley_scraper.ipynb` to scrape fresh data from Mendeley
- Model Training: Execute `research_papers_bert_finetuning.ipynb` to fine-tune BERT
- Evaluation: Use the trained model for inference and analysis
On the evaluation set, the fine-tuned BERT model achieves the following results:
- Evaluation Loss: 0.184
- Overall Accuracy: 95.39%
- Evaluation Runtime: 428.03 seconds
- Evaluation Throughput: 65.306 samples/second
| Category | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Biology | 0.94 | 0.93 | 0.94 | 3,177 |
| Business | 0.96 | 0.97 | 0.97 | 3,179 |
| Chemistry | 0.94 | 0.96 | 0.95 | 3,073 |
| Computer Science | 0.96 | 0.93 | 0.95 | 2,987 |
| Environmental Science | 0.95 | 0.94 | 0.95 | 2,850 |
| Mathematics | 0.93 | 0.96 | 0.95 | 3,091 |
| Medicine | 0.97 | 0.96 | 0.96 | 3,067 |
| Physics | 0.97 | 0.95 | 0.96 | 3,181 |
| Psychology | 0.97 | 0.97 | 0.97 | 3,348 |
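The per-category table can be cross-checked against the summary metrics with a short sketch. The values below are copied from the table. The helper names are illustrative only. Support-weighted recall is by definition equal to overall accuracy, so it should land close to the reported 95.39%, with small differences expected because the table values are rounded to two decimals.

```python
# (precision, recall, f1, support) per category, copied from the table above.
REPORT = {
    "Biology": (0.94, 0.93, 0.94, 3177),
    "Business": (0.96, 0.97, 0.97, 3179),
    "Chemistry": (0.94, 0.96, 0.95, 3073),
    "Computer Science": (0.96, 0.93, 0.95, 2987),
    "Environmental Science": (0.95, 0.94, 0.95, 2850),
    "Mathematics": (0.93, 0.96, 0.95, 3091),
    "Medicine": (0.97, 0.96, 0.96, 3067),
    "Physics": (0.97, 0.95, 0.96, 3181),
    "Psychology": (0.97, 0.97, 0.97, 3348),
}


def macro_average(report: dict, index: int) -> float:
    """Unweighted mean of one metric column across categories."""
    return sum(row[index] for row in report.values()) / len(report)


def weighted_average(report: dict, index: int) -> float:
    """Mean of one metric column weighted by each category's support."""
    total = sum(row[3] for row in report.values())
    return sum(row[index] * row[3] for row in report.values()) / total


total_support = sum(row[3] for row in REPORT.values())
macro_f1 = macro_average(REPORT, 2)
weighted_recall = weighted_average(REPORT, 1)

# Throughput implied by the reported evaluation runtime of 428.03 seconds.
throughput = total_support / 428.03
```

The total support also gives the size of the evaluation set, and dividing it by the runtime should approximately reproduce the reported samples-per-second figure.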
The model is available on Hugging Face Hub for easy integration:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("Emran025/bert_text_classifier")
model = AutoModelForSequenceClassification.from_pretrained("Emran025/bert_text_classifier")

Model Card: https://huggingface.co/Emran025/bert_text_classifier
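Once the tokenizer and model are loaded, inference might look like the sketch below. The helpers `softmax`, `top_prediction`, and `classify` are hypothetical names introduced here, and the 512-token limit is assumed from the `bert-base-uncased` base model. The label strings come from `model.config.id2label`, so they may be generic (for example `LABEL_0`) if the model was saved without category names.

```python
import math

MODEL_ID = "Emran025/bert_text_classifier"


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_prediction(logits: list[float], id2label: dict) -> tuple:
    """Return the (label, probability) pair with the highest score."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


def classify(texts: list[str], model_id: str = MODEL_ID) -> list[tuple]:
    """Classify a batch of abstracts; downloads the model on first use."""
    # Imported lazily so the helpers above work without torch installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    model.eval()

    # Truncate to 512 tokens (assumed limit of the BERT base model).
    inputs = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits
    return [top_prediction(row, model.config.id2label) for row in logits.tolist()]
```

For example, `classify(["We study entanglement in superconducting qubits."])` would return a list containing one `(label, probability)` pair.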
All datasets and model artifacts are available via Google Drive:
- Main Drive Folder: Google Drive Link
- Raw datasets organized by scientific discipline
- Pre-trained model and training checkpoints
- Cleaned and processed datasets
Data was collected from the Mendeley research catalog across 9 major disciplines:
- Computer Science, Medicine, Business, Chemistry, Mathematics
- Psychology, Environmental Science, Biology, Physics
Each category contains multiple specialized subfields for comprehensive coverage.
- Framework: PyTorch with Transformers
- Base Model: BERT (bert-base-uncased)
- Training Environment: Google Colab
- Dataset Size: 140,004 research papers (after cleaning)
- Classes: 9 scientific disciplines with 5-12 subfields each
- Evaluation Set Size: 27,953 papers
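The 27,953-paper evaluation set is roughly 20% of the 140,004 cleaned records. The exact split procedure is not described here, so the sketch below simply assumes a seeded random hold-out of about that size. The function name, fraction, and seed are illustrative.

```python
import random


def train_eval_split(records, eval_fraction: float = 0.2, seed: int = 42):
    """Shuffle with a fixed seed and hold out a fraction for evaluation.

    Returns (train, eval). The 20% default is an assumption based on the
    reported evaluation set size, not a documented setting.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]
```

Fixing the seed keeps the split reproducible across runs, which matters when comparing checkpoints against the same evaluation papers.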
This project is optimized for Google Colab. All file paths and dependencies are configured for Colab's environment. When running locally, adjust the file paths accordingly.
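One way to handle the path differences is to resolve the project root once and build every other path from it. This is a sketch under stated assumptions: checking `sys.modules` for `google.colab` is a common heuristic rather than an official API, and `/content` is assumed to be the directory the repository was cloned into on Colab.

```python
import sys
from pathlib import Path
from typing import Optional

REPO_NAME = "Research_Paper_Classification_model"


def running_in_colab() -> bool:
    """Heuristic: Colab kernels have the google.colab package loaded."""
    return "google.colab" in sys.modules


def project_root(
    in_colab: Optional[bool] = None, local_root: Optional[Path] = None
) -> Path:
    """Return the directory that holds the notebooks and downloaded assets."""
    if in_colab is None:
        in_colab = running_in_colab()
    if in_colab:
        # Assumes the repository was cloned into Colab's /content directory.
        return Path("/content") / REPO_NAME
    return local_root if local_root is not None else Path.cwd()


# Build asset paths from the resolved root instead of hard-coding them.
DATASET_PATH = project_root() / "cleaned_dataset.csv"
MODEL_DIR = project_root() / "bert_text_classifier"
```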
Contributors: Emran Nasser, Mohammed Alyafrosy, Ryadh Alizi
License: MIT
Last Updated: October 2024