- `ReadMe.md` (you are here)
  - Covers installing requirements and initializing the dataset.
  - Then redirects you to the `ResearchSteps.md` file.
- `ResearchSteps.md`
  - Covers all steps from literature review through data preprocessing, EDA, and insights.
  - Then redirects you to the `HyperparameterTuning.md` file.
- `HyperparameterTuning.md`
  - Covers hyperparameter tuning in detail, including several studies and the decisions to continue with certain hyperparameters, for both the Topic Model and the Classifier.
  - Then redirects you back to this page for the evaluation details; later sections continue with the inference app, conclusions, and more.
The repository top level has `apps`, `logs`, `models`, `dataset`, `resources`, and some essential files, including this documentation.

- `apps` is where trainers and services are kept:
  - `bertopic_trainer` - used to fine-tune and train BERTopic models
  - `classifier` - used to fine-tune and train a supervised classifier
  - `inference_service` - used for classifier inference
  - `init_database` - service that writes all arxiv documents to MongoDB; not fully utilized at the moment
  - `notebooks` - additional Jupyter notebooks
- `models` contains the MultiLabelBinarizer and `best_model.bin` files under the respective `finetuning_studies` directories (only 2 are saved, to keep the repository size down).
- `dataset` is not included, but is a placeholder for the original arxiv dataset file and the corresponding dataset splits created as part of this study.
- `resources` contains images used within the documentation.
Directory tree structure (skipping individual files):

```
arxiv_dataset_insights (project_root)
├── apps
│   ├── bertopic_trainer
│   │   ├── config
│   │   ├── logs
│   │   └── src
│   ├── classifier
│   │   ├── config
│   │   ├── logs
│   │   └── src
│   ├── inference_service
│   │   ├── config
│   │   ├── logs
│   │   └── src
│   ├── init_database
│   │   └── src
│   └── notebooks
│       └── src
├── dataset (not included)
│   ├── cache_dir
│   └── parquet
├── logs
├── models
│   └── finetuning_studies
│       ├── classifier_dataset4_study1
│       └── distilled_model_selection_sample_dataset5_3
└── resources
```
The repository includes:

- Python source code files.
- The `models` directory with the essential models:
  - Best models used for inference.
  - Best models for which evaluation scores are reported.
  - The MultiLabelBinarizer file.
- All notebook files, including the partially abandoned ones.
- Essential log files for review.
- Optuna study storage files, ending with the `.sql` extension.
- `requirements.txt` files.
- The `.gitignore` file.
- Dockerfiles and the `docker-compose.yml` file.
- Documentation files: `ReadMe.md`, `ResearchSteps.md`, `HyperparameterTuning.md`.
Not included in the repository:

- Embedding files (ending with the `.npy` extension). They are generated from scratch if they don't exist, and they are only needed when training / fine-tuning from scratch; inference calls the transformers model directly.
- Dataset files: the original arxiv dataset JSON file and the 5 datasets with different split ratios that I generated. The main reason is that the files are too big. The EDA notebook can be used to regenerate the datasets, but note that your results could differ due to random sampling from each category.
- The best model for the Topic Modeling task, because it was concluded that it did not deliver enough quality, and it additionally faced numerous issues such as reproducibility problems and the inability to process bigger datasets due to cuML's algorithm limitations.
- Models for all trials, except for the best classifier trial.
- Description
- Assumed Requirements
- Initialize Dataset
- Research Steps
- Running Classifier Evaluations
- Inference App / Endpoint
- Conclusions
- Issues Faced
The objectives of this study are to find the optimal number of categories, and then to build a classifier that assigns arxiv research articles to predefined categories. Both objectives use only the paper abstracts from the arxiv dataset for feature extraction.
- All the requirements to run a given submodule are listed in the `requirements.txt` file at the root of that module.
- Consider creating a Python virtual environment and installing the requirements from that `requirements.txt` file.
- [skip this step] Docker is installed, and appropriate permissions are granted for running this project.
  - The plan to use Docker for everything was dropped because I couldn't fully figure out how to access the GPU from a Docker container.
  - Running everything through Docker containers would work, but not using the GPU would slow things down significantly.
- Download and unzip the dataset and place it in the `dataset` directory.
  - Dataset URL: https://www.kaggle.com/datasets/Cornell-University/arxiv?resource=download
  - Place the `arxiv-metadata-oai-snapshot.json` file in the `dataset` directory.
- Initialize the MongoDB collection, so that small portions of the dataset can be accessed as needed rather than loading everything into memory.
- Run the `init_database` service and wait for it to finish:

```
docker-compose up init_database
```
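For reference, here is a minimal sketch of roughly what the `init_database` service does, assuming the snapshot is a JSON-lines file (one document per line) and `pymongo` is available; the database and collection names below are illustrative, and the actual implementation lives in `apps/init_database/src`.

```python
# Illustrative only - the real logic lives in apps/init_database/src.
# Assumes the snapshot is JSON lines (one arxiv document per line) and that
# MongoDB is reachable locally; database/collection names here are made up.
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["arxiv"]["papers"]  # hypothetical database / collection names

batch = []
with open("dataset/arxiv-metadata-oai-snapshot.json", "r") as f:
    for line in f:
        batch.append(json.loads(line))
        if len(batch) >= 10_000:          # insert in chunks to keep memory usage low
            collection.insert_many(batch)
            batch = []
if batch:
    collection.insert_many(batch)
```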
Research steps are comprehensively covered in `ResearchSteps.md`. Don't worry, it will redirect you back to the section that follows this one.
- The Classifier Evaluator program is available at `<project_root>/apps/classifier/src/classifier_evaluator.py`.
- The evaluator uses a `yaml` config file.
- Here's how you run the example evaluations:

```
# example for dataset 5
# assuming you are at the project root, i.e. inside the <arxiv_dataset_insights> directory
cd apps/classifier/src/
python classifier_evaluator.py ../config/evaluator_dataset5.yml
```
- Some evaluator results can be seen in these log files:
  - Evaluations for dataset 4: `./apps/classifier/src/logs/classifier_evaluator_dataset4-1762023-195325.txt`
  - Evaluations for dataset 5: `./apps/classifier/src/logs/classifier_evaluator_dataset5-1762023-195732.txt`
Here's an example evaluator config (`evaluator_dataset5.yml`):

```yaml
logging_dir: "../logs"
logfile_name: "classifier_evaluator_dataset5"
logging_level: "INFO"
dataset_path: "../../../dataset"
models_path: "../../../models"
# This indicates the model chosen for evaluations is the best
# model saved during the specified Optuna hyperparameter tuning study.
optuna_study_name: "classifier_dataset4_study1"
dataset_index: 5
# model settings
label_transformer: "multilabel_binarizer.pkl"
transformer_model_name: 'sentence-transformers/distilroberta-base-paraphrase-v1'
num_classes: 176
input_size: 768
batch_size: 1024
```
- `optuna_study_name`: indicates that the model chosen for evaluations is the best model saved during the specified Optuna hyperparameter tuning study.
- `dataset_index`: check out the Dataset Creation section for how the datasets are created.
- `label_transformer`: name of the label transformer present under the `models_path` directory. This model is of type `sklearn.preprocessing.MultiLabelBinarizer` and supports 176 unique classes extracted from `categories` in the original arxiv dataset.
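To make the role of the label transformer concrete, here is a tiny, hypothetical illustration of what a `MultiLabelBinarizer` does; the category values below are just examples, while the saved `multilabel_binarizer.pkl` covers all 176 classes.

```python
# Tiny illustration of the label transformer's role: mapping arxiv category
# strings to multi-hot target vectors. Categories here are examples only;
# the saved multilabel_binarizer.pkl covers all 176 classes in the dataset.
from sklearn.preprocessing import MultiLabelBinarizer

samples = [
    ["cs.LG", "stat.ML"],   # a paper tagged with two categories
    ["hep-ph"],             # a paper tagged with a single category
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(samples)   # shape: (n_samples, n_unique_categories)

print(mlb.classes_)              # ['cs.LG' 'hep-ph' 'stat.ML']
print(y)                         # [[1 0 1]
                                 #  [0 1 0]]
```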
The best model selected for this evaluation came from the following Optuna study:

- Optuna storage file: `classifier_dataset4_finetuning.sql`
- Study name: `classifier_dataset4_study1`
- Best trial: #15
- Trained on: Dataset 4 train split
- Binary cross-entropy loss reported on the validation set during the trial: 0.02078
| Threshold | Top-1-Overlap Accuracy |
|---|---|
| **Train** | |
| 0.05 | 0.9570 |
| 0.1 | 0.9239 |
| 0.2 | 0.8627 |
| 0.3 | 0.8017 |
| **Validation** | |
| 0.05 | 0.9434 |
| 0.1 | 0.9044 |
| 0.2 | 0.8369 |
| 0.3 | 0.7737 |
| **Test** | |
| 0.05 | 0.9432 |
| 0.1 | 0.9044 |
| 0.2 | 0.8373 |
| 0.3 | 0.7738 |
These results are based on straight inference on all splits, using the same best model as stated above.
| Threshold | Top-1-Overlap Accuracy |
|---|---|
| **Train (inference only)** | |
| 0.05 | 0.9568 |
| 0.1 | 0.9236 |
| 0.2 | 0.8632 |
| 0.3 | 0.8016 |
| **Validation** | |
| 0.05 | 0.9533 |
| 0.1 | 0.9185 |
| 0.2 | 0.8532 |
| 0.3 | 0.7897 |
| **Test** | |
| 0.05 | 0.9527 |
| 0.1 | 0.9178 |
| 0.2 | 0.8531 |
| 0.3 | 0.7919 |
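For clarity, here is a hedged sketch of how I read the Top-1-Overlap accuracy reported in the tables above: a sample counts as correct if at least one label predicted above the probability threshold is also among its true labels. The exact implementation used for the reported numbers is in `apps/classifier/src/classifier_evaluator.py`; the numbers below are made up.

```python
# Sketch of the Top-1-Overlap accuracy idea: a sample is "correct" if at least
# one label kept at the given probability threshold is also a true label.
# The authoritative implementation is in apps/classifier/src/classifier_evaluator.py.
import numpy as np

def top1_overlap_accuracy(probs: np.ndarray, y_true: np.ndarray, threshold: float) -> float:
    """probs and y_true have shape (n_samples, n_classes); y_true is multi-hot."""
    y_pred = probs >= threshold                                   # labels kept at this threshold
    overlap = np.logical_and(y_pred, y_true.astype(bool)).any(axis=1)
    return float(overlap.mean())

# Made-up sigmoid outputs for 3 samples and 4 classes.
probs = np.array([[0.90, 0.20, 0.10, 0.05],
                  [0.30, 0.60, 0.10, 0.20],
                  [0.04, 0.03, 0.02, 0.01]])
y_true = np.array([[1, 0, 0, 0],
                   [0, 0, 1, 0],
                   [0, 0, 0, 1]])

print(top1_overlap_accuracy(probs, y_true, threshold=0.05))       # 2/3, lower thresholds keep more labels
```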
The best model selected for this evaluation came from the following Optuna study:

- Optuna storage file: `distilled_model_choice_experiments_dataset5.sql`
- Study name: `distilled_model_selection_sample_dataset5_3`
- Best trial: #74
- Trained on: Dataset 5
- Coherence score reported on the validation set during the trial: 0.6642
| Split | Coherence Score |
|---|---|
| train | 0.6228 |
| validation | 0.5875 |
| test | 0.6066 |

| Split | Coherence Score |
|---|---|
| train | 0.5802 |
| validation | 0.5678 |
| test | 0.5569 |
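For context, this is roughly how a coherence score for a set of topics can be computed. The variant (`c_v` here), the gensim implementation, and the toy inputs are all assumptions for illustration; the numbers in the tables come from the project's own evaluation pipeline (OCTIS metrics are also mentioned under Issues Faced), which may differ in preprocessing and coherence variant.

```python
# Illustrative coherence computation with gensim's CoherenceModel.
# Coherence variant and preprocessing are assumptions; the reported numbers
# come from the project's own evaluation code.
from gensim.corpora import Dictionary
from gensim.models.coherencemodel import CoherenceModel

# Toy tokenized abstracts and topics given as top-N word lists.
texts = [
    ["quark", "gluon", "collider", "photon"],
    ["neural", "network", "training", "loss"],
    ["photon", "pairs", "hadron", "collider"],
]
topics = [
    ["quark", "gluon", "photon"],
    ["neural", "network", "loss"],
]

dictionary = Dictionary(texts)
cm = CoherenceModel(topics=topics, texts=texts, dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())   # average coherence over the supplied topics
```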
- For the best trial, with a coherence score of 0.6642 on the dataset 5 validation split, `nr_topics` was set to `1200`, although the model only returns 7 topics.
- Even setting `nr_topics` aside, the same best model saved from trial #74 reported a coherence score of only 0.5678, in contrast to the 0.6642 reported on the same split during the evaluation right after training, which reconfirms the reproducibility issues.
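The `nr_topics` discrepancy can be checked with a short script like the one below; the 20 newsgroups corpus stands in for the arxiv abstracts, everything except `nr_topics` is left at its default, and the actual training configuration lives under `apps/bertopic_trainer/config`.

```python
# Quick check of the nr_topics behaviour discussed above. The 20 newsgroups
# corpus is a stand-in for the arxiv abstracts; all other BERTopic settings
# are defaults here, unlike the project's real configs in apps/bertopic_trainer/config.
from sklearn.datasets import fetch_20newsgroups
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:5000]

topic_model = BERTopic(nr_topics=1200)            # request (at most) 1200 topics after reduction
topics, probs = topic_model.fit_transform(docs)

# get_topic_info() has one row per topic (including the -1 outlier topic);
# in this study the row count came out far smaller than the requested nr_topics.
print(len(topic_model.get_topic_info()))
```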
- The inference app is served using Flask.
- The key endpoint is `class_for_abstract`, which expects an `abstract` argument carrying the abstract string as its value.
  - It can be accessed from the browser or the Postman app, or called from programs through the REST API (a programmatic example follows below).
- Running the inference app locally using Python:

```
cd <project_root>/apps/inference_service/src
python inference_app.py ../config/inference_config.yml
```
- Sending a request through Postman or from a web browser:

```
# you can use this example
http://localhost:5000/class_for_abstract?abstract= A fully differential calculation in perturbative quantum chromodynamics is
presented for the production of massive photon pairs at hadron colliders. All
next-to-leading order perturbative contributions from quark-antiquark,
gluon-(anti)quark, and gluon-gluon subprocesses are included, as well as
all-orders resummation of initial-state gluon radiation valid at
next-to-next-to-leading logarithmic accuracy. The region of phase space is
specified in which the calculation is most reliable. Good agreement is
demonstrated with data from the Fermilab Tevatron, and predictions are made for
more detailed tests with CDF and DO data. Predictions are shown for
distributions of diphoton pairs produced at the energy of the Large Hadron
Collider (LHC). Distributions of the diphoton pairs from the decay of a Higgs
boson are contrasted with those produced from QCD processes at the LHC, showing
that enhanced sensitivity to the signal can be obtained with judicious
selection of events.
```
- The image shows successful completion of the request within the Postman app, along with the appropriate `predicted_labels` values.
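For calling the endpoint from code (as mentioned above, it is a plain REST API), here is a minimal sketch using the `requests` library; it assumes the app is running locally on port 5000 as in the examples above.

```python
# Minimal programmatic call to the inference endpoint, assuming the app is
# running locally on port 5000 as shown above.
import requests

abstract = "A fully differential calculation in perturbative quantum chromodynamics ..."

response = requests.get(
    "http://localhost:5000/class_for_abstract",
    params={"abstract": abstract},
)
print(response.json())   # expected to contain the predicted_labels for the abstract
```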
- Running the inference app using `docker-compose`:

```
cd <project_root>
docker-compose up -d inference_app
```

- Once the Docker container is up and running, you can try the same endpoint as stated above.
- Although I selected 1200 as the optimal number of categories, the reproduction attempts resulted in an almost 6% reduction in coherence score, which was the only metric used for selecting the best hyperparameters.
- Coherence scores consistently peaked when the number of categories was around 15, 800, and 1200, and consistently stayed lower for all other tested ranges.
  - Based on the top-level unique categories study from the EDA, 15 as the optimal number of categories makes sense for the broader level of categories, although there is a clear opportunity for more granular fine-tuning within that local range.
- More granular fine-tuning is also necessary within the 800-1200 local range, although I didn't focus on ways to justify and clearly distinguish the presence of as many as 800 to 1200 categories. I don't think topic modeling is an appropriate way to explain it either, specifically not BERTopic, which only merges the top N words to create a name for a topic. One would need to thoroughly explain what is in them concept-wise, and perhaps that is where summarization-based approaches would help.
- Data exploration shows tree hierarchies of categories, where the hierarchies aren't followed consistently across category labels for all samples, signalling that hierarchical topic modeling may not be the only candidate among the best possible approaches for this problem.
- The paraphrase model outperformed every other model in terms of coherence score. That demonstrates not only the power of a paraphrasing model in capturing finer concept-level details, but also its ability to bring similar words from multiple languages into a similar vector space, even though the fine-tuning was done on a largely Wikipedia-based dataset.
- There are many factors contributing positively to this model, and a deeper study into each of them could reveal interesting details and insights.
- BERTopic is an impressively designed framework that attempts to combine the best of the embeddings world, dimensionality reduction, and clustering.
  - The technique still brings it all down to the traditional tf-idf level, though. Even if it is framed as class / cluster level tf-idf (c-TF-IDF), the contextual word meanings are largely ignored at the top level by extracting only document-level embeddings, and then ignored again at the bottom level by simply using a CountVectorizer and c-TF-IDF.
  - Further, by first extracting only document-level embeddings, and then combining all documents within the same cluster into a single document, BERTopic gives up on finer-level details such as *word-level granularity*, *localized topic specificity*, *individual word interpretability*, and *document-level context*. Such details are rather well addressed by traditional topic modeling techniques such as LDA. (A rough illustration of this point follows after this list.)
  - These missing aspects are also crucial for hierarchical topic modeling, and hence could limit our ability to extensively study hierarchies in the arxiv dataset abstracts.
- To improve the pace of research in contextual embeddings for bigger datasets, libraries like cuML that offer GPU-based implementations of key dimensionality reduction and clustering algorithms need active support for processing data in batches, rather than reserving GPU memory until the end of the algorithm's execution. Currently, in my opinion, these libraries aren't ready for bigger datasets, and custom algorithms need to be implemented on top of the existing implementations to explicitly deal with issues such as cluster fragmentation, instability, and concept drift, depending on which algorithms are used.
- Due to the excessive GPU memory footprint of the dimensionality reduction and clustering algorithms on bigger datasets, I only used dataset 5, which contains only 5% of the original data within each split. Even though I used a stratified data selection strategy to sample across all 176 known unique categories, the chosen optimal number of categories might differ for datasets 1 to 4, which comprise 40% to as much as 80% of the original arxiv dataset.
- The classifier shows strong accuracy scores when considering Top-1-Overlapping-labels accuracy at lower probability thresholds, although that metric starts dropping as we raise the required label overlap to 2 and beyond. Part of the reason is the inconsistent nature of level associations in the category labels, as shown in the EDA. It wasn't thoroughly studied why such inconsistencies exist in the first place (perhaps due to the complex nature of true topic association itself), which again raises the question of whether hierarchical topic allocation is the only possible solution.
- A final big issue with BERTopic is that the paper says topics are reduced down to the value specified by the `nr_topics` argument. In contrast, the number of topics returned was nowhere close to the `nr_topics` value, and was significantly smaller in all of my studies.
- Even though I'm concluding BERTopic as a failed study due to several issues, the steps up to HDBSCAN clustering are still reliable, and hence I can pick up the clustering output as one of the possible inputs for my own study on contextual topic modeling.
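As a rough illustration of the c-TF-IDF point made in the conclusions above: once the documents of a cluster are merged into a single "class document", only bag-of-words counts survive, so contextual word meanings are lost. The snippet below approximates the spirit of c-TF-IDF with plain scikit-learn and NumPy; it is not BERTopic's actual implementation, and the toy clusters are made up.

```python
# Approximation of the c-TF-IDF idea, not BERTopic's implementation: merge each
# cluster into one document, count words, and weight by how rare a word is
# across clusters. Document-level context disappears at the merge step.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

clusters = {
    0: ["photon pairs at hadron colliders", "gluon radiation at the collider"],
    1: ["neural network training loss", "network fine tuning and loss curves"],
}

# One merged document per cluster - the step where per-document context is lost.
merged_docs = [" ".join(docs) for docs in clusters.values()]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(merged_docs).toarray()   # shape: (n_clusters, n_words)

# Simplified class-level weighting: in-cluster term frequency scaled by rarity across clusters.
tf = counts / counts.sum(axis=1, keepdims=True)
idf = np.log(1 + counts.shape[0] / (1 + (counts > 0).sum(axis=0)))
ctfidf = tf * idf

words = vectorizer.get_feature_names_out()
for cluster_id, row in zip(clusters, ctfidf):
    top_words = [words[i] for i in row.argsort()[::-1][:3]]
    print(cluster_id, top_words)   # a topic "name" is just its top weighted words
```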
- Installation related issues
- Issue with mounting GPU to docker container [unresolved]
- Issues with extracting embeddings for clustering: model choice, size and time-limit considerations, and computational limits on the local machine
- Issues while installing cuML library
- Issues faced while trying evaluation metrics from OCTIS
- GPU memory leaks and the inability to proceed with fine-tuning on the larger datasets I selected (datasets 1, 2, 3, and 4)
  - Although I did manage to train one model on dataset 1, for which the training data is approximately 40% of the original dataset.
  - That was only achieved by reducing the precision of the embeddings to float16 (see the sketch after this list).
  - Clearing GPU context memory after the intermediate steps of dimensionality reduction and clustering helped, but not as much for the bigger datasets.
- Difficulty in reproducing BERTopic results, at least in my case. This is potentially due to roughly 80% of the processing relying on GPU-based computations.
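For completeness, the float16 workaround mentioned in the list above looks roughly like this; the file name is illustrative, and the real embeddings are the `.npy` files mentioned earlier in this document.

```python
# Sketch of the float16 workaround: halving the memory footprint of precomputed
# embeddings before handing them to dimensionality reduction / clustering.
# The file name below is illustrative only.
import numpy as np

embeddings = np.load("dataset/embeddings_dataset1.npy")       # typically float32
embeddings_fp16 = embeddings.astype(np.float16)               # ~50% smaller in memory
np.save("dataset/embeddings_dataset1_fp16.npy", embeddings_fp16)

print(embeddings.nbytes, "->", embeddings_fp16.nbytes)
```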