This project implements a semi-supervised approach to classify UN speeches.
We have implemented this approach in two ways.

**Approach 1** is best illustrated with the following diagram:

- Generate sentence embeddings using the BERT Sentence Transformer
- Build a graph with each sentence as a node and the cosine similarity between embeddings as edge weights
- Generate node embeddings using Node2Vec
- Train a neural network on the graph embeddings to classify sentences into topics
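The graph-construction step above can be sketched as follows. This is a minimal illustration, not the project's code: it assumes the sentence embeddings have already been computed, and the 0.7 similarity threshold is a placeholder value, not one taken from the report:

```python
import numpy as np

def build_similarity_graph(embeddings, threshold=0.7):
    """Connect sentence nodes whose embeddings have cosine similarity
    above `threshold` (illustrative value, not from the report)."""
    # Normalise rows so a plain dot product equals cosine similarity
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    sim = unit @ unit.T
    # Edge wherever similarity exceeds the threshold; drop self-loops
    adjacency = sim > threshold
    np.fill_diagonal(adjacency, False)
    return adjacency, sim
```

The boolean adjacency matrix (or the raw similarities as edge weights) can then be handed to a graph library for the Node2Vec step.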
**Approach 2** is illustrated by the following flowchart:

- Generate sentence embeddings using the BERT Sentence Transformer
- Train a neural network (N1) on these embeddings
- Use N1 to pseudo-label the unlabelled data
- Stack the labelled and pseudo-labelled data
- Train a more complex neural network (N2) on the combined set
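The pseudo-labelling and stacking steps above can be sketched as follows. This is a simplified illustration rather than the project's code: it assumes N1's softmax probabilities for the unlabelled sentences are already available, and the 0.9 confidence cutoff is a placeholder:

```python
import numpy as np

def pseudo_label(X_lab, y_lab, X_unlab, probs, confidence=0.9):
    """Keep unlabelled examples whose N1 softmax confidence exceeds
    `confidence` (illustrative threshold) and stack them with the
    labelled set to form the training data for N2."""
    conf = probs.max(axis=1)          # N1's confidence per example
    keep = conf >= confidence         # only trust confident predictions
    X_new = np.vstack([X_lab, X_unlab[keep]])
    y_new = np.concatenate([y_lab, probs[keep].argmax(axis=1)])
    return X_new, y_new
```

Low-confidence examples are simply discarded here; more elaborate schemes re-score them on later rounds.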
Read the Detailed Report for further information.
The dataset for this project contains approximately 2 million sentences from UN General Debate speeches held from 1970 to 2016.
A sample of the dataset is saved as CSV files in this repo. The full dataset is publicly available on the Harvard Dataverse and on my GDrive.
- Download the dataset from the above GDrive link and unzip it into the `data` folder
- Execute the `preprocess.py` file
- Execute either:
  - `approach1.py` to train the model using the first approach, OR
  - `approach2.py` to train the model using the second approach
- Download the contents from the above GDrive link
- Put the CSV files in the `data` folder
- Put everything else in the `weights` folder
- For inference, execute `inference2.py` to use the saved weights from the second approach. The program will ask for an input sentence and will output the predicted class.
- pandas==2.2.1
- numpy==1.26.4
- nltk==3.8.1
- matplotlib==3.8.3
- sentence_transformers==2.5.1
- tensorflow==2.16.1
- gensim==4.3.2
- node2vec==0.4.6
- Yash Jain
- Abhinav Shukla