Commit 5f79780

Update readme.md
1 parent 4c11c31 commit 5f79780

1 file changed: +6 −6 lines changed


readme.md

+6 −6
@@ -3,23 +3,23 @@
## Title: Deep Language Model Representation of Document Clustering

### Abstract :
-Powerful document clustering models are important as they are able to efficiently process large sets of documents. These models can be useful in many fields, including general research. Searching through large corpora of publications can be a slow and tedious task; such models can reduce the time of this significantly. We investigated different variations of a pre-trained BERT model to find which is best able to produce word embeddings to represent documents within a larger corpus. These embeddings are reduced in dimensionality using PCA and clustered with K-Means in order to gain insight into which model is able to best differentiate the topics within a corpus. It was found that out of the tested BERT variations, SBERT was the best model for this task.
+Powerful document clustering models are essential as they can efficiently process large sets of documents. These models can be helpful in many fields, including general research. Searching through large corpora of publications can be a slow and tedious task; such models can significantly reduce this time. We investigated different variations of a pre-trained BERT model to find which is best able to produce word embeddings to represent documents within a larger corpus. These embeddings are reduced in dimensionality using PCA and clustered with K-Means to gain insight into which model can best differentiate the topics within a corpus. We found that SBERT was the best model for this task out of the tested BERT variations.


### Code Implementations:
* Prerequisites:
* Python 3.7 or later
-* Anaconda
* Jupyter Notebook


* Dependencies:
-The project uses multiple python libraries which are required to run this code. To install the code please run below code snippit in anaconda prompt.
+The project uses multiple Python libraries, which are required to run this code. To install them, please run the code snippet below in the Anaconda prompt.

`pip install -r requirements.txt`
-
-* NLP_Final_Project_Code.ipynb ** Note: throughout this file, we import word embeddings for each model in its corresponding file in the data/ folder. These are the word embeddings produced by each model during testing, we import them from a saved file as generating them from the model itself uses a lot of memory (8GB+), and may risk crashing the testers computer.

+* Python Notebooks: There are two Python notebooks: [1] NLP_Final_Project_Code.ipynb and [2] BERT Cosine Similarity Test.ipynb

-* BERT_base_knowledge_colab.ipynb
+* NLP_Final_Project_Code.ipynb contains the code base for evaluating the BERT textual embeddings for clustering. We use PCA for dimensionality reduction and K-Means for clustering. The embeddings are calculated separately and stored in CSV files in the **./data** folder.
+
+* In BERT Cosine Similarity Test.ipynb, we test the ability of BERT embeddings to capture the similarity between documents. For this, we manually grouped files based on their content into 1) a group of similar files and 2) a group of dissimilar files, then measured the cosine similarity for each group. We hypothesized that BERT embeddings could detect similarities among the documents based on their pretrained representation. We also evaluated SBERT, which proved to provide a better representation than the other BERT variants.
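For context on the embeddings the notebooks consume, here is a minimal sketch of one way to turn a document into a single vector with a pretrained BERT model by mean-pooling the last hidden states via Hugging Face transformers. The checkpoint name, pooling strategy, and example text are assumptions, not taken from the repository; the notebooks themselves load pre-computed embeddings from ./data rather than generating them.

```python
# Sketch only: one way to produce a single document embedding from pretrained BERT.
# The checkpoint name, truncation length, and mean-pooling choice are assumptions;
# the repository's notebooks load pre-computed embeddings from ./data instead.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed_document(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token vectors of the last hidden layer into one document vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vector = embed_document("Deep language model representations for document clustering.")
print(vector.shape)  # torch.Size([768]) for bert-base-uncased
```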
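A minimal sketch of the PCA + K-Means pipeline that NLP_Final_Project_Code.ipynb is described as implementing, assuming the pre-computed embeddings sit one per row in a hypothetical data/embeddings.csv; the file name and the values of n_components and n_clusters are illustrative, not taken from the repository.

```python
# Sketch only: cluster documents from pre-computed embeddings stored as CSV rows.
# The file name "data/embeddings.csv" and the n_components / n_clusters values
# are illustrative assumptions.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

embeddings = pd.read_csv("data/embeddings.csv").values  # shape: (n_documents, embedding_dim)

# Reduce dimensionality with PCA, then cluster the reduced vectors with K-Means.
reduced = PCA(n_components=2).fit_transform(embeddings)
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(reduced)

print(labels[:10])  # cluster assignment for the first ten documents
```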
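A minimal sketch of the cosine-similarity comparison described for BERT Cosine Similarity Test.ipynb, assuming the sentence-transformers package and the all-MiniLM-L6-v2 SBERT checkpoint (the readme does not name a specific model); the example texts are toy placeholders rather than the manually grouped files.

```python
# Sketch only: compare cosine similarity for a "similar" pair and a "dissimilar" pair
# of documents using SBERT embeddings. The checkpoint name and the toy texts are
# assumptions; the notebook works on manually grouped files instead.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")

similar_pair = [
    "BERT produces contextual embeddings for every token in a document.",
    "Contextual token embeddings are generated by the BERT language model.",
]
dissimilar_pair = [
    "BERT produces contextual embeddings for every token in a document.",
    "The recipe calls for two cups of flour and a pinch of salt.",
]

def pair_similarity(pair):
    # Encode both documents and return the cosine similarity between their vectors.
    emb = model.encode(pair)
    return float(cosine_similarity([emb[0]], [emb[1]])[0, 0])

print("similar pair:   ", pair_similarity(similar_pair))
print("dissimilar pair:", pair_similarity(dissimilar_pair))
```

If the hypothesis in the readme holds, the similar pair should score noticeably higher than the dissimilar pair.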

0 commit comments
