Welcome to the NSF Research Awards Abstracts project! This repository contains my solution for clustering abstracts into topics based on their semantic similarity using unsupervised learning techniques.

The project is structured as follows:
```
config
│   ├── model
│   │   └── model.yaml
│   ├── process
│   │   └── preprocessing.yaml
│   └── main.yaml
data
│   ├── raw
│   ├── refined
│   └── trusted
docs
models
notebooks
│   └── analysis.ipynb
src
│   ├── __pycache__
│   ├── mlruns
│   ├── outputs
│   ├── main.py
│   ├── model.py
│   └── processing.py
tests
.gitignore
.pre-commit-config.yaml
Makefile
poetry.lock
pyproject.toml
README.md
```
To get started with the project, please follow these steps:

- Clone this repository:

```bash
git clone https://github.com/Ramseths/nsf_research_awards.git
```

- Install the dependencies:

```bash
poetry install
```
The `data/` directory follows a Data Lake architecture with `raw`, `trusted`, and `refined` layers. The data for this project consists of paper abstracts provided by the NSF (National Science Foundation); the raw abstracts are stored in the `data/raw` directory.
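As a minimal, hypothetical example of reading the raw layer (the actual file layout and encoding inside `data/raw` may differ), the abstracts could be loaded like this:

```python
from pathlib import Path

# Hypothetical loader: assumes each abstract is a plain-text file
# somewhere under data/raw (the real layout may differ).
raw_dir = Path("data/raw")
abstracts = [
    path.read_text(encoding="latin-1", errors="ignore")
    for path in sorted(raw_dir.rglob("*.txt"))
]
print(f"Loaded {len(abstracts)} abstracts")
```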
In this project I used a combination of traditional and state-of-the-art NLP techniques to uncover themes in the abstracts. The main approach is LDA (Latent Dirichlet Allocation); the alternative combines document embeddings with K-Means. The pipeline consists of the following steps (a condensed sketch of the whole pipeline follows the list):
- Data Preprocessing:
  - Cleaned and preprocessed the text data to remove unnecessary fields.
  - Tokenized the text and removed stopwords.
- Feature Extraction:
  - Used TF-IDF and word embeddings to convert the text into numerical features.
- Modeling:
  - Applied clustering algorithms such as K-Means to group similar abstracts.
  - Utilized topic modeling techniques like LDA (Latent Dirichlet Allocation) for discovering topics.
- Evaluation:
  - Analyzed the resulting clusters and topics to assess their coherence and relevance.
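The following is a minimal, self-contained sketch of this pipeline using scikit-learn. It illustrates the techniques listed above, not the exact code in `src/`; parameter choices (such as the number of clusters and topics) are illustrative assumptions, and the embeddings variant is omitted for brevity.

```python
# Minimal sketch of the pipeline described above (scikit-learn only).
import re

from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import (
    ENGLISH_STOP_WORDS,
    CountVectorizer,
    TfidfVectorizer,
)
from sklearn.metrics import silhouette_score

# Toy stand-ins for the NSF abstracts loaded from data/raw.
abstracts = [
    "Support for research on distributed algorithms and networks.",
    "A study of protein folding and molecular dynamics simulation.",
    "Graduate training in computational linguistics and NLP.",
    "Funding for high-energy particle physics experiments.",
]

# 1. Preprocessing: lowercase, keep alphabetic tokens, drop stopwords.
def preprocess(text: str) -> str:
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)

cleaned = [preprocess(doc) for doc in abstracts]

# 2. Feature extraction: TF-IDF vectors for clustering.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(cleaned)

# 3a. Modeling: K-Means on the TF-IDF features.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_tfidf)

# 3b. Modeling: LDA on raw term counts to discover topics.
counts = CountVectorizer()
X_counts = counts.fit_transform(cleaned)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda.fit(X_counts)

# 4. Evaluation: silhouette score as a simple cluster-quality signal.
print("silhouette:", silhouette_score(X_tfidf, labels))

# Inspect the top words per LDA topic.
vocab = counts.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [vocab[i] for i in weights.argsort()[::-1][:5]]
    print(f"topic {k}: {top}")
```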
The results of the clustering and topic modeling can be found in the notebooks/analysis.ipynb notebook. Although the results are not expected to be perfect, they provide a better understanding of the themes in the abstracts and demonstrate the application of several NLP techniques. In addition, the results are written to the refined data layer (simulating the Data Lake architecture).
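For example, persisting cluster assignments to the refined layer could look like this (the file name and CSV format are assumptions, not the project's actual schema):

```python
import csv
from pathlib import Path

# Hypothetical example: write cluster assignments to the refined layer.
refined_dir = Path("data/refined")
refined_dir.mkdir(parents=True, exist_ok=True)

with open(refined_dir / "cluster_assignments.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["abstract_id", "cluster"])
    writer.writerows(enumerate(labels))  # labels from the sketch above
```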
The trained models and their configurations are saved in the `models` directory. You can load and evaluate these models using the scripts provided in the `src` directory.
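A trained model could be reloaded for evaluation along these lines (the file name and the use of `joblib` for serialization are assumptions about how this project saves models):

```python
from pathlib import Path

import joblib

# Hypothetical example: the file name "kmeans.joblib" and the joblib
# format are assumptions, not the project's confirmed conventions.
model = joblib.load(Path("models") / "kmeans.joblib")

new_docs = ["Research on quantum computing architectures."]
# Reuse the fitted vectorizer that produced the training features
# (tfidf from the pipeline sketch above).
print(model.predict(tfidf.transform(new_docs)))
```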
To run the main file, use the following commands:

```bash
cd src
python main.py
```
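If the configuration is managed with Hydra (which the `config/` layout with `main.yaml` plus `model`/`process` groups and the `src/outputs` directory suggest, though this is an assumption), individual parameters could be overridden from the command line; the parameter name below is hypothetical:

```bash
python main.py model.n_clusters=10  # hypothetical Hydra-style override
```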
Happy clustering!