🚀 NSF Research Awards Abstracts

Welcome to the NSF Research Awards Abstracts project! 🎉 This repository contains my solution for clustering abstracts into topics based on their semantic similarity using unsupervised learning techniques.

🗂️ Project Structure (using Cookiecutter 🍪)

config
│ ├── model
│ │ └── model.yaml
│ ├── process
│ │ └── preprocessing.yaml
│ └── main.yaml
data
│ ├── raw
│ ├── refined
│ └── trusted
docs
models
notebooks
│ └── analysis.ipynb
src
│ ├── __pycache__
│ ├── mlruns
│ ├── outputs
│ ├── main.py
│ ├── model.py
│ └── processing.py
tests
.gitignore
.pre-commit-config.yaml
Makefile
poetry.lock
pyproject.toml
README.md

🛠️ Setup

To get started with the project, please follow these steps:

  1. Clone this repository:
    git clone https://github.com/Ramseths/nsf_research_awards.git
  2. Install the dependencies:
    poetry install

📊 Data

The data directory simulates a Data Lake architecture, with raw, trusted, and refined layers.

The data for this project consists of research award abstracts provided by the NSF (National Science Foundation). The abstracts are stored in the data/raw directory.
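
As a rough illustration of the layered layout, the sketch below resolves the directory for each layer. The helper name `layer_path` is hypothetical, and the raw, trusted, refined ordering is an assumption based on common Data Lake conventions; the repository's own code may organize this differently.

```python
from pathlib import Path

# Layer names taken from the data/ directory in the project structure.
# The raw -> trusted -> refined ordering is an assumption, not stated in the README.
LAYERS = ("raw", "trusted", "refined")


def layer_path(root: Path, layer: str) -> Path:
    """Return the directory for one data-lake layer under <root>/data (hypothetical helper)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer {layer!r}; expected one of {LAYERS}")
    return root / "data" / layer


if __name__ == "__main__":
    project_root = Path(".")  # assumption: run from the repository root
    for layer in LAYERS:
        print(layer, "->", layer_path(project_root, layer))
```

Keeping the layer names in one place means a typo such as "refind" fails early instead of silently creating a new directory.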

🚀 Approach

In this project I combined traditional and state-of-the-art NLP techniques to uncover themes in the abstracts. The main approach is LDA (Latent Dirichlet Allocation); the alternative combines embeddings with K-Means.

  1. Data Preprocessing:

    • Cleaned and preprocessed the text data to remove unnecessary fields.
    • Tokenized the text and removed stopwords.
  2. Feature Extraction:

    • Used TF-IDF and word embeddings to convert text into numerical features.
  3. Modeling:

    • Applied clustering algorithms such as K-Means to group similar abstracts.
    • Utilized topic modeling techniques like LDA (Latent Dirichlet Allocation) for discovering topics.
  4. Evaluation:

    • Analyzed the resulting clusters and topics to understand their coherence and relevance.
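
The preprocessing, TF-IDF, and K-Means steps above can be sketched with the Python standard library alone. This is a minimal illustration under stated assumptions, not the code in src/processing.py or src/model.py: the stopword list, the function names, and the sample abstracts are all made up for the example, and the LDA and embedding variants are not covered.

```python
import math
import random
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real run would use a fuller one.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "for", "on", "is", "this", "with"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def tfidf_vectors(docs):
    """Return (vocabulary, L2-normalised TF-IDF vectors) for tokenized docs."""
    vocab = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    # Smoothed IDF, so terms present in every document keep a non-zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


def kmeans(vectors, k, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: each non-empty cluster's centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


if __name__ == "__main__":
    abstracts = [  # made-up sample text, not NSF data
        "Deep learning models for protein structure prediction",
        "Neural networks predict protein folding structure",
        "Ocean temperature and climate change in the Arctic",
        "Arctic sea ice loss under climate warming",
    ]
    tokens = [tokenize(a) for a in abstracts]
    vocab, vectors = tfidf_vectors(tokens)
    print(kmeans(vectors, k=2))
```

Normalizing each TF-IDF vector to unit length keeps long abstracts from dominating the Euclidean distances used by K-Means. The cluster labels themselves are arbitrary integers; only the grouping is meaningful.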

📈 Results

The results of clustering and topic modeling can be found in the notebooks/analysis.ipynb notebook. The results are not expected to be perfect, but they give a clearer picture of the themes in the abstracts and demonstrate the application of various NLP techniques. The results are also written to the refined data layer (simulating a Data Lake architecture).

🤖 Models

The trained models and their configurations are saved in the models directory. You can load and evaluate these models using the provided scripts in the src directory.

🧪 Run project

To run the main file, use the following command:

cd src
python main.py

Happy clustering!
