This project performs topic modeling on Reddit posts using BERTopic. It retrieves and processes data from Reddit, applies topic modeling, and visualizes the key topics discussed within a subreddit.
- Data Retrieval: Extracts posts and comments from Reddit
- Text Preprocessing: Cleans and prepares text for analysis
- Topic Modeling: Uses BERTopic to identify key discussion topics
- Visualization: Generates visual representations of the extracted topics
- Document Labelling: Generates a labelled JSON file that assigns each document to its topic (example at the end)
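
As an illustration of the labelling step in the list above, here is a minimal sketch of how documents could be grouped by topic and written out as JSON. It assumes a list of topic ids with one entry per document, which is the shape BERTopic's `fit_transform` returns (with `-1` marking outliers). The function name `label_documents`, the `topic_names` mapping, and the output field names are hypothetical and may not match the file this project actually produces.

```python
import json
from collections import defaultdict


def label_documents(docs, topics, topic_names=None):
    """Group documents by their assigned topic id.

    docs        -- list of document strings
    topics      -- list of integer topic ids, one per document
    topic_names -- optional {topic_id: label} mapping (hypothetical helper input)
    """
    if len(docs) != len(topics):
        raise ValueError("docs and topics must have the same length")
    topic_names = topic_names or {}

    # Collect the documents that belong to each topic id.
    grouped = defaultdict(list)
    for doc, topic_id in zip(docs, topics):
        grouped[topic_id].append(doc)

    # Emit one record per topic, ordered by topic id.
    return [
        {
            "topic_id": topic_id,
            "label": topic_names.get(topic_id, f"topic_{topic_id}"),
            "documents": grouped[topic_id],
        }
        for topic_id in sorted(grouped)
    ]


# Toy data standing in for real posts and model output.
docs = ["gpu prices are falling", "best budget gpu?", "lol"]
topics = [0, 0, -1]
labelled = label_documents(docs, topics, {0: "gpu_pricing", -1: "outliers"})
print(json.dumps(labelled, indent=2))
```

Keeping the outlier topic as its own record, rather than dropping it, makes it easy to check how many documents the model could not assign.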
Reddit Project
├── README.md
├── Data_retrieval.ipynb   # Fetches Reddit data using the API
├── main.ipynb             # Preprocessing and topic modeling
├── data/                  # Stores raw and processed data
├── models/                # Trained topic models
└── visualizations/        # Graphs and topic word clouds
git clone https://github.com/Nikhil-1705/reddit_topic_modelling.git
cd reddit_topic_modelling
python -m venv venv
source venv/bin/activate # Mac/Linux
venv\Scripts\activate # Windows
pip install -r requirements.txt
- Run Data Retrieval Notebook
jupyter notebook Data_retrieval.ipynb
- Run Topic Modeling Notebook
jupyter notebook main.ipynb
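
The exact cleaning steps inside `main.ipynb` are not listed in this README. The following is only a rough sketch of the kind of light preprocessing that is commonly applied to Reddit text before topic modeling. The function names `clean_text` and `preprocess`, the regular expressions, and the `min_tokens` threshold are all assumptions for illustration. Heavier steps such as stemming or stop-word removal are left out on purpose, since embedding-based models generally work better on natural sentences.

```python
import re

# Patterns for common Reddit noise (illustrative, not exhaustive).
MD_LINK_RE = re.compile(r"\[([^\]]+)\]\([^)]*\)")   # [text](url) -> text
URL_RE = re.compile(r"https?://\S+")                # bare URLs
MENTION_RE = re.compile(r"(?<!\w)/?[ur]/[\w-]+")    # u/name and r/subreddit
NON_TEXT_RE = re.compile(r"[^a-z0-9\s']")           # punctuation, symbols
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Return a lowercased copy of `text` with links, mentions and symbols removed."""
    text = MD_LINK_RE.sub(r"\1", text)   # keep the link text, drop the target
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.lower()
    text = NON_TEXT_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


def preprocess(posts, min_tokens=3):
    """Clean every post and drop those with fewer than `min_tokens` words."""
    cleaned = (clean_text(post) for post in posts)
    return [doc for doc in cleaned if len(doc.split()) >= min_tokens]


sample = (
    "Check [this guide](https://example.com/x) from u/some_user "
    "in r/Python!! https://t.co/abc  It's GREAT."
)
print(clean_text(sample))
print(preprocess([sample, "lol", "   "]))
```

Very short posts are dropped because they carry little signal for clustering; the threshold is a tunable guess rather than a recommended value.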
- Word clouds showing the top words in each topic
- Bar charts ranking the most common topics
- Tables listing the top keywords per topic
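
The ranking behind a "most common topics" chart is just a count of documents per topic. The sketch below shows that idea using only the standard library and a plain-text bar chart in place of a real plotting backend. The helper names `rank_topics` and `text_bar_chart` and the toy `topics` list are hypothetical; the notebooks may build their charts differently.

```python
from collections import Counter


def rank_topics(topics, skip_outliers=True):
    """Return (topic_id, document_count) pairs, most frequent first.

    Topic id -1 is treated as the outlier bucket and skipped by default.
    """
    counts = Counter(
        t for t in topics if not (skip_outliers and t == -1)
    )
    return counts.most_common()


def text_bar_chart(ranked, width=20):
    """Render ranked topic counts as a plain-text horizontal bar chart."""
    if not ranked:
        return ""
    largest = ranked[0][1]
    lines = []
    for topic_id, count in ranked:
        # Scale each bar relative to the largest topic.
        bar = "#" * max(1, round(width * count / largest))
        lines.append(f"topic {topic_id:>3} | {bar} {count}")
    return "\n".join(lines)


# Toy topic assignments, one id per document.
topics = [0, 0, 1, -1, 0, 2, 1, -1]
ranked = rank_topics(topics)
print(text_bar_chart(ranked))
```

The same `ranked` list could be passed to any plotting library to produce the actual bar chart.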
Contributions are welcome! Feel free to fork the repository, create a new branch, and submit a pull request.
This project is licensed under the MIT License.
Developed by Nikhil Bhandari