sameer-at-git/BBC-News-Topic-Modelling

BBC News Topic Modelling

Project Overview

This project demonstrates an end-to-end NLP pipeline: scraping BBC news articles, preprocessing the text, and applying topic modelling techniques such as Latent Dirichlet Allocation (LDA) and Dynamic Topic Modelling (DTM) to discover hidden themes in the data.

It is designed both as a research project and a portfolio piece to showcase skills in data collection, cleaning, and unsupervised NLP.
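
As a rough illustration of the preprocessing and LDA steps named above, the following Python sketch tokenizes a few made-up headlines, removes stopwords, and fits LDA with a small collapsed Gibbs sampler. It is not the code in src/final-project.R. The sample headlines, the stopword list, the function names `preprocess` and `lda_gibbs`, and the hyperparameters are all assumptions chosen for illustration. The sketch skips stemming/lemmatization, and a real run would more likely rely on an established topic-modelling library.

```python
import random
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real pipelines use fuller lists).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for",
             "is", "are", "was", "with", "as", "at", "by", "it", "after"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, drop stopwords and very short tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]


def lda_gibbs(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA on tokenized documents with collapsed Gibbs sampling."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]          # tokens of doc d in topic k
    topic_word = [Counter() for _ in range(num_topics)]   # count of word w in topic k
    topic_total = [0] * num_topics                        # tokens assigned to topic k
    assign = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][j] + alpha)
                    * (topic_word[j][w] + beta)
                    / (topic_total[j] + vocab_size * beta)
                    for j in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    # Smoothed per-document topic proportions.
    theta = [
        [(n + alpha) / (len(doc) + num_topics * alpha) for n in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    # Most frequent words currently assigned to each topic.
    top_terms = [
        [w for w, c in topic_word[k].most_common(5) if c > 0]
        for k in range(num_topics)
    ]
    return theta, top_terms


# Hypothetical headlines, not taken from the repository's dataset.
headlines = [
    "The markets rallied after the bank's announcement.",
    "Bank shares and markets fall as inflation rises",
    "The team wins the cup final after a late goal",
    "Late goal sends the team to the cup final",
]
docs = [preprocess(h) for h in headlines]
theta, top_terms = lda_gibbs(docs, num_topics=2)
for k, terms in enumerate(top_terms):
    print(f"topic {k}: {terms}")
```

Here `theta[d]` is the estimated topic mixture of document `d`, and `top_terms[k]` lists the words most often assigned to topic `k`. With this little text the topics are not meaningful; the point is only the shape of the computation.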


Highlights

  • Web scraping of BBC news articles (converted into a CSV dataset).
  • Text preprocessing: tokenization, stopword removal, stemming/lemmatization.
  • Topic Modelling:
    • LDA (Latent Dirichlet Allocation) for static topics.
    • DTM (Dynamic Topic Modelling) for temporal analysis.
  • Visualizations:
    • Top terms per topic
    • Topic distribution
    • Interactive pyLDAvis plot
  • Final interpreted results saved as a CSV file and in the report.
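
Dynamic topic models work on documents grouped into ordered time slices, so the temporal analysis needs a step that buckets articles by period. A minimal sketch, assuming each scraped article has a publication date. The field names `published` and `text`, the monthly granularity, and the sample rows are hypothetical and not taken from the repository's dataset:

```python
from collections import Counter
from datetime import date

# Hypothetical rows standing in for the scraped CSV.
articles = [
    {"published": date(2024, 3, 2), "text": "election results announced"},
    {"published": date(2024, 1, 15), "text": "markets rally on rate cut"},
    {"published": date(2024, 1, 28), "text": "team wins cup final"},
]


def monthly_time_slices(rows):
    """Sort rows chronologically and count how many fall in each (year, month)."""
    ordered = sorted(rows, key=lambda r: r["published"])
    counts = Counter((r["published"].year, r["published"].month) for r in ordered)
    periods = sorted(counts)
    return ordered, periods, [counts[p] for p in periods]


ordered, periods, slice_sizes = monthly_time_slices(articles)
print(periods)      # [(2024, 1), (2024, 3)]
print(slice_sizes)  # [2, 1]
```

The chronologically ordered documents together with the per-period counts are the kind of input that time-sliced implementations expect. For example, gensim's `LdaSeqModel` takes a `time_slice` list of document counts per period.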

Repository Structure

├─ data/                  # Raw & processed datasets (CSV files)
├─ src/                   # Main R script: final-project.R
├─ notebooks/             # (Optional) Jupyter or R notebooks for demos
├─ images/                # Visual outputs (ldaVis, topic distribution, etc.)
├─ report/                # Report.docx or PDF
├─ requirements.txt       # Python dependencies
├─ environment.yml        # Conda environment file
├─ .gitignore             # Ignore unnecessary files
└─ README.md              # Project documentation

Quick Start

1. Clone the repository

git clone https://github.com/sameer-at-git/BBC-News-Topic-Modelling.git
cd BBC-News-Topic-Modelling

2. Setup environment

Option A: pip

python -m venv venv
source venv/bin/activate     # Linux/macOS
# venv\Scripts\activate      # Windows: run this instead of the line above
pip install -r requirements.txt

Option B: Conda

conda env create -f environment.yml
conda activate topic-modelling

3. Run the analysis

  • Open src/final-project.R in RStudio and run the workflow, or
  • Open a demo notebook from notebooks/.

Example Outputs

Content Distribution

[Figure: Content Distribution]

Topic Distribution

[Figure: Topic Distribution]

Interactive pyLDAvis Visualization

[Figure: pyLDAvis]


Report

The detailed methodology, results, and interpretations are available in report/Report.docx.


License

This project is released under the MIT License. You are free to use and adapt it with proper attribution.

About

BBC news dataset pipeline: data collection, cleaning, and topic modelling with LDA/DTM