BBC News Topic Modelling

Project Overview

This project demonstrates an end-to-end NLP pipeline: scraping BBC news articles, preprocessing the text, and applying topic modelling techniques such as Latent Dirichlet Allocation (LDA) and Dynamic Topic Modelling (DTM) to discover hidden themes in the data.

It is designed both as a research project and a portfolio piece to showcase skills in data collection, cleaning, and unsupervised NLP.
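
The preprocessing stage of the pipeline can be illustrated with a minimal sketch. It is not the project's own code (the main workflow lives in src/final-project.R); the function name `preprocess`, the tiny stopword list, and the `min_len` cutoff are assumptions made for illustration, and stemming/lemmatization is left out because it normally relies on an external library.

```python
import re

# Tiny illustrative stopword list (an assumption; a real run would use a
# fuller list from an NLP library).
STOPWORDS = {
    "a", "an", "and", "are", "as", "at", "by", "for", "in",
    "is", "it", "of", "on", "the", "to", "was", "with",
}

# A token is a run of letters, optionally followed by an apostrophe suffix.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def preprocess(text, stopwords=STOPWORDS, min_len=3):
    """Lowercase the text, tokenize it, and drop stopwords and short tokens."""
    tokens = TOKEN_RE.findall(text.lower())
    return [t for t in tokens if t not in stopwords and len(t) >= min_len]


if __name__ == "__main__":
    # Made-up sample sentence, not taken from the dataset.
    sample = "The Bank of England is expected to hold interest rates."
    print(preprocess(sample))
    # prints ['bank', 'england', 'expected', 'hold', 'interest', 'rates']
```

The resulting token lists are the usual input for a topic model: one list per article.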

Highlights

Web scraping of BBC news articles (converted into a CSV dataset).
Text preprocessing: tokenization, stopword removal, stemming/lemmatization.
Topic Modelling:
- LDA (Latent Dirichlet Allocation) for static topics.
- DTM (Dynamic Topic Modelling) for temporal analysis.
Visualizations:
- Top terms per topic
- Topic distribution
- Interactive pyLDAvis plot
Final interpreted results saved as CSV and summarized in the report.
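
To make the static LDA step and the "top terms per topic" output more concrete, here is a small pure-Python sketch of LDA fitted by collapsed Gibbs sampling. It is an illustration under stated assumptions, not the method used in src/final-project.R: the names `fit_lda` and `top_terms`, the hyperparameters `alpha` and `beta`, the iteration count, and the toy documents are all hypothetical. A real analysis would typically rely on an established library rather than this loop.

```python
import random


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on a list of token lists.

    Returns (vocab, topic_word, doc_topic), where topic_word[k][v] estimates
    P(word v | topic k) and doc_topic[d][k] estimates P(topic k | document d).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    n_dk = [[0] * K for _ in docs]      # topic counts per document
    n_kw = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total tokens per topic
    assignments = []                    # current topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(K)
            z_doc.append(k)
            n_dk[d][k] += 1
            n_kw[k][word_id[w]] += 1
            n_k[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v, k = word_id[w], assignments[d][i]
                # Remove this token's current assignment from the counts.
                n_dk[d][k] -= 1
                n_kw[k][v] -= 1
                n_k[k] -= 1
                # Unnormalised P(topic | all other assignments).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][v] + beta) / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][v] += 1
                n_k[k] += 1

    topic_word = [
        [(n_kw[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
        for k in range(K)
    ]
    doc_topic = [
        [(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    return vocab, topic_word, doc_topic


def top_terms(vocab, topic_word, n=3):
    """Return the n highest-probability words for each topic."""
    return [
        [vocab[v] for v in sorted(range(len(vocab)), key=lambda v: -row[v])[:n]]
        for row in topic_word
    ]


if __name__ == "__main__":
    # Made-up toy documents, already tokenized; not taken from the dataset.
    docs = [
        ["bank", "rates", "inflation", "bank"],
        ["match", "goal", "league", "goal"],
        ["inflation", "rates", "bank"],
        ["league", "match", "goal"],
    ]
    vocab, topic_word, doc_topic = fit_lda(docs, num_topics=2)
    for k, terms in enumerate(top_terms(vocab, topic_word)):
        print(f"topic {k}: {terms}")
```

Each row of `doc_topic` sums to 1 and gives the topic mixture of one document, which is the kind of quantity a topic distribution plot summarizes. The sampler is seeded, so repeated runs with the same `seed` give the same assignments.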

Repository Structure

├─ data/                  # Raw & processed datasets (CSV files)
├─ src/                   # Main R script: final-project.R
├─ notebooks/             # (Optional) Jupyter or R notebooks for demos
├─ images/                # Visual outputs (ldaVis, topic distribution, etc.)
├─ report/                # Report.docx or PDF
├─ requirements.txt       # Python dependencies
├─ environment.yml        # Conda environment file
├─ .gitignore             # Ignore unnecessary files
└─ README.md              # Project documentation

Quick Start

1. Clone the repository

git clone https://github.com/<your-username>/bbc-news-topic-modelling.git
cd bbc-news-topic-modelling

2. Set up the environment

Option A: pip

python -m venv venv
source venv/bin/activate   # Linux/macOS
venv\Scripts\activate      # Windows
pip install -r requirements.txt

Option B: Conda

conda env create -f environment.yml
conda activate topic-modelling

3. Run the analysis

Open src/final-project.R in RStudio and run the workflow, or open a demo notebook from notebooks/.

Example Outputs

Content Distribution

Topic Distribution

Interactive pyLDAvis Visualization

Report

The detailed methodology, results, and interpretations are available in report/Report.docx.

License

This project is released under the MIT License; you are free to use and adapt it with proper attribution.

Repository: sameer-at-git/BBC-News-Topic-Modelling

Folders and files

Latest commit

History

Repository files navigation

BBC News Topic Modelling

Project Overview

Highlights

Repository Structure

Quick Start

1. Clone the repository

2. Setup environment

3. Run the analysis

Example Outputs

Content Distribution

Topic Distribution

Interactive pyLDAvis Visualization

Report

License

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Languages