This project demonstrates an end-to-end NLP pipeline: scraping BBC news articles, preprocessing the text, and applying topic modelling techniques such as Latent Dirichlet Allocation (LDA) and Dynamic Topic Modelling (DTM) to discover hidden themes in the data.
It is designed both as a research project and as a portfolio piece to showcase skills in data collection, cleaning, and unsupervised NLP.
- Web scraping of BBC news articles (converted into a CSV dataset).
- Text preprocessing: tokenization, stopword removal, stemming/lemmatization.
- Topic Modelling:
  - LDA (Latent Dirichlet Allocation) for static topics.
  - DTM (Dynamic Topic Modelling) for temporal analysis.
- Visualizations:
  - Top terms per topic
  - Topic distribution
  - Interactive pyLDAvis plot
- Final interpreted results saved in CSV & report.
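
To make the preprocessing and LDA steps listed above more concrete, here is a minimal, standard-library-only Python sketch. It is an illustration under stated assumptions, not the project's actual implementation (the main workflow lives in `src/final-project.R`). The stopword list, the crude suffix stripping, the function names `preprocess`, `fit_lda`, and `top_terms`, the hyperparameter defaults, and the sample headlines are all hypothetical. The sampler is a plain collapsed Gibbs sampler for LDA; a real run would more likely rely on an established topic modelling library.

```python
import random
import re
from collections import Counter

# Tiny illustrative stopword list (assumption; real pipelines use a fuller one).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "on", "for",
             "is", "are", "was", "with"}


def preprocess(text):
    """Lowercase, tokenize on letters, drop stopwords, strip common suffixes."""
    cleaned = []
    for tok in re.findall(r"[a-z]+", text.lower()):
        if tok in STOPWORDS or len(tok) < 3:
            continue
        # Crude suffix stripping, a stand-in for a real stemmer or lemmatizer.
        for suffix in ("ing", "ed", "es", "s"):
            if tok.endswith(suffix) and len(tok) - len(suffix) >= 3:
                tok = tok[: -len(suffix)]
                break
        cleaned.append(tok)
    return cleaned


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over already-tokenized documents."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # tokens per (doc, topic)
    topic_word = [Counter() for _ in range(num_topics)]  # tokens per (topic, word)
    topic_total = [0] * num_topics                       # tokens per topic
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = [rng.randrange(num_topics) for _ in doc]
        for w, k in zip(doc, zs):
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(zs)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_terms(topic_word, n=5):
    """Return up to n highest-count words for each topic."""
    return [[w for w, c in counts.most_common(n) if c > 0]
            for counts in topic_word]


if __name__ == "__main__":
    # Made-up headlines for illustration only, not taken from the dataset.
    headlines = [
        "Banks raised interest rates as markets rallied",
        "The team won the football match in extra time",
        "Markets and banks watched inflation figures",
        "Football fans celebrated the winning goal",
    ]
    docs = [preprocess(h) for h in headlines]
    doc_topic, topic_word = fit_lda(docs, num_topics=2)
    for k, terms in enumerate(top_terms(topic_word)):
        print(f"Topic {k}: {', '.join(terms)}")
```

The per-document rows of `doc_topic` correspond to the "topic distribution" view listed above, and `top_terms` corresponds to the "top terms per topic" view. On a corpus this small the topics are not meaningful; the point is only to show the shape of the computation.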
├─ data/ # Raw & processed datasets (CSV files)
├─ src/ # Main R script: final-project.R
├─ notebooks/ # (Optional) Jupyter or R notebooks for demos
├─ images/ # Visual outputs (ldaVis, topic distribution, etc.)
├─ report/ # Report.docx or PDF
├─ requirements.txt # Python dependencies
├─ environment.yml # Conda environment file
├─ .gitignore # Ignore unnecessary files
└─ README.md # Project documentation
git clone https://github.com/<your-username>/bbc-news-topic-modelling.git
cd bbc-news-topic-modelling

Option A: pip
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
pip install -r requirements.txt

Option B: Conda
conda env create -f environment.yml
conda activate topic-modelling

- Open src/final-project.R in RStudio and run the workflow, or
- Open a demo notebook from notebooks/.
The detailed methodology, results, and interpretations are available in report/Report.docx.
This project is released under the MIT License; you are free to use and adapt it with proper attribution.


