This repository contains a project from my MSc coursework in Machine Learning on Big Data at the University of East London.
The goal is to build and evaluate a multi-class text classifier for news headlines using the PySpark ML pipeline API.
The project demonstrates how to take a large labelled text dataset, apply distributed preprocessing and feature extraction, and train a scalable classifier on top of it.
⚠️ Note on academic integrity:
This repository is a restructured, summarised version of work originally completed as part of a university assignment. It is shared here purely as a portfolio of practical skills in Machine Learning on Big Data. Anyone using this material for their own coursework should adapt, extend, and properly acknowledge it rather than copying directly.
- Load a large labelled news dataset into Spark.
- Apply text preprocessing at scale (cleaning, tokenisation, stop‑word removal).
- Learn dense vector representations of headlines using Word2Vec (or another vectoriser).
- Train a multi‑class classifier (e.g. Logistic Regression) to predict news categories.
- Evaluate performance using standard metrics on a held‑out test set.
- News Category Dataset (Misra, 2022) – a JSON file with ~200k news headlines and their category labels. [Available at: https://www.kaggle.com/datasets/rmisra/news-category-dataset]
- In the original coursework the dataset was stored on HDFS, e.g.
  `hdfs://localhost:9000/AssignmentDatasets/News_Category_Dataset_v3.json`
The dataset is not included in this repository.
Please download it separately and adjust the input path in the script.
- Python 3.x
- Apache Spark / PySpark (MLlib)
- Optional: `pandas` / `scikit-learn` for further analysis of the results.
```
news-multiclass-classification/
├── README.md
├── requirements.txt
├── .gitignore
├── screenshots/
└── src/
    └── news_multiclass_classifier.py
```
`news_multiclass_classifier.py` – end-to-end pipeline for preprocessing, feature extraction, model training, and evaluation.
```
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Make sure Spark is installed and accessible to PySpark.
In `src/news_multiclass_classifier.py`, update:
- `NEWS_DATA_PATH` – path to the JSON news dataset.
- `TRAIN_TEST_SPLIT` – train/test proportions (default 0.8/0.2).
- Any other hyperparameters (vector size, regularisation, etc.).
```
spark-submit src/news_multiclass_classifier.py
```

The script will:
- Load and preprocess the headlines.
- Build a pipeline of `Tokenizer` / `StopWordsRemover` / `Word2Vec` (or similar), plus a `StringIndexer` for labels.
- Train a logistic regression classifier.
- Evaluate accuracy and macro F1 on the test split.
- Print some example predictions.
These screenshots from my original coursework show the outputs produced at each stage of the work.
- Dataset Pre-processing Steps – output showing successful loading and pre-processing of the data.
- String Indexing for Labels (Categories) – the top 5 rows' categories (topics) and their indices.
- Classifier – Logistic Regression.
- Evaluation – test accuracy output: 42%.
One likely reason for the moderate accuracy is that, although the model is trained strictly for single-label (multi-class) classification, assigning each headline to exactly one category, many headlines in the dataset could reasonably belong to several categories at once. The current approach cannot capture this inherent multi-label ambiguity. In addition, using only the headline text limits the contextual information available for classification.
Accuracy could likely be improved by also including the `short_description` field from the dataset, which gives a fuller representation of the article content. Fine-tuning, or introducing deep-learning embeddings such as BERT, could improve it further.
- Confusion Matrix Overview.
- The pipeline structure follows the design described in my Assignment report, where the focus was on showing good practice in scalable ML pipelines rather than pushing for the very best possible accuracy.
- You can easily extend this by swapping in other classifiers (Random Forest, Linear SVM, etc.) or using TF‑IDF features.