This repository contains a project from my MSc coursework in Machine Learning on Big Data at the University of East London.
The goal is to build and evaluate a multi-class text classifier for news headlines using the PySpark ML pipeline API.
The project demonstrates how to take a large labelled text dataset, apply distributed preprocessing and feature extraction, and train a scalable classifier on top of it.
⚠️ Note on academic integrity:
This repository is a restructured, summarised version of work originally completed as part of a university assignment. It is shared here purely as a portfolio of practical skills in Machine Learning on Big Data. Anyone using this material for their own coursework should adapt, extend, and properly acknowledge it rather than copying directly.
- Load a large labelled news dataset into Spark.
- Apply text preprocessing at scale (cleaning, tokenisation, stop‑word removal).
- Learn dense vector representations of headlines using Word2Vec (or another vectoriser).
- Train a multi‑class classifier (e.g. Logistic Regression) to predict news categories.
- Evaluate performance using standard metrics on a held‑out test set.
- News Category Dataset (Misra, 2022) – a JSON file with ~200k news headlines and their category labels. [Available at: https://www.kaggle.com/datasets/rmisra/news-category-dataset]
- In the original coursework the dataset was stored on HDFS, e.g.
  `hdfs://localhost:9000/AssignmentDatasets/News_Category_Dataset_v3.json`
The dataset is not included in this repository.
Please download it separately and adjust the input path in the script.
- Python 3.x
- Apache Spark / PySpark (MLlib)
- Optional: `pandas` / `scikit-learn` for further analysis of the results.
```
news-multiclass-classification/
├── README.md
├── requirements.txt
├── .gitignore
├── screenshots/
└── src/
    └── news_multiclass_classifier.py
```
`news_multiclass_classifier.py` – end-to-end pipeline for preprocessing, feature extraction, model training, and evaluation.
```
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
```

Make sure Spark is installed and accessible to PySpark.
In `src/news_multiclass_classifier.py`, update:
- `NEWS_DATA_PATH` – path to the JSON news dataset.
- `TRAIN_TEST_SPLIT` – train/test proportions (default 0.8/0.2).
- Any other hyperparameters (vector size, regularisation, etc.).
```
spark-submit src/news_multiclass_classifier.py
```

The script will:
- Load and preprocess the headlines.
- Build a pipeline of `Tokenizer` / `StopWordsRemover` / `Word2Vec` (or similar), plus a `StringIndexer` for labels.
- Train a logistic regression classifier.
- Evaluate accuracy and macro F1 on the test split.
- Print some example predictions.
These screenshots from my original coursework show the outputs produced at each stage of the work.
- Dataset Pre-processing Steps – output showing successful loading and pre-processing of the data.
- String Indexing for Labels (Categories) – the top 5 rows' categories (topics) and their indices.
- Classifier – Logistic Regression.
- Evaluation – test accuracy output: 42%.
One likely reason for the moderate accuracy is that, although the model is trained strictly for single-label (multi-class) classification, assigning each headline to exactly one category, many headlines in the dataset could reasonably belong to several categories at once. The current approach cannot capture this inherent multi-label ambiguity. In addition, using only the headline text limits the contextual information available for classification.
Accuracy could likely be improved by also including the `short_description` field from the dataset, which gives a fuller representation of the article content. Fine-tuning, or introducing deep-learning embeddings such as BERT, could improve it further.
- Confusion Matrix Overview.
- The pipeline structure follows the design described in my Assignment report, where the focus was on showing good practice in scalable ML pipelines rather than pushing for the very best possible accuracy.
- You can easily extend this by swapping in other classifiers (Random Forest, Linear SVM, etc.) or using TF‑IDF features.