An MSc project exploring large-scale news headline classification with PySpark, implementing an end-to-end ML pipeline (tokenisation, word embeddings and multi-class classification) to train and evaluate a scalable text classifier on a real-world news dataset.

DrFarouk/news-multiclass-classification

Multi‑Class Classification of News Headlines with PySpark

This repository contains a project from my MSc Machine Learning on Big Data coursework at the University of East London.
The goal is to build and evaluate a multi‑class text classifier for news headlines using the PySpark ML Pipeline API.

The project demonstrates how to take a large labelled text dataset, apply distributed preprocessing and feature extraction, and train a scalable classifier on top of it.


⚠️ Note on academic integrity:
This repository is a restructured, summarised version of work originally completed as part of a university assignment. It is shared here purely as a portfolio of practical skills in Machine Learning on Big Data. Anyone using this material for their own coursework should adapt, extend, and properly acknowledge it rather than copying directly.


Objectives

  • Load a large labelled news dataset into Spark.
  • Apply text preprocessing at scale (cleaning, tokenisation, stop‑word removal).
  • Learn dense vector representations of headlines using Word2Vec (or another vectoriser).
  • Train a multi‑class classifier (e.g. Logistic Regression) to predict news categories.
  • Evaluate performance using standard metrics on a held‑out test set.

Dataset

  • News Category Dataset (Misra, 2022) – JSON file with ~200k news headlines and their category labels. Available at: https://www.kaggle.com/datasets/rmisra/news-category-dataset
  • In the original coursework this was stored on HDFS, e.g.
    hdfs://localhost:9000/AssignmentDatasets/News_Category_Dataset_v3.json

The dataset is not included in this repository.
Please download it separately and adjust the input path in the script.
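
A minimal sketch of the loading step, assuming the file is in JSON Lines format (one JSON object per line) with `headline` and `category` fields. The helper names `parse_record` and `load_headlines` are hypothetical and not taken from the project script, and the sample record is invented purely to show the expected shape.

```python
import json

# Invented example record; only the field names are assumed to match the dataset.
SAMPLE_LINE = (
    '{"category": "TECH", '
    '"headline": "Example headline about a new phone", '
    '"short_description": "Example description."}'
)


def parse_record(line):
    """Parse one JSON Lines record into a (headline, category) pair."""
    record = json.loads(line)
    return record["headline"], record["category"]


def load_headlines(spark, path):
    """Read the dataset with an existing SparkSession and keep two columns.

    spark.read.json expects one JSON object per line by default.
    Rows with a missing headline or category are dropped.
    """
    df = spark.read.json(path)
    return df.select("headline", "category").dropna()


if __name__ == "__main__":
    print(parse_record(SAMPLE_LINE))
```

With a running session, `load_headlines(spark, NEWS_DATA_PATH)` would return a DataFrame ready for the preprocessing stages.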


Tech Stack

  • Python 3.x
  • Apache Spark / PySpark (MLlib)
  • Optional: pandas/sklearn for further analysis of the results.

Repository Structure

news-multiclass-classification/
├── README.md
├── requirements.txt
├── .gitignore
├── screenshots/
└── src/
    └── news_multiclass_classifier.py
  • news_multiclass_classifier.py – end‑to‑end pipeline for preprocessing, feature extraction, model training and evaluation.

Getting Started

1. Environment

python -m venv .venv
source .venv/bin/activate        # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Make sure Spark is installed and accessible to PySpark.

2. Configure paths / hyperparameters

In src/news_multiclass_classifier.py, update:

  • NEWS_DATA_PATH – path to the JSON news dataset.
  • TRAIN_TEST_SPLIT – train/test proportions (default 0.8/0.2).
  • Any other hyperparameters (vector size, regularisation, etc.).
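
The settings above might look like the sketch below. `NEWS_DATA_PATH` and `TRAIN_TEST_SPLIT` are named in this README, but the list form of the split and the names `VECTOR_SIZE`, `REG_PARAM`, `SEED` and `split_dataset` are assumptions made for illustration; the actual script may organise them differently.

```python
# Path from the original coursework setup; replace with your own location.
NEWS_DATA_PATH = "hdfs://localhost:9000/AssignmentDatasets/News_Category_Dataset_v3.json"

# Train/test proportions (default 0.8/0.2).
TRAIN_TEST_SPLIT = [0.8, 0.2]

# Hypothetical hyperparameter names and values, shown only as examples.
VECTOR_SIZE = 100   # Word2Vec embedding dimension
REG_PARAM = 0.01    # logistic regression regularisation strength
SEED = 42           # fixed seed so the split is reproducible


def split_dataset(df, weights=TRAIN_TEST_SPLIT, seed=SEED):
    """Split a Spark DataFrame into train and test parts."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("split proportions must sum to 1")
    train_df, test_df = df.randomSplit(weights, seed=seed)
    return train_df, test_df
```

Note that `randomSplit` assigns rows probabilistically, so the resulting sizes are only approximately 80% and 20%.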

3. Train and evaluate

spark-submit src/news_multiclass_classifier.py

The script will:

  1. Load and preprocess the headlines.
  2. Build a pipeline of Tokenizer / StopWordsRemover / Word2Vec (or similar) + StringIndexer for labels.
  3. Train a logistic regression classifier.
  4. Evaluate accuracy and macro F1 on the test split.
  5. Print some example predictions.
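
The steps above can be sketched as a single Spark ML `Pipeline`. This is an illustration under stated assumptions rather than the project's actual code: the function names, intermediate column names (`tokens`, `filtered`, `features`, `label`) and default hyperparameters are all hypothetical, and it assumes the input DataFrame has `headline` and `category` columns.

```python
def build_pipeline(vector_size=100, reg_param=0.01):
    """Assemble tokenisation, stop-word removal, Word2Vec, label indexing
    and logistic regression into one Pipeline (needs an active SparkSession)."""
    if vector_size <= 0:
        raise ValueError("vector_size must be positive")

    # Imported here so the module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import (
        StopWordsRemover,
        StringIndexer,
        Tokenizer,
        Word2Vec,
    )

    stages = [
        # Lower-cases the headline and splits it on whitespace.
        Tokenizer(inputCol="headline", outputCol="tokens"),
        # Drops common English stop words by default.
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        # Averages the word vectors of each headline into one dense vector.
        Word2Vec(inputCol="filtered", outputCol="features",
                 vectorSize=vector_size, minCount=2),
        # Maps category strings to numeric labels, most frequent first.
        StringIndexer(inputCol="category", outputCol="label",
                      handleInvalid="skip"),
        LogisticRegression(featuresCol="features", labelCol="label",
                           regParam=reg_param, maxIter=50),
    ]
    return Pipeline(stages=stages)


def train_and_predict(train_df, test_df, **kwargs):
    """Fit the pipeline on the training split and score the test split."""
    model = build_pipeline(**kwargs).fit(train_df)
    return model.transform(test_df)
```

Fitting all stages through one `Pipeline` means the Word2Vec vocabulary and the label index are learned from the training split only and then reused unchanged on the test split.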

Example Screenshots

The screenshots below are from my original coursework and show the outputs produced at each stage.

  1. Dataset Pre-processing Steps

    Screenshot

    Output showing successful loading and pre-processing of the data

  2. String Indexing for Labels (Categories)

    Screenshot

    Categories (topics) of the first five rows and their label indices

  3. Classifier – Logistic Regression

    Screenshot

  4. Evaluation

    Screenshot

    Test accuracy: 42%

    One likely reason for the modest accuracy is that the model is trained for single-label (multi-class) classification, assigning each headline to exactly one category, whereas many headlines in the dataset could reasonably belong to several categories at once. This inherent multi-label ambiguity is not captured by the current approach. In addition, using only the headline text limits the contextual information available for classification.

    Accuracy could potentially be improved by also including the "short_description" field from the dataset, which offers a fuller representation of the article content. Replacing Word2Vec with fine-tuned deep-learning embeddings (such as BERT) could improve it further.

  5. Confusion Matrix Overview

    Screenshot
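
As a rough illustration of how the evaluation figures relate to the confusion matrix, the sketch below derives accuracy and macro F1 from counts of (label, prediction) pairs. The function names are hypothetical and the example pairs are invented; `counts_from_predictions` assumes a predictions DataFrame with `label` and `prediction` columns, as produced by Spark ML classifiers by default. Macro F1 is computed by hand here because, as far as I am aware, the `f1` metric of Spark's `MulticlassClassificationEvaluator` is the weighted variant.

```python
from collections import Counter


def counts_from_predictions(predictions):
    """Collect confusion-matrix cell counts from a Spark predictions DataFrame."""
    rows = predictions.groupBy("label", "prediction").count().collect()
    return Counter({(r["label"], r["prediction"]): r["count"] for r in rows})


def accuracy(counts):
    """Share of examples on the diagonal of the confusion matrix."""
    total = sum(counts.values())
    correct = sum(n for (y, p), n in counts.items() if y == p)
    return correct / total


def macro_f1(counts):
    """Unweighted mean of the per-class F1 scores."""
    labels = {y for y, _ in counts} | {p for _, p in counts}
    scores = []
    for lab in labels:
        tp = counts[(lab, lab)]
        fp = sum(n for (y, p), n in counts.items() if p == lab and y != lab)
        fn = sum(n for (y, p), n in counts.items() if y == lab and p != lab)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Invented (label, prediction) pairs for three classes.
    pairs = [(0, 0), (0, 0), (0, 1), (1, 1), (1, 0), (2, 2)]
    cells = Counter(pairs)
    print(f"accuracy={accuracy(cells):.3f}  macro_f1={macro_f1(cells):.3f}")
```

Because macro F1 weights every category equally, it is more sensitive than accuracy to poor performance on rare categories.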

Notes

  • The pipeline structure follows the design described in my assignment report, where the focus was on demonstrating good practice in scalable ML pipelines rather than on maximising accuracy.
  • You can easily extend this by swapping in other classifiers (Random Forest, Linear SVM, etc.) or using TF‑IDF features.
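
A hedged sketch of two such extensions: combining the headline with the `short_description` field mentioned in the Evaluation section, and swapping Word2Vec for TF‑IDF features with a choice of classifier. All function names, column names and defaults here are assumptions for illustration. Linear SVM is left out because Spark's `LinearSVC` is a binary classifier and would need wrapping (for example in `OneVsRest`) for multi‑class use.

```python
CLASSIFIERS = ("logistic_regression", "random_forest")


def add_text_column(df):
    """Join headline and short_description into a single `text` column.

    concat_ws skips null values, so rows without a description still work.
    """
    from pyspark.sql import functions as F

    return df.withColumn(
        "text", F.concat_ws(" ", F.col("headline"), F.col("short_description"))
    )


def build_tfidf_pipeline(classifier="logistic_regression",
                         text_col="headline", num_features=1 << 16):
    """Build a TF-IDF variant of the pipeline (needs an active SparkSession)."""
    if classifier not in CLASSIFIERS:
        raise ValueError(f"classifier must be one of {CLASSIFIERS}")

    # Imported here so the module can be loaded without Spark installed.
    from pyspark.ml import Pipeline
    from pyspark.ml.classification import (
        LogisticRegression,
        RandomForestClassifier,
    )
    from pyspark.ml.feature import (
        IDF,
        HashingTF,
        StopWordsRemover,
        StringIndexer,
        Tokenizer,
    )

    if classifier == "random_forest":
        clf = RandomForestClassifier(featuresCol="features", labelCol="label",
                                     numTrees=50)
    else:
        clf = LogisticRegression(featuresCol="features", labelCol="label")

    return Pipeline(stages=[
        Tokenizer(inputCol=text_col, outputCol="tokens"),
        StopWordsRemover(inputCol="tokens", outputCol="filtered"),
        # Hash each token into a fixed-size term-frequency vector.
        HashingTF(inputCol="filtered", outputCol="tf", numFeatures=num_features),
        # Down-weight terms that appear in many headlines.
        IDF(inputCol="tf", outputCol="features"),
        StringIndexer(inputCol="category", outputCol="label",
                      handleInvalid="skip"),
        clf,
    ])
```

To train on the combined text, one could call `build_tfidf_pipeline(text_col="text").fit(add_text_column(train_df))`. Tree ensembles tend to be slow on very wide sparse vectors, so a smaller `num_features` may be preferable with `random_forest`.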
