Word Frequency Estimation on a Large English Corpus

This repository contains a task from my MSc Machine Learning on Big Data coursework at the University of East London.
The aim is to study word frequency distributions on a large English corpus using PySpark, and to explore how estimates from small samples compare with the full dataset.

The task also demonstrates how classic ideas such as Zipf's law and Good‑Turing discounting can be implemented in a distributed environment.


⚠️ Note on academic integrity:
This repository is a restructured, summarised version of work originally completed as part of a university assignment. It is shared here purely as a portfolio piece demonstrating practical skills in Machine Learning on Big Data. Anyone using this material for their own coursework should adapt, extend, and properly acknowledge it rather than copy it directly.


Objectives

  • Load and preprocess a large text corpus using PySpark (a minimal word-count sketch follows this list).
  • Compute word frequency distributions at different sample sizes.
  • Compare how well small samples approximate the global frequency distribution.
  • Explore Good‑Turing discounting using frequency‑of‑frequencies statistics.
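
The core of these objectives is a distributed word count. The snippet below is a minimal, illustrative sketch only; the full logic (sampling, comparisons, Good-Turing counts) lives in src/word_frequency_estimation.py, and CORPUS_PATH here is just a placeholder.

# Minimal word-count sketch (illustrative only).
import re

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-frequency-sketch").getOrCreate()
sc = spark.sparkContext

CORPUS_PATH = "path/to/your/corpus.txt"  # local path or hdfs:// URI

tokens = (
    sc.textFile(CORPUS_PATH)
      .map(lambda line: re.sub(r"[^a-z\s]", " ", line.lower()))  # lowercase, keep letters only
      .flatMap(lambda line: line.split())                        # whitespace tokenisation
)

word_counts = tokens.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(word_counts.takeOrdered(20, key=lambda kv: -kv[1]))        # 20 most frequent (word, count) pairs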

Dataset

  • Corpus of Contemporary American English (COCA) (English-Corpora.org) – English text samples containing ~9.5 million words, used for large‑scale word frequency analysis.
  • In the original coursework the corpus file was stored on HDFS, e.g.:
    hdfs://localhost:9000/AssignmentDatasets/COCA_English_Corpora.txt

The corpus is not provided in this repository.
Please use your own licensed copy or a suitable alternative corpus and adjust the input path accordingly.


Tech Stack

  • Python 3.x
  • Apache Spark / PySpark
  • (Optional) Hadoop HDFS
  • Basic Python libraries for post‑processing (e.g. collections, pandas, matplotlib).

Repository Structure

word-frequency-estimation/
├── README.md
├── requirements.txt
├── .gitignore
├── screenshots/
└── src/
    └── word_frequency_estimation.py
  • word_frequency_estimation.py – main PySpark script to compute word frequency distributions and sample‑based estimates.

Getting Started

1. Environment

python -m venv .venv
source .venv/bin/activate        # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Ensure that Spark is installed and configured correctly.
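
As an optional sanity check, you can confirm that PySpark is importable from the activated virtual environment by running the following in a Python shell:

# Quick check that PySpark is available in the current environment.
import pyspark
print(pyspark.__version__)  # prints the installed PySpark version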

2. Configure paths and sample sizes

Open src/word_frequency_estimation.py and adjust:

  • CORPUS_PATH – path to your COCA (or other) corpus file.
  • SAMPLE_SIZES – a list of token counts for the subsets you want to analyse (e.g. [1_000, 10_000, 100_000, 1_000_000]).
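
For example, the constants at the top of the script might look like this (values are illustrative; point CORPUS_PATH at your own corpus):

CORPUS_PATH = "hdfs://localhost:9000/AssignmentDatasets/COCA_English_Corpora.txt"  # or a local file path
SAMPLE_SIZES = [1_000, 10_000, 100_000, 1_000_000]  # token counts for the sampled subsets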

3. Run the script

spark-submit src/word_frequency_estimation.py

The script will:

  1. Load and clean the corpus (lowercasing, punctuation removal, tokenisation).
  2. Compute global word frequency counts.
  3. Draw random subsets of various sizes and compute their frequency distributions.
  4. Print summary statistics and simple comparisons between sample and full‑corpus frequencies.
  5. Illustrate Good‑Turing style frequency‑of‑frequencies counts to motivate discounting for rare words.
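
Steps 3 and 5 might look roughly like the sketch below. It is illustrative only: it assumes sc, tokens and SAMPLE_SIZES as defined in the earlier snippets, and the variable names are not necessarily those used in the actual script.

total_tokens = tokens.count()

for size in SAMPLE_SIZES:
    # Step 3: draw a random subset of roughly `size` tokens (fixed seed for reproducibility).
    fraction = size / total_tokens
    sample_counts = (
        tokens.sample(withReplacement=False, fraction=fraction, seed=42)
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b)
    )

    # Step 5: frequency of frequencies, N_c = number of distinct words seen exactly c times.
    # Good-Turing re-estimates a raw count c as c* = (c + 1) * N_{c+1} / N_c.
    freq_of_freqs = (
        sample_counts.map(lambda kv: (kv[1], 1))
                     .reduceByKey(lambda a, b: a + b)
                     .sortByKey()
    )
    print(size, freq_of_freqs.take(10))  # smallest counts first: N_1, N_2, ...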

Example Screenshots

These are some screenshots from my original coursework, showing the outputs produced at each stage of the work.

  1. Dataset Pre-processing Steps

     Screenshot: output showing successful completion of data cleaning and tokenisation.

  2. Raw Frequency Count

     Screenshot: the top 20 most frequent words from the 10,000-word subset.

  3. Good-Turing Discounting

     Screenshot: frequency-of-frequencies output showing how often different word counts occur, which motivates Good-Turing smoothing.

  4. Comparison with Baseline Distribution (using Cosine Similarity)

     Screenshot: terminal output showing cosine similarity scores across sample sizes (a sketch of this comparison follows the screenshots).

  5. Visualisation

     Plot: cosine similarity of sample word distributions vs. the baseline.
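
The cosine-similarity comparison in screenshot 4 can be reproduced in plain Python once the word counts have been collected to the driver (e.g. via collectAsMap()). Below is a minimal sketch, where baseline_counts and sample_counts are assumed to be word-to-count dictionaries (hypothetical names, not necessarily those used in the script):

import math

def cosine_similarity(counts_a, counts_b):
    # Treat each frequency table as a sparse vector indexed by word.
    dot = sum(c * counts_b.get(w, 0) for w, c in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# similarity = cosine_similarity(sample_counts, baseline_counts)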


Notes

  • For reproducibility the script uses a fixed random seed when sampling.
  • You can extend the script to produce plots (e.g. log‑rank vs log‑frequency to illustrate Zipf's law); a minimal sketch follows these notes.
  • The code is purposely written in a clear, step‑by‑step style, reflecting its origin as part of an assignment on machine learning with big data tools rather than a production library.
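
A minimal matplotlib sketch of such a Zipf-style plot, assuming word_counts is the global (word, count) pair RDD as in the earlier word-count sketch (name assumed for illustration):

import matplotlib.pyplot as plt

# Frequencies in descending order; under Zipf's law (f(r) proportional to 1/r)
# the rank-frequency curve is roughly a straight line on log-log axes.
counts = word_counts.map(lambda kv: kv[1]).sortBy(lambda c: -c).collect()
ranks = range(1, len(counts) + 1)

plt.loglog(ranks, counts)
plt.xlabel("rank (log scale)")
plt.ylabel("frequency (log scale)")
plt.title("Zipf's law on the full corpus")
plt.savefig("zipf_plot.png")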
