Word Frequency Estimation on a Large English Corpus

This repository contains a task from my MSc Machine Learning on Big Data coursework at the University of East London.
The aim is to study word frequency distributions on a large English corpus using PySpark, and to explore how estimates from small samples compare with the full dataset.

The task also demonstrates how classic ideas such as Zipf's law and Good‑Turing discounting can be implemented in a distributed environment.


⚠️ Note on academic integrity:
This repository is a restructured, summarised version of work originally completed as part of a university assignment. It is shared here purely as a portfolio piece demonstrating practical skills in Machine Learning on Big Data. Anyone using this material for their own coursework should adapt, extend, and properly acknowledge it rather than copy it directly.


Objectives

  • Load and preprocess a large text corpus using PySpark (a minimal word-count sketch follows this list).
  • Compute word frequency distributions at different sample sizes.
  • Compare how well small samples approximate the global frequency distribution.
  • Explore Good‑Turing discounting using frequency‑of‑frequencies statistics.
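
The core of these objectives is a distributed word count. The snippet below is a minimal, illustrative sketch only; the full logic (sampling, comparisons, Good-Turing counts) lives in src/word_frequency_estimation.py, and CORPUS_PATH here is just a placeholder.

# Minimal word-count sketch (illustrative only).
import re

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("word-frequency-sketch").getOrCreate()
sc = spark.sparkContext

CORPUS_PATH = "path/to/your/corpus.txt"  # local path or hdfs:// URI

tokens = (
    sc.textFile(CORPUS_PATH)
      .map(lambda line: re.sub(r"[^a-z\s]", " ", line.lower()))  # lowercase, keep letters only
      .flatMap(lambda line: line.split())                        # whitespace tokenisation
)

word_counts = tokens.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)
print(word_counts.takeOrdered(20, key=lambda kv: -kv[1]))        # 20 most frequent (word, count) pairs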

Dataset

  • Corpus of Contemporary American English (COCA) (English-Corpora.org) – English text samples containing ~9.5 million words, used for large‑scale word frequency analysis.
  • In the original coursework the corpus file was stored on HDFS, e.g.:
    hdfs://localhost:9000/AssignmentDatasets/COCA_English_Corpora.txt

The corpus is not provided in this repository.
Please use your own licensed copy or a suitable alternative corpus and adjust the input path accordingly.


Tech Stack

  • Python 3.x
  • Apache Spark / PySpark
  • (Optional) Hadoop HDFS
  • Basic Python libraries for post‑processing (e.g. collections, pandas, matplotlib).

Repository Structure

word-frequency-estimation/
├── README.md
├── requirements.txt
├── .gitignore
├── screenshots/
└── src/
    └── word_frequency_estimation.py
  • word_frequency_estimation.py – main PySpark script to compute word frequency distributions and sample‑based estimates.

Getting Started

1. Environment

python -m venv .venv
source .venv/bin/activate        # On Windows: .venv\Scripts\activate
pip install -r requirements.txt

Ensure that Spark is installed and configured correctly.
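
As an optional sanity check, you can confirm that PySpark is importable from the activated virtual environment by running the following in a Python shell:

# Quick check that PySpark is available in the current environment.
import pyspark
print(pyspark.__version__)  # prints the installed PySpark version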

2. Configure paths and sample sizes

Open src/word_frequency_estimation.py and adjust:

  • CORPUS_PATH – path to your COCA (or other) corpus file.
  • SAMPLE_SIZES – a list of token counts for the subsets you want to analyse (e.g. [1_000, 10_000, 100_000, 1_000_000]).
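
For example, the constants at the top of the script might look like this (values are illustrative; point CORPUS_PATH at your own corpus):

CORPUS_PATH = "hdfs://localhost:9000/AssignmentDatasets/COCA_English_Corpora.txt"  # or a local file path
SAMPLE_SIZES = [1_000, 10_000, 100_000, 1_000_000]  # token counts for the sampled subsets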

3. Run the script

spark-submit src/word_frequency_estimation.py

The script will:

  1. Load and clean the corpus (lowercasing, punctuation removal, tokenisation).
  2. Compute global word frequency counts.
  3. Draw random subsets of various sizes and compute their frequency distributions.
  4. Print summary statistics and simple comparisons between sample and full‑corpus frequencies.
  5. Illustrate Good‑Turing style frequency‑of‑frequencies counts to motivate discounting for rare words.
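
Steps 3 and 5 might look roughly like the sketch below. It is illustrative only: it assumes sc, tokens and SAMPLE_SIZES as defined in the earlier snippets, and the variable names are not necessarily those used in the actual script.

total_tokens = tokens.count()

for size in SAMPLE_SIZES:
    # Step 3: draw a random subset of roughly `size` tokens (fixed seed for reproducibility).
    fraction = size / total_tokens
    sample_counts = (
        tokens.sample(withReplacement=False, fraction=fraction, seed=42)
              .map(lambda w: (w, 1))
              .reduceByKey(lambda a, b: a + b)
    )

    # Step 5: frequency of frequencies, N_c = number of distinct words seen exactly c times.
    # Good-Turing re-estimates a raw count c as c* = (c + 1) * N_{c+1} / N_c.
    freq_of_freqs = (
        sample_counts.map(lambda kv: (kv[1], 1))
                     .reduceByKey(lambda a, b: a + b)
                     .sortByKey()
    )
    print(size, freq_of_freqs.take(10))  # smallest counts first: N_1, N_2, ...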

Example Screenshots

These are some screenshots from my original coursework, showing the outputs produced at each stage of the work.

  1. Dataset Pre-processing Steps

     Screenshot: output showing successful completion of data cleaning and tokenisation.

  2. Raw Frequency Count

     Screenshot: the top 20 most frequent words from the 10,000-word subset.

  3. Good-Turing Discounting

     Screenshot: frequency-of-frequencies output showing how often different word counts occur, which motivates Good-Turing smoothing.

  4. Comparison with Baseline Distribution (using Cosine Similarity)

     Screenshot: terminal output showing cosine similarity scores across sample sizes (a sketch of this comparison follows the screenshots).

  5. Visualisation

     Plot: cosine similarity of sample word distributions vs. the baseline.
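
The cosine-similarity comparison in screenshot 4 can be reproduced in plain Python once the word counts have been collected to the driver (e.g. via collectAsMap()). Below is a minimal sketch, where baseline_counts and sample_counts are assumed to be word-to-count dictionaries (hypothetical names, not necessarily those used in the script):

import math

def cosine_similarity(counts_a, counts_b):
    # Treat each frequency table as a sparse vector indexed by word.
    dot = sum(c * counts_b.get(w, 0) for w, c in counts_a.items())
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# similarity = cosine_similarity(sample_counts, baseline_counts)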


Notes

  • For reproducibility the script uses a fixed random seed when sampling.
  • You can extend the script to produce plots (e.g. log‑rank vs log‑frequency to illustrate Zipf's law); a minimal sketch follows these notes.
  • The code is purposely written in a clear, step‑by‑step style, reflecting its origin as part of an assignment on machine learning with big data tools rather than a production library.
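
A minimal matplotlib sketch of such a Zipf-style plot, assuming word_counts is the global (word, count) pair RDD as in the earlier word-count sketch (name assumed for illustration):

import matplotlib.pyplot as plt

# Frequencies in descending order; under Zipf's law (f(r) proportional to 1/r)
# the rank-frequency curve is roughly a straight line on log-log axes.
counts = word_counts.map(lambda kv: kv[1]).sortBy(lambda c: -c).collect()
ranks = range(1, len(counts) + 1)

plt.loglog(ranks, counts)
plt.xlabel("rank (log scale)")
plt.ylabel("frequency (log scale)")
plt.title("Zipf's law on the full corpus")
plt.savefig("zipf_plot.png")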
