This repository contains a task from my MSc coursework in Machine Learning on Big Data at the University of East London.
The aim is to study word frequency distributions on a large English corpus using PySpark, and to explore how estimates from small samples compare with the full dataset.
The task also demonstrates how classic ideas such as Zipf's law and Good‑Turing discounting can be implemented in a distributed environment.
⚠️ Note on academic integrity:
This repository is a restructured, summarised version of work originally completed as part of a university assignment. It is shared here purely as a portfolio piece demonstrating practical skills in Machine Learning on Big Data. Anyone using this material for their own coursework should adapt, extend, and properly acknowledge it rather than copying directly.
- Load and preprocess a large text corpus using PySpark (a minimal sketch follows this list).
- Compute word frequency distributions at different sample sizes.
- Compare how well small samples approximate the global frequency distribution.
- Explore Good‑Turing discounting using frequency‑of‑frequencies statistics.
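The first two objectives boil down to a standard distributed word count over a cleaned, tokenised corpus. Below is a minimal PySpark sketch of that step, not the exact assignment code; the local corpus path and the tokenisation regex are illustrative assumptions.

```python
import re

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordFrequencyEstimation").getOrCreate()
sc = spark.sparkContext

# Illustrative local path - the original coursework read the corpus from HDFS.
CORPUS_PATH = "data/corpus.txt"

# Lowercase each line, strip punctuation, and split into word tokens.
tokens = sc.textFile(CORPUS_PATH).flatMap(
    lambda line: re.findall(r"[a-z']+", line.lower())
)

# Global word frequency counts as (word, count) pairs.
word_counts = tokens.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Peek at the 20 most frequent words.
print(word_counts.takeOrdered(20, key=lambda wc: -wc[1]))
```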
- Corpus of Contemporary American English (COCA) (English-Corpora.org) – English text samples (~9.5 million words) used for large‑scale word frequency analysis.
- In the original coursework the corpus file was stored on HDFS, e.g.:
hdfs://localhost:9000/AssignmentDatasets/COCA_English_Corpora.txt
The corpus is not provided in this repository.
Please use your own licensed copy or a suitable alternative corpus and adjust the input path accordingly.
- Python 3.x
- Apache Spark / PySpark
- (Optional) Hadoop HDFS
- Basic Python libraries for post‑processing (e.g. `collections`, `pandas`, `matplotlib`).
word-frequency-estimation/
├── README.md
├── requirements.txt
├── .gitignore
├── screenshots
└── src/
└── word_frequency_estimation.py
- `word_frequency_estimation.py` – main PySpark script to compute word frequency distributions and sample‑based estimates.
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -r requirements.txt
Ensure that Spark is installed and configured correctly.
Open `src/word_frequency_estimation.py` and adjust:
- `CORPUS_PATH` – path to your COCA (or other) corpus file.
- `SAMPLE_SIZES` – a list of token counts for the subsets you want to analyse (e.g. `[1_000, 10_000, 100_000, 1_000_000]`).
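For illustration, these settings might look roughly like this near the top of the script (the HDFS path and sample sizes are the example values quoted in this README; replace them with your own):

```python
# Example values only - point CORPUS_PATH at your own licensed corpus.
CORPUS_PATH = "hdfs://localhost:9000/AssignmentDatasets/COCA_English_Corpora.txt"
SAMPLE_SIZES = [1_000, 10_000, 100_000, 1_000_000]
```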
spark-submit src/word_frequency_estimation.py
The script will:
- Load and clean the corpus (lowercasing, punctuation removal, tokenisation).
- Compute global word frequency counts.
- Draw random subsets of various sizes and compute their frequency distributions.
- Print summary statistics and simple comparisons between sample and full‑corpus frequencies.
- Illustrate Good‑Turing style frequency‑of‑frequencies counts to motivate discounting for rare words.
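The sampling and frequency‑of‑frequencies steps could be sketched roughly as follows, reusing `sc`, `tokens`, and `word_counts` from the earlier sketch; the sample size, seed, and variable names here are illustrative assumptions rather than the assignment's actual code.

```python
# Draw a fixed-size random subset of tokens with a fixed seed, then recount.
# (Reuses `sc` and `tokens` from the earlier sketch; 10_000 and 42 are arbitrary.)
sample_tokens = sc.parallelize(tokens.takeSample(False, 10_000, seed=42))
sample_counts = sample_tokens.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

# Frequency of frequencies: N_r = number of distinct words seen exactly r times.
freq_of_freqs = (
    word_counts.map(lambda wc: (wc[1], 1))
               .reduceByKey(lambda a, b: a + b)
               .sortByKey()
)

# Good-Turing adjusted count for frequency r: r* = (r + 1) * N_{r+1} / N_r.
n_r = dict(freq_of_freqs.collect())
for r in sorted(n_r):
    if r + 1 in n_r:
        r_star = (r + 1) * n_r[r + 1] / n_r[r]
        print(f"r={r}: N_r={n_r[r]}, Good-Turing adjusted count={r_star:.3f}")
```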
These screenshots from my original coursework show the outputs produced at each stage:
- Dataset Pre-processing Steps – output showing successful completion of data cleaning and tokenisation.
- Raw Frequency Count – the top 20 most frequent words from the 10,000-word subset.
- Good-Turing Discounting – output showing how often different word frequencies occur, supporting the Good-Turing smoothing logic.
- Comparison with Baseline Distribution (cosine similarity) – terminal output showing cosine similarity scores across sample sizes (a sketch of this comparison follows the list).
- Visualisation – cosine similarity of sample word distributions vs. the baseline.
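For reference, the cosine similarity comparison shown in those screenshots can be reproduced along these lines. This is a self-contained sketch that assumes the sample and full‑corpus counts have already been collected into plain Python dictionaries; the numbers below are made up purely for the example.

```python
import math

def cosine_similarity(sample_counts, baseline_counts):
    """Cosine similarity between two word-frequency dictionaries."""
    # Dot product over the words the two distributions share.
    dot = sum(c * baseline_counts.get(w, 0) for w, c in sample_counts.items())
    norm_sample = math.sqrt(sum(c * c for c in sample_counts.values()))
    norm_baseline = math.sqrt(sum(c * c for c in baseline_counts.values()))
    if norm_sample == 0 or norm_baseline == 0:
        return 0.0
    return dot / (norm_sample * norm_baseline)

# Toy example: a small sample distribution vs. the full-corpus baseline.
baseline = {"the": 500_000, "of": 250_000, "and": 240_000}
sample = {"the": 55, "of": 24, "and": 20, "zebra": 1}
print(f"cosine similarity: {cosine_similarity(sample, baseline):.4f}")
```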
- For reproducibility the script uses a fixed random seed when sampling.
- You can extend the script to produce plots (e.g. log‑rank vs log‑frequency to illustrate Zipf's law); a minimal plotting sketch follows these notes.
- The code is purposely written in a clear, step‑by‑step style, reflecting its origin as part of an assignment on machine learning with big data tools rather than a production library.
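As an example of such an extension, a log-log rank‑vs‑frequency plot could be produced roughly like this (a sketch assuming the `word_counts` RDD from the earlier example and a working matplotlib installation):

```python
import matplotlib.pyplot as plt

# Collect word frequencies in descending order; rank 1 = most frequent word.
# Collecting to the driver is fine here since the result is vocabulary-sized.
counts = word_counts.map(lambda wc: wc[1]).sortBy(lambda c: -c).collect()
ranks = range(1, len(counts) + 1)

plt.loglog(ranks, counts, marker=".", linestyle="none")
plt.xlabel("log rank")
plt.ylabel("log frequency")
plt.title("Zipf's law: rank vs. frequency")
plt.savefig("zipf_rank_frequency.png")
```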