A beginner-friendly text mining project written in Python.
It reads multiple .txt files from a folder, cleans the English text, and reports the top N most frequent words for each document and for the collection as a whole.
The dataset consists of the top 5 ebooks from Project Gutenberg at the time of this project’s creation:
- Alice's Adventures in Wonderland — Lewis Carroll
- Frankenstein; or, The Modern Prometheus — Mary Wollstonecraft Shelley
- Moby Dick; or, The Whale — Herman Melville
- Pride and Prejudice — Jane Austen
- Romeo and Juliet — William Shakespeare
All texts are stored in the data/ folder in plain .txt format.
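
The counting step described above can be sketched roughly as follows. This is an illustration only, not the code in src/main.py: the function names `tokenize` and `count_folder` are hypothetical, and a small hard-coded stopword set stands in for the NLTK English list so the sketch runs without any downloads.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: a tiny hard-coded stopword set replaces NLTK's English list
# so that this sketch has no external dependencies.
STOPWORDS = {"the", "and", "a", "of", "to", "in", "it", "is", "was"}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [w for w in words if w not in STOPWORDS]


def count_folder(folder: str) -> tuple[dict[str, Counter], Counter]:
    """Return per-document counters keyed by file name, plus a global counter."""
    per_doc: dict[str, Counter] = {}
    overall: Counter = Counter()
    for path in sorted(Path(folder).glob("*.txt")):
        counts = Counter(tokenize(path.read_text(encoding="utf-8")))
        per_doc[path.name] = counts
        overall.update(counts)  # add this document's counts to the global total
    return per_doc, overall
```

With the books in data/, `count_folder("data")` would return both sets of counts, and `overall.most_common(20)` would give the global top 20.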
- Python 3.10–3.12
- Install dependencies:
    python -m venv .venv
    .venv\Scripts\activate
    pip install -r requirements.txt

Then download the NLTK stopword list from a Python session:

    import nltk
    nltk.download("stopwords")

This repository uses a flat src/ layout, so run the script directly:

    python src/main.py --input data --output output --top 20

or with short flags:

    python src/main.py -i data -o output -t 20

CSV files are written to output/:
- frequencies_global.csv
- top_global.csv
- top_per_document.csv
Note:
- output/ is ignored in version control (only .gitkeep is tracked).
- Small preview images live in examples/ (e.g., examples/top_global.png, examples/top_per_document.png).
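
For readers curious how the long and short flags used above could be wired up, here is a minimal argparse sketch. It is not taken from src/main.py: the function name `build_parser`, the help strings, the required options, and the default of 20 for --top are all assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that accepts the long and short flags shown in this README."""
    parser = argparse.ArgumentParser(
        description="Count the most frequent words in a folder of .txt files."
    )
    # Assumption: input and output are required; the real script may differ.
    parser.add_argument("-i", "--input", required=True, help="folder containing .txt files")
    parser.add_argument("-o", "--output", required=True, help="folder for the CSV files")
    # Assumption: a default of 20 mirrors the example commands.
    parser.add_argument("-t", "--top", type=int, default=20, help="number of top words to keep")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch is easy to try.
args = build_parser().parse_args(["-i", "data", "-o", "output", "-t", "20"])
```

Because each option registers both spellings, `--input data` and `-i data` set the same `args.input` attribute.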
To reproduce the same results:
    .venv\Scripts\activate
    pip install -r requirements.txt

Download the NLTK stopword list from a Python session:

    import nltk
    nltk.download("stopwords")

Then run the script:

    python src/main.py -i data -o output -t 20