CountVectorizer Comparison Project

A project for comparing various text preprocessing methods when using CountVectorizer in machine learning.

Project Description

This project compares 5 different approaches to text vectorization using CountVectorizer from the scikit-learn library:

Base Method - Standard CountVectorizer without additional processing
With Stop Words Removal - Removal of common words (e.g., "the", "and", "is")
With Lemmatization - Reducing words to their normal form (lemma)
With Stemming - Trimming word endings to their root (stem)
With Simple Tokenizer - Fast splitting by spaces without punctuation handling

The project uses the BBC news dataset for classification and compares the effectiveness of each method across several metrics.
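As a rough illustration of how three of these approaches differ at the CountVectorizer level, here is a minimal sketch. It is not the code from the methods/ folder: the toy corpus and the variable names are assumptions made for this example, and the stemming and lemmatization variants (which would plug an NLTK-based function into the tokenizer parameter) are left out.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny illustrative corpus (an assumption for this sketch, not the BBC dataset).
docs = [
    "The markets are rising, and the banks are lending.",
    "The team is winning; the fans are cheering!",
]

# Base method: default lowercasing and tokenization, no extra processing.
base = CountVectorizer()

# Stop-word removal: scikit-learn's built-in English stop-word list.
no_stop = CountVectorizer(stop_words="english")

# Simple tokenizer: split on whitespace only, so punctuation stays attached
# to the words. token_pattern=None avoids the "token_pattern is unused" warning.
simple = CountVectorizer(tokenizer=str.split, token_pattern=None)

for name, vec in [("base", base), ("stop words", no_stop), ("simple", simple)]:
    vec.fit(docs)
    print(f"{name}: {len(vec.vocabulary_)} unique terms")
```

Note that the whitespace tokenizer keeps tokens such as "rising," with the trailing comma, which is the trade-off for its speed.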

Quick Start

Installation

Clone the repository:

git clone https://github.com/IlyaShaposhnikov/count-vectorizer.git
cd count-vectorizer

Install dependencies:

pip install -r requirements.txt

Download the dataset and place it into the data folder:

Download the dataset on Kaggle

Usage

Run the main comparison script:

python main.py

The script will automatically:

Download the required NLTK resources (on the first run this may take a few minutes, depending on your internet connection speed; subsequent runs reuse the downloaded files, so this step completes almost immediately)
Load and prepare the data
Train models with different vectorization methods
Print the comparison results
Generate visualizations

Project Structure

count-vectorizer/
├── data/                               # Data folder
│   └── bbc_text_cls.csv                # BBC News Dataset
├── methods/                            # Implementations of various vectorization methods
│   ├── base_vectorizer.py              # Base method
│   ├── stopwords_vectorizer.py         # With stop words removal
│   ├── lemmatization_vectorizer.py     # With lemmatization
│   ├── stemming_vectorizer.py          # With stemming
│   └── simple_tokenizer_vectorizer.py  # With simple tokenizer
├── utils/                              # Utility scripts
│   ├── data_loader.py                  # Data loading utilities
│   ├── nltk_utils.py                   # NLTK resource management and helpers
│   ├── reporting.py                    # Result printing and saving
│   ├── vectorizer_utils.py             # Common vectorizer metrics calculation
│   └── visualization.py                # Result plotting
├── results/                            # Folder for saving results
├── main.py                             # Main comparison script
├── requirements.txt                    # Project dependencies
├── README.md                           # Project documentation (English)
└── README.ru.md                        # Project documentation (Russian)

Comparison Metrics

Each method is evaluated based on the following metrics:

Training Accuracy - Model accuracy on the training data
Test Accuracy - Model accuracy on held-out data the model has not seen during training (primary metric)
Vocabulary Size - Number of unique words after processing
Matrix Density - Percentage of non-zero elements in the feature matrix
Execution Time - Total time for training and evaluation of the method
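The five metrics above could be collected along the following lines. This is a sketch under stated assumptions, not the contents of utils/vectorizer_utils.py: the function name evaluate, the dictionary keys, and the toy texts are all hypothetical.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def evaluate(vectorizer, train_texts, train_labels, test_texts, test_labels):
    """Fit one vectorizer plus Multinomial Naive Bayes and collect five metrics."""
    start = time.perf_counter()
    X_train = vectorizer.fit_transform(train_texts)  # sparse document-term matrix
    X_test = vectorizer.transform(test_texts)        # reuse the fitted vocabulary
    model = MultinomialNB().fit(X_train, train_labels)
    train_acc = model.score(X_train, train_labels)
    test_acc = model.score(X_test, test_labels)
    elapsed = time.perf_counter() - start

    n_rows, n_cols = X_train.shape
    return {
        "train_accuracy": train_acc,
        "test_accuracy": test_acc,
        "vocabulary_size": len(vectorizer.vocabulary_),
        # Share of stored non-zero cells in the training matrix, in percent.
        "density_percent": 100.0 * X_train.nnz / (n_rows * n_cols),
        "execution_time_s": elapsed,
    }


# Toy data (an assumption for this sketch, not the BBC dataset).
train_texts = [
    "stocks rise as banks report profit",
    "team wins the cup final",
    "banks cut rates",
    "striker scores in final",
]
train_labels = ["business", "sport", "business", "sport"]
test_texts = ["banks report profit"]
test_labels = ["business"]

metrics = evaluate(CountVectorizer(), train_texts, train_labels, test_texts, test_labels)
for key, value in metrics.items():
    print(key, value)
```

Density is measured on the training matrix here; whether the project measures it on the training matrix or on the full dataset is not stated above.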

Technical Details

Libraries Used

numpy - Working with numerical arrays
pandas - Data processing and analysis
scikit-learn - Machine learning and CountVectorizer
nltk - Natural language processing
matplotlib/seaborn - Result visualization
tabulate - Human-readable table display

Dataset

The project uses the BBC News dataset, containing 2225 documents across 5 categories:

business (510 documents)
entertainment (386 documents)
politics (417 documents)
sport (511 documents)
tech (401 documents)
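A quick way to check the per-category counts after downloading the file is sketched below. The column names text and labels are assumptions about the CSV layout, and load_dataset is a hypothetical helper, not the function in utils/data_loader.py. An in-memory stand-in replaces the real file so the sketch runs without the download.

```python
import io

import pandas as pd


def load_dataset(source, text_col="text", label_col="labels"):
    """Read the CSV and return (texts, labels) as two pandas Series."""
    df = pd.read_csv(source)
    missing = {text_col, label_col} - set(df.columns)
    if missing:
        raise KeyError(f"Missing expected column(s): {sorted(missing)}")
    return df[text_col].astype(str), df[label_col]


# Stand-in for data/bbc_text_cls.csv (assumed column names: text, labels).
sample_csv = io.StringIO(
    "text,labels\n"
    '"Shares rose sharply.",business\n'
    '"The striker scored twice.",sport\n'
    '"Profits fell again.",business\n'
)

texts, labels = load_dataset(sample_csv)
counts = labels.value_counts().to_dict()  # documents per category
print(counts)
```

With the real file, calling load_dataset("data/bbc_text_cls.csv") and comparing the counts against the figures above is a cheap sanity check before training.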

Classification Algorithm

Multinomial Naive Bayes is used for all methods - an algorithm well-suited for text classification.
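To make the pairing concrete, the following sketch chains CountVectorizer and MultinomialNB in a scikit-learn Pipeline and classifies two unseen sentences. The toy training texts are assumptions for this example, and the project itself may wire the two steps together differently.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training data (an assumption for this sketch, not the BBC dataset).
train_texts = [
    "stocks rise as banks report profit",
    "banks cut interest rates",
    "team wins the cup final",
    "striker scores in the final",
]
train_labels = ["business", "business", "sport", "sport"]

# Word counts feed directly into Multinomial Naive Bayes, which models
# each class as a distribution over term frequencies.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(train_texts, train_labels)

# Words not seen during training ("rising", "tonight") are simply ignored.
predictions = clf.predict(["banks report rising profit", "cup final tonight"])
print(list(predictions))
```

Swapping in a different vectorization method only changes the first pipeline step, which is what keeps the comparison between methods fair.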

Result Output

After running main.py, you will see:

Progress of each method execution
A detailed comparison table
Graphs with result visualizations
Saved results: detailed_results.csv table and comparison_results.png graph in the results folder

Key Findings

The comparison demonstrates the following trends:

Stop-word Removal: Generally reduces dimensionality (vocabulary size) and often improves classification accuracy compared to the base method.
Lemmatization vs. Stemming: Lemmatization usually achieves higher accuracy than stemming (e.g., ~97.3% vs ~96.8% in typical runs), but at a significant computational cost, resulting in much longer execution times.
Speed vs. Accuracy: Simpler methods (like removing stop-words or using a simple tokenizer) are generally much faster than complex ones (like lemmatization or stemming). The stop-words removal method often provides a good balance, frequently achieving high accuracy while remaining relatively fast. However, the absolute fastest method might vary slightly between runs depending on the system.
Vocabulary Size & Density: More aggressive text processing (like stemming or lemmatization) tends to reduce vocabulary size compared to simpler methods. This often leads to a higher feature matrix density (as the same concepts are represented by fewer unique terms), although the density remains quite low overall (typically well below 2%).
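The vocabulary and density effect described above can be reproduced on a toy corpus. The sketch below compares a whitespace tokenizer with and without NLTK's PorterStemmer (which needs no extra resource download); the corpus and the helper names stem_tokenizer and density_percent are assumptions for this example, and the project's own stemming method may use a different stemmer or tokenizer.

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()


def stem_tokenizer(text):
    """Split on whitespace, then reduce each token to its Porter stem."""
    return [stemmer.stem(token) for token in text.split()]


def density_percent(matrix):
    """Percentage of non-zero cells in a sparse document-term matrix."""
    n_rows, n_cols = matrix.shape
    return 100.0 * matrix.nnz / (n_rows * n_cols)


# Toy corpus with several inflected forms of the same words (an assumption).
docs = [
    "the bank raised rates",
    "banks are raising a rate",
    "banking rates rose",
]

# Same whitespace splitting in both cases, so only stemming differs.
plain = CountVectorizer(tokenizer=str.split, token_pattern=None)
stemmed = CountVectorizer(tokenizer=stem_tokenizer, token_pattern=None)

X_plain = plain.fit_transform(docs)
X_stemmed = stemmed.fit_transform(docs)

print("plain:  ", len(plain.vocabulary_), "terms,", round(density_percent(X_plain), 1), "% dense")
print("stemmed:", len(stemmed.vocabulary_), "terms,", round(density_percent(X_stemmed), 1), "% dense")
```

Because forms such as "bank", "banks", and "banking" collapse into one column, the stemmed matrix has fewer columns for the same documents, which is why its density goes up.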

Author

Ilya Shaposhnikov | E-mail | LinkedIn

Russian Version / На русском
