A project for comparing various text preprocessing methods when using CountVectorizer in machine learning.
This project compares 5 different approaches to text vectorization using CountVectorizer from the scikit-learn library:
- Base Method - Standard CountVectorizer without additional processing
- With Stop Words Removal - Removal of common words (e.g., "the", "and", "is")
- With Lemmatization - Reducing words to their normal form (lemma)
- With Stemming - Trimming word endings to their root (stem)
- With Simple Tokenizer - Fast splitting by spaces without punctuation handling
The project uses the BBC news dataset for classification and compares the effectiveness of each method across several metrics.
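As a rough illustration of how three of these approaches differ at the CountVectorizer level, the sketch below configures the base, stop-word, and simple-tokenizer variants on a toy corpus. The corpus and the dictionary keys are made up for illustration, and the project's own modules in methods/ may be configured differently. Lemmatization and stemming are left out here because they need NLTK; they would plug in as a custom tokenizer callable.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, for illustration only (not the BBC dataset).
docs = ["The cat sat on the mat.", "The dog sat."]

vectorizers = {
    # Default settings: lowercasing plus the built-in word regex.
    "base": CountVectorizer(),
    # Drop common English words using scikit-learn's built-in list.
    "stop_words": CountVectorizer(stop_words="english"),
    # Split on whitespace only; punctuation stays attached to tokens.
    # token_pattern=None avoids the "token_pattern will not be used" warning.
    "simple_tokenizer": CountVectorizer(tokenizer=str.split, token_pattern=None),
}

vocabularies = {}
for name, vectorizer in vectorizers.items():
    vectorizer.fit(docs)
    vocabularies[name] = sorted(vectorizer.vocabulary_)
    print(f"{name}: {vocabularies[name]}")
```

Note that the simple tokenizer keeps "mat." and "sat." as separate tokens from "mat" and "sat", which is why it can produce a larger vocabulary than the base method.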
- Clone the repository:
git clone https://github.com/IlyaShaposhnikov/count-vectorizer.git
cd count-vectorizer
- Install dependencies:
pip install -r requirements.txt
- Download the dataset and place it into the data folder:
Download the dataset on Kaggle
Run the main comparison script:
python main.py
The script will automatically:
- Download the necessary NLTK resources (on the first run this may take a few minutes, depending on your internet connection speed; subsequent runs reuse the downloaded resources and skip this step)
- Load and prepare the data
- Train models with different vectorization methods
- Print the comparison results
- Generate visualizations
count-vectorizer/
├── data/ # Data folder
│ └── bbc_text_cls.csv # BBC News Dataset
├── methods/ # Implementations of various vectorization methods
│ ├── base_vectorizer.py # Base method
│ ├── stopwords_vectorizer.py # With stop words removal
│ ├── lemmatization_vectorizer.py # With lemmatization
│ ├── stemming_vectorizer.py # With stemming
│ └── simple_tokenizer_vectorizer.py # With simple tokenizer
├── utils/ # Utility scripts
│ ├── data_loader.py # Data loading utilities
│ ├── nltk_utils.py # NLTK resource management and helpers
│ ├── reporting.py # Result printing and saving
│ ├── vectorizer_utils.py # Common vectorizer metrics calculation
│ └── visualization.py # Result plotting
├── results/ # Folder for saving results
├── main.py # Main comparison script
├── requirements.txt # Project dependencies
├── README.md # Project documentation (English)
└── README.ru.md # Project documentation (Russian)
Each method is evaluated based on the following metrics:
- Training Accuracy - Model accuracy on the training data
- Test Accuracy - Model accuracy on new data (primary metric)
- Vocabulary Size - Number of unique words after processing
- Matrix Density - Percentage of non-zero elements in the feature matrix
- Execution Time - Total time for training and evaluation of the method
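The metrics above could be computed along the following lines. This is a minimal sketch, not the project's actual code in utils/vectorizer_utils.py: the function name `evaluate`, the result keys, and the toy data are hypothetical, and it assumes density is measured on the training matrix.

```python
import time

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB


def evaluate(vectorizer, train_texts, train_labels, test_texts, test_labels):
    """Fit one vectorizer plus Multinomial Naive Bayes and collect metrics."""
    start = time.perf_counter()

    train_matrix = vectorizer.fit_transform(train_texts)  # sparse matrix
    test_matrix = vectorizer.transform(test_texts)

    model = MultinomialNB()
    model.fit(train_matrix, train_labels)
    train_accuracy = model.score(train_matrix, train_labels)
    test_accuracy = model.score(test_matrix, test_labels)

    elapsed = time.perf_counter() - start

    n_rows, n_cols = train_matrix.shape
    return {
        "train_accuracy": train_accuracy,
        "test_accuracy": test_accuracy,
        "vocabulary_size": len(vectorizer.vocabulary_),
        # Share of non-zero cells, in percent (assumed: training matrix).
        "matrix_density_pct": 100.0 * train_matrix.nnz / (n_rows * n_cols),
        "execution_time_s": elapsed,
    }


# Toy data, for illustration only.
train_texts = [
    "stocks fell on profit warning",
    "team wins the cup final",
    "bank raises interest rates",
    "striker scores late goal",
]
train_labels = ["business", "sport", "business", "sport"]
test_texts = ["bank profit rises", "late goal wins final"]
test_labels = ["business", "sport"]

metrics = evaluate(CountVectorizer(), train_texts, train_labels,
                   test_texts, test_labels)
print(metrics)
```

Timing the vectorization and the model together keeps the comparison fair, since the expensive part of lemmatization or stemming happens inside the vectorizer's tokenizer.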
- numpy - Working with numerical arrays
- pandas - Data processing and analysis
- scikit-learn - Machine learning and CountVectorizer
- nltk - Natural language processing
- matplotlib/seaborn - Result visualization
- tabulate - Human-readable table display
The project uses the BBC News dataset, containing 2225 documents across 5 categories:
- business (510 documents)
- entertainment (386 documents)
- politics (417 documents)
- sport (511 documents)
- tech (401 documents)
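A quick way to check the class distribution and make a stratified train/test split might look like the sketch below. The column names `text` and `labels` are assumptions about the CSV layout, and a tiny in-memory CSV stands in for data/bbc_text_cls.csv so the snippet runs on its own; the split ratio is likewise only illustrative.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for data/bbc_text_cls.csv; column names are assumed.
sample_csv = io.StringIO(
    "text,labels\n"
    "Shares rose after the earnings report,business\n"
    "The central bank cut interest rates,business\n"
    "The striker scored twice in the final,sport\n"
    "The champion retained her title,sport\n"
)
df = pd.read_csv(sample_csv)

# Number of documents per category.
class_counts = df["labels"].value_counts()
print(class_counts)

# Stratifying keeps the category proportions equal in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.5, stratify=df["labels"], random_state=42
)
print(len(train_df), len(test_df))
```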
All methods use Multinomial Naive Bayes, an algorithm well suited to text classification.
After running main.py, you will see:
- Progress of each method execution
- A detailed comparison table
- Graphs with result visualizations
- Saved results: the detailed_results.csv table and the comparison_results.png graph in the results folder
The comparison demonstrates the following trends:
- Stop-word Removal: Generally reduces dimensionality (vocabulary size) and often improves classification accuracy compared to the base method.
- Lemmatization vs. Stemming: Lemmatization usually achieves higher accuracy than stemming (e.g., ~97.3% vs ~96.8% in typical runs), but at a significant computational cost, resulting in much longer execution times.
- Speed vs. Accuracy: Simpler methods (like removing stop-words or using a simple tokenizer) are generally much faster than complex ones (like lemmatization or stemming). The stop-words removal method often provides a good balance, frequently achieving high accuracy while remaining relatively fast. However, the absolute fastest method might vary slightly between runs depending on the system.
- Vocabulary Size & Density: More aggressive text processing (like stemming or lemmatization) tends to reduce vocabulary size compared to simpler methods. This often leads to a higher feature matrix density (as the same concepts are represented by fewer unique terms), although the density remains quite low overall (typically well below 2%).
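The vocabulary-size effect can be seen on a small example. The sketch below plugs a stemming tokenizer into CountVectorizer; it assumes NLTK's PorterStemmer and a simple regex word splitter, which may differ from what methods/stemming_vectorizer.py actually uses, and the two sentences are made up for illustration.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

stemmer = PorterStemmer()


def stemming_tokenizer(text):
    """Split text into word tokens (2+ characters) and stem each one."""
    return [stemmer.stem(token) for token in re.findall(r"\b\w\w+\b", text)]


# Toy corpus, for illustration only.
docs = [
    "The player plays and played well",
    "Players playing in the final",
]

base = CountVectorizer().fit(docs)
stemmed = CountVectorizer(tokenizer=stemming_tokenizer, token_pattern=None).fit(docs)

base_vocab = sorted(base.vocabulary_)
stemmed_vocab = sorted(stemmed.vocabulary_)
print(len(base_vocab), base_vocab)
print(len(stemmed_vocab), stemmed_vocab)
```

Inflected forms such as "plays", "played", and "playing" collapse to a single stem, so the same documents share more columns and the matrix becomes denser.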