Multi-model text classification with TF-IDF, word embeddings, transformers, and ensemble methods, built as an educational portfolio project.
+-------------------------------------------------------------------+
|                   TEXT CLASSIFICATION PIPELINE                    |
+-------------------------------------------------------------------+
|                                                                   |
|  +------------+    +----------------+    +--------------------+   |
|  | Raw Data   |--->| Preprocess     |--->| Feature Extract    |   |
|  | (CSV/      |    | - Clean        |    | - TF-IDF           |   |
|  | Synthetic) |    | - Tokenize     |    | - Embeddings       |   |
|  +------------+    | - Encode       |    | - Combined         |   |
|                    +----------------+    +---------+----------+   |
|                                                    |              |
|                    +----------------+    +---------v----------+   |
|                    | Evaluation     |<---| Classifiers        |   |
|                    | - Metrics      |    | - Naive Bayes      |   |
|                    | - Curves       |    | - Logistic Reg     |   |
|                    | - Calibration  |    | - SVM              |   |
|                    | - Report       |    | - Random Forest    |   |
|                    +-------+--------+    | - Transformer      |   |
|                            |             | - Ensemble         |   |
|                    +-------v--------+    +--------------------+   |
|                    | Serving        |                             |
|                    | - FastAPI      |    +--------------------+   |
|                    | - Streamlit    |<---| Hyperparameter     |   |
|                    +----------------+    | Tuning (Grid)      |   |
|                                          +--------------------+   |
|  +------------------------------------------------------------+   |
|  | Cross-Validation | Data Augmentation | Model Ensembling    |   |
|  +------------------------------------------------------------+   |
+-------------------------------------------------------------------+
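As a rough sketch of the flow in the diagram, the preprocessing, feature extraction, and classifier stages can be bundled into one scikit-learn `Pipeline` and saved with joblib. This is an illustration under assumptions, not the project's actual `src/pipeline.py`: the toy dataset, the step names, and the file name are made up for the example, and it assumes scikit-learn and joblib are installed.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up dataset, for illustration only (1 = positive, 0 = negative).
texts = [
    "great product, works perfectly",
    "absolutely love it",
    "excellent quality and fast delivery",
    "terrible, broke after one day",
    "awful experience, do not buy",
    "poor quality and slow delivery",
]
labels = [1, 1, 1, 0, 0, 0]

# Lowercasing, tokenization, and TF-IDF weighting all live inside the
# vectorizer, so training and serving share exactly the same steps.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(texts, labels)

# Persist and reload the whole pipeline, as an API process might at startup.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "pipeline.joblib"  # hypothetical file name
    joblib.dump(pipeline, model_path)
    restored = joblib.load(model_path)

new_texts = ["love the excellent quality", "terrible and awful"]
predictions = restored.predict(new_texts)
probabilities = restored.predict_proba(new_texts)
print(predictions, probabilities.round(2))
```

Because the vectorizer travels with the classifier, raw strings go in at inference time and there is no separate preprocessing code to keep in sync.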
| Concept | Description |
|---|---|
| TF-IDF | Term Frequency-Inverse Document Frequency converts text into numerical vectors. Words that appear often in one document but rarely across the corpus get high scores, making them discriminative features. |
| Naive Bayes | Applies Bayes' theorem with a "naive" independence assumption between features. Although the assumption rarely holds, it works remarkably well for text classification because individual word frequencies still carry strong class signal, and only the ranking of the classes has to be right. |
| SVM (Support Vector Machine) | Finds the optimal hyperplane that maximizes the margin between classes. Linear SVMs are especially effective for high-dimensional sparse data like TF-IDF vectors. |
| Logistic Regression | Models the probability of class membership using a logistic function. With TF-IDF features, each coefficient shows the direction and strength of a term's pull toward a class, which helps explain why the model makes each prediction. |
| Ensemble Methods | Combine several models into a stronger predictor. Hard voting takes the majority label and soft voting averages predicted probabilities; stacking trains a meta-learner on base model outputs to capture complementary strengths. |
| Confidence Calibration | Raw model scores aren't always true probabilities. Platt scaling (sigmoid) and isotonic regression transform scores so that "70% confident" actually means ~70% accuracy. |
| Cross-Validation | Stratified k-fold CV splits data into k folds while preserving class ratios, giving a lower-variance estimate of generalization performance than a single train/test split. |
| Data Validation | Checking for null values, empty strings, duplicates, and class imbalance before training prevents mysterious NaN losses and silently degraded accuracy, both common sources of ML bugs. |
| Model Persistence | Serializing trained models with joblib enables instant inference at API startup without retraining. Companion metadata (timestamp, hyperparameters) supports experiment tracking and reproducibility. |
| Inference Pipeline | Bundling preprocessing, feature extraction, and prediction into a single pipeline object prevents training-serving skew, a frequent cause of silent ML bugs in production. |
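To make the TF-IDF row concrete, here is a minimal from-scratch sketch using the textbook formula `tf * log(N / df)` with whitespace tokenization. It is not the project's `tfidf_features.py`; scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ. The function name and the corpus are made up for illustration.

```python
import math
from collections import Counter


def tfidf(corpus):
    """Return one {term: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Made-up corpus, for illustration only.
corpus = [
    "the battery life is great",
    "the screen is great",
    "the battery died quickly",
]
scores = tfidf(corpus)
for term, score in sorted(scores[0].items(), key=lambda item: -item[1]):
    print(f"{term:10s} {score:.3f}")
```

In the first document, "the" occurs in every document and scores zero, while "life" occurs only there and scores highest.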
See `docs/CONCEPTS.md` for deeper explanations of each concept.
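One way to check the "70% confident means ~70% accurate" property from the calibration row is a reliability table: group predictions by confidence and compare each group's mean confidence with its accuracy. The sketch below also reports expected calibration error (ECE), a common summary that this README does not itself mention. The function names and sample data are hypothetical, not taken from `src/calibration.py`.

```python
def reliability_table(confidences, correct, n_bins=5):
    """Group predictions into equal-width confidence bins.

    Returns (mean_confidence, accuracy, count) for each non-empty bin.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    table = []
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(h for _, h in members) / len(members)
            table.append((mean_conf, accuracy, len(members)))
    return table


def expected_calibration_error(table):
    """Count-weighted average gap between confidence and accuracy."""
    total = sum(count for _, _, count in table)
    if total == 0:
        return 0.0
    return sum(abs(conf - acc) * count for conf, acc, count in table) / total


# Made-up data: confidence in the predicted class, and whether it was right.
confidences = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6, 0.6]
correct = [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

table = reliability_table(confidences, correct)
ece = expected_calibration_error(table)
for mean_conf, accuracy, count in table:
    print(f"confidence {mean_conf:.2f}  accuracy {accuracy:.2f}  n={count}")
print(f"ECE = {ece:.3f}")
```

Here the 0.6 group is well calibrated (60% correct) while the 0.9 group is overconfident (80% correct), which is the kind of gap Platt scaling or isotonic regression aims to close.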
text-classification/
├── src/
│   ├── __init__.py
│   ├── config.py                       # All configuration constants
│   ├── evaluation.py                   # Metrics, confusion matrix, reporting
│   ├── calibration.py                  # Platt scaling & isotonic regression
│   ├── cross_validation.py             # Stratified k-fold CV
│   ├── learning_curves.py              # Bias-variance diagnostics
│   ├── model_comparison.py             # Side-by-side model benchmarking
│   ├── text_analyzer.py                # Corpus-level text statistics
│   ├── tuning.py                       # Grid search hyperparameter tuning
│   ├── pipeline.py                     # End-to-end inference pipeline
│   ├── persistence.py                  # Model save/load with metadata tracking
│   ├── data/
│   │   ├── dataset_loader.py           # Synthetic data & CSV loaders
│   │   ├── preprocessor.py             # Text cleaning, tokenization, label encoding
│   │   └── augmentor.py                # Synonym replacement, random augmentation
│   ├── features/
│   │   ├── tfidf_features.py           # TF-IDF vectorizer with configurable n-grams
│   │   ├── embedding_features.py       # SVD-based dense word embeddings
│   │   └── feature_combiner.py         # Merge TF-IDF + embedding features
│   ├── models/
│   │   ├── baseline_models.py          # NB, LR, SVM, Random Forest
│   │   ├── model_ensemble.py           # Voting & stacking ensembles
│   │   └── transformer_classifier.py   # DistilBERT fine-tuning
│   └── api/
│       ├── main.py                     # FastAPI application
│       └── models.py                   # Pydantic request/response schemas
├── streamlit_app/
│   ├── app.py                          # Main dashboard (multipage)
│   └── pages/
│       ├── 1_Overview.py               # Dataset stats & class distribution
│       ├── 2_Classify.py               # Live text classification demo
│       ├── 3_Training_Metrics.py       # Model performance charts
│       └── 4_Feature_Analysis.py       # TF-IDF importance & vocabulary stats
├── tests/                              # 70+ tests with pytest
├── docs/
│   ├── CONCEPTS.md                     # Educational deep-dives
│   └── PROJECT_PLAN.md                 # Development roadmap
├── train_distilbert_agnews.py          # Fine-tune DistilBERT on AG News
├── requirements.txt
└── README.md
# 1. Clone and navigate
git clone https://github.com/mohamed-elkholy95/text-classification.git
cd text-classification
# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate # Linux/macOS
# .venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt
# 4. Run tests
python -m pytest tests/ -v
# 5. Launch the Streamlit dashboard
streamlit run streamlit_app/app.py
# 6. (Optional) Start the API server
uvicorn src.api.main:app --host 0.0.0.0 --port 8002
The FastAPI server exposes a REST API for programmatic text classification.
| Method | Endpoint | Description | Request Body | Response |
|---|---|---|---|---|
| POST | `/predict` | Classify a single text | `{"text": "..."}` | `{"label": 1, "confidence": 0.92, ...}` |
| POST | `/predict_batch` | Classify multiple texts | `{"texts": [...]}` | `{"predictions": [...]}` |
| GET | `/health` | Health check | None | `{"status": "ok"}` |
| GET | `/models` | List available models | None | `{"models": ["nb", "lr", "svm", ...]}` |
| GET | `/metrics` | Latest evaluation metrics | None | Metrics dict |
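The endpoints in the table can also be called from Python. The sketch below uses only the standard library and assumes the server is listening on `http://localhost:8002` and accepts the `{"text": "..."}` body shown above. The helper names `build_predict_request` and `classify` are hypothetical and not part of the project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # assumption: port used in the quick start


def build_predict_request(text, base_url=BASE_URL):
    """Build (but do not send) a POST /predict request for one text."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text, base_url=BASE_URL, timeout=10):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request needs no running server, so it is safe to inspect.
request = build_predict_request("This product is absolutely amazing!")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

With the API server running, `classify("This product is absolutely amazing!")` should return a dict shaped like the Response column for `/predict`.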
# Example: classify a review
curl -X POST http://localhost:8002/predict \
-H "Content-Type: application/json" \
-d '{"text": "This product is absolutely amazing!"}'
The interactive dashboard provides four pages:
| Page | Description |
|---|---|
| Overview | Dataset statistics, class distribution, sample texts |
| Classify | Enter any text and see predictions from all models side-by-side |
| Training Metrics | Accuracy, F1, precision/recall curves, ROC curves |
| Feature Analysis | Real TF-IDF importance from Logistic Regression coefficients, word frequencies, vocabulary statistics |
streamlit run streamlit_app/app.py
The dashboard uses a dark theme and `@st.cache_resource` for efficient data loading and model caching.
# Run all tests
python -m pytest tests/ -v
# Run a specific test file
python -m pytest tests/test_evaluation.py -v
# Run with coverage
python -m pytest tests/ --cov=src --cov-report=term-missing
The test suite covers:
- Data pipeline: dataset loading, preprocessing, augmentation
- Feature extraction: TF-IDF, embeddings, feature combining
- Models: baseline classifiers, ensemble, transformer
- Evaluation: metrics (binary + multiclass), confusion matrix, per-class metrics, edge cases
- Advanced: calibration, cross-validation, learning curves
- API: endpoint validation with TestClient
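For a flavor of what metric and edge-case tests in this style might look like, here is a small self-contained sketch. The `per_class_f1` and `macro_f1` helpers are written for this example and are not the functions in `src/evaluation.py`; the test names are hypothetical too. pytest would collect the `test_*` functions as written.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1} computed from per-class TP/FP/FN counts."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Every label occurs in y_true or y_pred, so this is never zero.
        scores[label] = 2 * tp / (2 * tp + fp + fn)
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1; 0.0 when there are no samples."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores) if scores else 0.0


def test_perfect_predictions_score_one():
    assert macro_f1([0, 1, 2], [0, 1, 2]) == 1.0


def test_majority_class_baseline_is_penalized():
    # Always predicting class 0 gives 50% accuracy but a much lower macro F1.
    scores = per_class_f1([0, 0, 1, 1], [0, 0, 0, 0])
    assert scores[1] == 0.0
    assert abs(scores[0] - 2 / 3) < 1e-12


def test_empty_input_returns_zero():
    assert macro_f1([], []) == 0.0
```

The second test shows why accuracy alone can mislead: a model that ignores one class entirely still gets half the samples right.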
Building this project teaches core ML concepts through hands-on implementation:
- Feature Engineering for Text: Why TF-IDF often outperforms raw bag-of-words counts, how n-grams capture local context, and when dense embeddings are worth the cost.
- Model Selection & Comparison: When to use Naive Bayes (fast, good baseline) vs. SVM (strong margins) vs. Logistic Regression (interpretable coefficients) vs. ensemble methods (often the best accuracy, at the cost of complexity).
- Evaluation Beyond Accuracy: Why F1 matters for imbalanced classes, how to read precision-recall curves, and why a confusion matrix reveals failure modes that aggregate metrics hide.
- Confidence Calibration: Why raw model scores are not true probabilities, how Platt scaling and isotonic regression fix this, and when calibration matters (thresholding, risk-sensitive decisions).
- Bias-Variance Diagnosis: How learning curves reveal underfitting (high bias) vs. overfitting (high variance), and how to fix each (more features or a richer model for underfitting; more data or stronger regularization for overfitting).
- Cross-Validation: Why stratified k-fold gives more reliable estimates than a single split, and how to use it for model selection without leaking test data.
- Ensemble Methods: Why combining diverse models reduces variance, how voting and stacking differ, and when ensembles are worth the complexity.
- Production Considerations: API design with FastAPI, request tracing with X-Request-ID headers, Streamlit dashboards for non-technical stakeholders, and how to cache expensive computations.
- Data Quality: Why validating datasets before training (null checks, class balance, duplicate detection) often prevents more production bugs than additional hyperparameter tuning.
- Model Persistence: How to serialize models with joblib, track provenance metadata, and build end-to-end inference pipelines that prevent training-serving skew.
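To illustrate the cross-validation point, the sketch below deals each class's samples round-robin across folds so every held-out fold keeps the overall class ratio. It is a simplified, unshuffled stand-in written for this example, not the project's `cross_validation.py`, and the function names are hypothetical.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign each sample index to one of k folds, preserving class ratios."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        # Deal each class's samples round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


def train_test_splits(labels, k):
    """Yield (train_indices, test_indices), holding out one fold at a time."""
    folds = stratified_folds(labels, k)
    for held_out in range(k):
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, folds[held_out]


# Made-up imbalanced labels with a 2:1 class ratio.
labels = [0] * 8 + [1] * 4
splits = list(train_test_splits(labels, k=4))
for train, test in splits:
    test_labels = [labels[i] for i in test]
    print(len(train), len(test), test_labels.count(0), test_labels.count(1))
```

Each held-out fold contains two samples of class 0 and one of class 1, matching the 2:1 ratio of the full set. A real run would normally shuffle within each class first.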
Mohamed Elkholy - GitHub · melkholy@techmatrix.com