mohamed-elkholy95/text-classification

🏷️ Text Classification System

Multi-model text classification with TF-IDF, word embeddings, transformers, and ensemble methods, built as an educational portfolio project.

Python scikit-learn Streamlit FastAPI Tests


📐 Architecture

┌─────────────────────────────────────────────────────────────────┐
│                  TEXT CLASSIFICATION PIPELINE                   │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌────────────┐    ┌──────────────┐    ┌────────────────────┐   │
│  │ Raw Data   │───▶│  Preprocess  │───▶│  Feature Extract   │   │
│  │ (CSV/      │    │  - Clean     │    │  - TF-IDF          │   │
│  │ Synthetic) │    │  - Tokenize  │    │  - Embeddings      │   │
│  └────────────┘    │  - Encode    │    │  - Combined        │   │
│                    └──────────────┘    └─────────┬──────────┘   │
│                                                  │              │
│                    ┌──────────────┐    ┌─────────▼──────────┐   │
│                    │  Evaluation  │◀───│  Classifiers       │   │
│                    │  - Metrics   │    │  - Naive Bayes     │   │
│                    │  - Curves    │    │  - Logistic Reg    │   │
│                    │  - Calibratn │    │  - SVM             │   │
│                    │  - Report    │    │  - Random Forest   │   │
│                    └──────┬───────┘    │  - Transformer     │   │
│                           │            │  - Ensemble        │   │
│                    ┌──────▼───────┐    └────────────────────┘   │
│                    │  Serving     │                             │
│                    │  - FastAPI   │    ┌────────────────────┐   │
│                    │  - Streamlit │◀───│  Hyperparameter    │   │
│                    └──────────────┘    │  Tuning (Grid)     │   │
│                                        └────────────────────┘   │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │  Cross-Validation │ Data Augmentation │ Model Ensembling  │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

🧠 Key Concepts

| Concept | Description |
| --- | --- |
| TF-IDF | Term Frequency–Inverse Document Frequency converts text into numerical vectors. Words that appear often in one document but rarely across the corpus get high scores, making them discriminative features. |
| Naive Bayes | Applies Bayes' theorem with a "naive" independence assumption between features. Although the assumption rarely holds, it works remarkably well for text classification because word co-occurrence patterns still encode class information. |
| SVM (Support Vector Machine) | Finds the hyperplane that maximizes the margin between classes. Linear SVMs are especially effective for high-dimensional sparse data like TF-IDF vectors. |
| Logistic Regression | Models the probability of class membership using a logistic function. Coefficients can be read as per-feature weights, which helps explain why the model makes each prediction. |
| Ensemble Methods | Combine several base models into a stronger predictor. Voting averages predictions; stacking trains a meta-learner on base model outputs to capture complementary strengths. |
| Confidence Calibration | Raw model scores aren't always true probabilities. Platt scaling (sigmoid) and isotonic regression transform scores so that predictions made with 70% confidence are correct about 70% of the time. |
| Cross-Validation | Stratified k-fold CV splits data into k folds while preserving class ratios, giving a lower-variance estimate of generalization performance than a single train/test split. |
| Data Validation | Checking for null values, empty strings, duplicates, and class imbalance before training prevents mysterious NaN losses and silently degraded accuracy, both common sources of ML bugs. |
| Model Persistence | Serializing trained models with joblib enables instant inference at API startup without retraining. Companion metadata (timestamp, hyperparameters) supports experiment tracking and reproducibility. |
| Inference Pipeline | Bundling preprocessing, feature extraction, and prediction into a single pipeline object prevents training-serving skew, a leading cause of silent ML bugs in production. |
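
To make the TF-IDF, Logistic Regression, and inference-pipeline rows concrete, here is a minimal sketch using scikit-learn's `Pipeline`. The toy reviews, the labels, and the name `text_clf` are made up for illustration; this is not the code in `src/pipeline.py`, only one plausible shape for it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only for illustration.
train_texts = [
    "great product, works perfectly",
    "absolutely love it, highly recommend",
    "excellent quality and fast delivery",
    "terrible quality, broke after a day",
    "awful experience, would not recommend",
    "waste of money, very disappointed",
]
train_labels = ["positive"] * 3 + ["negative"] * 3

# One object holds vectorizer and classifier, so training and serving
# always apply exactly the same text-to-feature transformation.
text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigrams + bigrams
    ("clf", LogisticRegression(max_iter=1000)),
])
text_clf.fit(train_texts, train_labels)

query = ["great product, love it"]
prediction = text_clf.predict(query)[0]
probabilities = dict(zip(text_clf.classes_, text_clf.predict_proba(query)[0]))
print(prediction, probabilities)
```

Because the vectorizer is fitted inside the pipeline, raw strings go in at prediction time and there is no separate vocabulary to keep in sync.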

📖 See docs/CONCEPTS.md for deeper explanations of each concept.
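
As a rough illustration of the calibration and cross-validation concepts, the sketch below wraps a linear SVM in sigmoid (Platt) calibration and scores it with stratified 3-fold CV, all via scikit-learn. The corpus is invented and the fold counts are arbitrary assumptions; the repository's `src/calibration.py` and `src/cross_validation.py` may be organized differently.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical corpus: 6 texts per class so every fold contains both classes.
texts = [
    "great phone, love the battery",
    "excellent camera and great screen",
    "love this laptop, fast and reliable",
    "fantastic sound, great value",
    "works perfectly, very happy",
    "amazing quality, highly recommend",
    "terrible phone, battery died",
    "awful camera and broken screen",
    "hate this laptop, slow and unreliable",
    "poor sound, bad value",
    "stopped working, very unhappy",
    "horrible quality, do not recommend",
]
labels = [1] * 6 + [0] * 6

# LinearSVC only outputs margins; sigmoid calibration maps them to probabilities.
calibrated_svm = make_pipeline(
    TfidfVectorizer(),
    CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3),
)

# Stratified folds keep the 50/50 class ratio in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
scores = cross_val_score(calibrated_svm, texts, labels, cv=cv, scoring="accuracy")
print("fold accuracies:", scores)

calibrated_svm.fit(texts, labels)
proba = calibrated_svm.predict_proba(["love it, great battery"])[0]
print("calibrated class probabilities:", proba)
```

Swapping `method="sigmoid"` for `method="isotonic"` selects isotonic regression, which usually needs considerably more data than this toy set to behave well.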

📁 Project Structure

text-classification/
├── src/
│   ├── __init__.py
│   ├── config.py                    # All configuration constants
│   ├── evaluation.py                # Metrics, confusion matrix, reporting
│   ├── calibration.py               # Platt scaling & isotonic regression
│   ├── cross_validation.py          # Stratified k-fold CV
│   ├── learning_curves.py           # Bias-variance diagnostics
│   ├── model_comparison.py          # Side-by-side model benchmarking
│   ├── text_analyzer.py             # Corpus-level text statistics
│   ├── tuning.py                    # Grid search hyperparameter tuning
│   ├── pipeline.py                  # End-to-end inference pipeline
│   ├── persistence.py               # Model save/load with metadata tracking
│   ├── data/
│   │   ├── dataset_loader.py        # Synthetic data & CSV loaders
│   │   ├── preprocessor.py          # Text cleaning, tokenization, label encoding
│   │   └── augmentor.py             # Synonym replacement, random augmentation
│   ├── features/
│   │   ├── tfidf_features.py        # TF-IDF vectorizer with configurable n-grams
│   │   ├── embedding_features.py    # SVD-based dense word embeddings
│   │   └── feature_combiner.py      # Merge TF-IDF + embedding features
│   ├── models/
│   │   ├── baseline_models.py       # NB, LR, SVM, Random Forest
│   │   ├── model_ensemble.py        # Voting & stacking ensembles
│   │   └── transformer_classifier.py # DistilBERT fine-tuning
│   └── api/
│       ├── main.py                  # FastAPI application
│       └── models.py                # Pydantic request/response schemas
├── streamlit_app/
│   ├── app.py                       # Main dashboard (multipage)
│   └── pages/
│       ├── 1_📊_Overview.py          # Dataset stats & class distribution
│       ├── 2_💬_Classify.py          # Live text classification demo
│       ├── 3_📈_Training_Metrics.py  # Model performance charts
│       └── 4_🔬_Feature_Analysis.py  # TF-IDF importance & vocabulary stats
├── tests/                           # 70+ tests with pytest
├── docs/
│   ├── CONCEPTS.md                  # Educational deep-dives
│   └── PROJECT_PLAN.md              # Development roadmap
├── train_distilbert_agnews.py       # Fine-tune DistilBERT on AG News
├── requirements.txt
└── README.md

🚀 Quick Start

# 1. Clone and navigate
git clone https://github.com/mohamed-elkholy95/text-classification.git
cd text-classification

# 2. Create virtual environment
python -m venv .venv
source .venv/bin/activate    # Linux/macOS
# .venv\Scripts\activate     # Windows

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run tests
python -m pytest tests/ -v

# 5. Launch the Streamlit dashboard
streamlit run streamlit_app/app.py

# 6. (Optional) Start the API server
uvicorn src.api.main:app --host 0.0.0.0 --port 8002

🔌 API Endpoints

The FastAPI server exposes a REST API for programmatic text classification.

| Method | Endpoint | Description | Request Body | Response |
| --- | --- | --- | --- | --- |
| POST | /predict | Classify a single text | {"text": "..."} | {"label": 1, "confidence": 0.92, ...} |
| POST | /predict_batch | Classify multiple texts | {"texts": [...]} | {"predictions": [...]} |
| GET | /health | Health check | (none) | {"status": "ok"} |
| GET | /models | List available models | (none) | {"models": ["nb", "lr", "svm", ...]} |
| GET | /metrics | Latest evaluation metrics | (none) | Metrics dict |

# Example: classify a review
curl -X POST http://localhost:8002/predict \
  -H "Content-Type: application/json" \
  -d '{"text": "This product is absolutely amazing!"}'
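
The same call can be made from Python. The sketch below uses only the standard library and assumes the server from the Quick Start is listening on port 8002; the helper names `build_predict_request` and `classify` are hypothetical, and any response fields beyond `label` and `confidence` are not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8002"  # port taken from the uvicorn command in Quick Start


def build_predict_request(text: str, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST /predict request carrying {"text": ...} as a JSON body."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def classify(text: str, base_url: str = BASE_URL, timeout: float = 10.0) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    request = build_predict_request(text, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# With the API server running:
#   result = classify("This product is absolutely amazing!")
#   print(result.get("label"), result.get("confidence"))
```

Separating request construction from the network call keeps the payload logic testable without a live server.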

📊 Streamlit Dashboard

The interactive dashboard provides four pages:

| Page | Description |
| --- | --- |
| 📊 Overview | Dataset statistics, class distribution, sample texts |
| 💬 Classify | Enter any text and see predictions from all models side by side |
| 📈 Training Metrics | Accuracy, F1, precision/recall curves, ROC curves |
| 🔬 Feature Analysis | Real TF-IDF importance from Logistic Regression coefficients, word frequencies, vocabulary statistics |

streamlit run streamlit_app/app.py

The dashboard uses a dark theme and @st.cache_resource for efficient data loading and model caching.

✅ Testing

# Run all tests
python -m pytest tests/ -v

# Run a specific test file
python -m pytest tests/test_evaluation.py -v

# Run with coverage
python -m pytest tests/ --cov=src --cov-report=term-missing

The test suite covers:

  • Data pipeline: dataset loading, preprocessing, augmentation
  • Feature extraction: TF-IDF, embeddings, feature combining
  • Models: baseline classifiers, ensemble, transformer
  • Evaluation: metrics (binary + multiclass), confusion matrix, per-class metrics, edge cases
  • Advanced: calibration, cross-validation, learning curves
  • API: endpoint validation with TestClient

📚 What You'll Learn

Building this project teaches core ML concepts through hands-on implementation:

  1. Feature Engineering for Text: Why TF-IDF often outperforms raw bag-of-words, how n-grams capture context, and when dense embeddings are worth the cost.

  2. Model Selection & Comparison: When to use Naive Bayes (fast, good baselines) vs. SVM (strong margins) vs. Logistic Regression (interpretable coefficients) vs. ensemble methods (often the highest accuracy).

  3. Evaluation Beyond Accuracy: Why F1 matters for imbalanced classes, how to read precision-recall curves, and why a confusion matrix reveals failure modes that aggregate metrics hide.

  4. Confidence Calibration: Why raw model scores are not true probabilities, how Platt scaling and isotonic regression fix this, and when calibration matters (thresholding, risk-sensitive decisions).

  5. Bias-Variance Diagnosis: How learning curves reveal underfitting (high bias) vs. overfitting (high variance), and how to fix each (more features vs. more data/regularization).

  6. Cross-Validation: Why stratified k-fold gives more reliable estimates than a single split, and how to use it for model selection without leaking test data.

  7. Ensemble Methods: Why combining diverse models reduces variance, how voting vs. stacking differ, and when ensembles are worth the complexity.

  8. Production Considerations: API design with FastAPI, request tracing with X-Request-ID headers, Streamlit dashboards for non-technical stakeholders, and how to cache expensive computations.

  9. Data Quality: Why validating datasets before training (null checks, class balance, duplicate detection) can prevent more production bugs than extra hyperparameter tuning.

  10. Model Persistence: How to serialize models with joblib, track provenance metadata, and build end-to-end inference pipelines that prevent training-serving skew.
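
The persistence idea in item 10 can be sketched as a joblib file plus a JSON sidecar holding the metadata. The function names `save_model` and `load_model`, the file layout, and the metadata fields are assumptions for illustration, not the interface of `src/persistence.py`.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


def save_model(model, directory, name, hyperparameters):
    """Write <name>.joblib plus a <name>.json metadata sidecar (hypothetical layout)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    model_path = directory / f"{name}.joblib"
    joblib.dump(model, model_path)
    metadata = {
        "name": name,
        "saved_at": datetime.now(timezone.utc).isoformat(),
        "hyperparameters": hyperparameters,
    }
    (directory / f"{name}.json").write_text(json.dumps(metadata, indent=2))
    return model_path


def load_model(directory, name):
    """Load the model and its metadata. Only load joblib files you trust."""
    directory = Path(directory)
    model = joblib.load(directory / f"{name}.joblib")
    metadata = json.loads((directory / f"{name}.json").read_text())
    return model, metadata


# Round trip with a tiny made-up pipeline.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(C=1.0, max_iter=1000))
pipeline.fit(["good", "great", "bad", "awful"], ["pos", "pos", "neg", "neg"])

with tempfile.TemporaryDirectory() as tmp:
    save_model(pipeline, tmp, "lr_tfidf", {"C": 1.0, "max_iter": 1000})
    restored, meta = load_model(tmp, "lr_tfidf")

print(meta["saved_at"], restored.predict(["great"]))
```

Saving the whole pipeline, rather than the classifier alone, means the fitted vectorizer travels with the model, which is what guards against training-serving skew.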

Author

Mohamed Elkholy: GitHub · melkholy@techmatrix.com
